\section{Parabix on AVX}

In this section, we discuss the scalability and performance advantages
of our 256-bit AVX (Advanced Vector Extensions) Parabix-XML port.  The
Parabix runtime libraries originally targeted the 128-bit SSE2 SIMD
technology, available on all modern 64-bit Intel and AMD processors.
The runtime has recently been ported to AVX, which is commercially
available on the latest \SB\ microarchitecture Intel
processors. Although the runtime had to be ported to
the new ISA, no modifications were made to the application.

\subsection{3-Operand Form}
In addition to widening the 128-bit operations to 256 bits,
AVX technology uses a nondestructive 3-operand instruction
format. Previous SSE implementations used a destructive 2-operand
instruction format, in which a single register serves
as both a source and the destination. As such, when the values of
both source operands $a$ and $b$ are still needed after a 2-operand
instruction, the code must either copy one register
value beforehand, or reconstitute or reload it
afterwards.  With the 3-operand format, output
may be directed to a third register independently of the source
operands. Avoiding the need to
copy or reconstitute operand values yields a considerable reduction
in the instructions required for unloading from and loading into
registers.  AVX technology makes the 3-operand form available for both
the new 256-bit operations and the base 128-bit SSE operations.
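
The difference between the two formats can be illustrated with a small sketch in C using SSE2 intrinsics. The function names below are hypothetical and are not taken from the Parabix runtime, and the instruction sequences in the comment are only what a compiler might plausibly emit; actual output depends on the compiler and its register allocation.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86-64 compiler */
#include <stdint.h>

/* Hypothetical helper: computes (a AND b) XOR (a + b) on 64-bit lanes.
 * Both a and b are needed by two different operations, so the first
 * operation must not overwrite either of them.
 *
 * With the destructive 2-operand SSE encoding, a compiler may emit
 * something like (a in xmm0, b in xmm1):
 *     movdqa xmm2, xmm0        ; extra copy to preserve a
 *     pand   xmm2, xmm1        ; xmm2 = a & b
 *     paddq  xmm0, xmm1        ; xmm0 = a + b
 *     pxor   xmm0, xmm2
 * With the 3-operand AVX encoding (same source built with -mavx), the
 * copy is unnecessary:
 *     vpand  xmm2, xmm0, xmm1
 *     vpaddq xmm0, xmm0, xmm1
 *     vpxor  xmm0, xmm0, xmm2
 */
static __m128i and_xor_sum(__m128i a, __m128i b)
{
    __m128i t = _mm_and_si128(a, b);   /* a and b both stay live */
    __m128i u = _mm_add_epi64(a, b);
    return _mm_xor_si128(t, u);
}

/* Hypothetical scalar wrapper: applies the helper to values placed in
 * the low 64-bit lane and returns the low lane of the result. */
static uint64_t and_xor_sum_low64(uint64_t a, uint64_t b)
{
    uint64_t out[2];
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    _mm_storeu_si128((__m128i *)out, and_xor_sum(va, vb));
    return out[0];
}

/* Small self-check: (6 & 3) ^ (6 + 3) = 2 ^ 9 = 11. */
static void self_check_3operand(void)
{
    assert(and_xor_sum_low64(6, 3) == 11);
}
```

The source code is identical in both cases; only the instruction encoding chosen by the compiler changes, which is consistent with recompilation alone being enough to benefit from the 3-operand form.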

\subsection{256-bit Operations}
With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations as opposed to integer operations.  256-bit SIMD
is available for loads, stores, bitwise logic, and floating-point operations,
whereas SIMD integer operations and shifts are available only in
128-bit form.
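
Where an operation exists only in 128-bit form, a 256-bit operation can be emulated with a pair of 128-bit operations. The following is a minimal sketch of that idea, not the Parabix runtime itself: the type \texttt{bits256} and all function names are hypothetical, and the two halves are kept in a plain struct so that the example needs only SSE2. An AVX build would presumably extract the halves from a 256-bit register instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86-64 compiler */
#include <stdint.h>

/* Hypothetical 256-bit value represented as two 128-bit halves. */
typedef struct {
    __m128i lo;  /* 64-bit lanes 0 and 1 */
    __m128i hi;  /* 64-bit lanes 2 and 3 */
} bits256;

/* Load four 64-bit words into the two halves (unaligned loads). */
static bits256 load256(const uint64_t w[4])
{
    bits256 r;
    r.lo = _mm_loadu_si128((const __m128i *)&w[0]);
    r.hi = _mm_loadu_si128((const __m128i *)&w[2]);
    return r;
}

/* Store the two halves back into four 64-bit words. */
static void store256(uint64_t w[4], bits256 v)
{
    _mm_storeu_si128((__m128i *)&w[0], v.lo);
    _mm_storeu_si128((__m128i *)&w[2], v.hi);
}

/* Lane-wise 64-bit integer add: one 128-bit add per half. */
static bits256 add64x4(bits256 a, bits256 b)
{
    bits256 r;
    r.lo = _mm_add_epi64(a.lo, b.lo);
    r.hi = _mm_add_epi64(a.hi, b.hi);
    return r;
}

/* Lane-wise 64-bit left shift by n bits: one 128-bit shift per half. */
static bits256 sll64x4(bits256 a, int n)
{
    __m128i count = _mm_cvtsi32_si128(n);
    bits256 r;
    r.lo = _mm_sll_epi64(a.lo, count);
    r.hi = _mm_sll_epi64(a.hi, count);
    return r;
}
```

Each emulated operation costs two 128-bit instructions, so operations of this kind would not be expected to show the ideal halving of the instruction count.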

\subsection{Performance Results}

We implemented two versions of Parabix-XML using AVX technology.  The
first was simply a recompilation of the existing Parabix-XML source code
to take advantage of the 3-operand form of AVX instructions
while retaining a uniform 128-bit SIMD processing width.  The second
involved rewriting the Parabix runtime library to
leverage the 256-bit AVX instructions wherever possible and to simulate
the remaining operations using pairs of 128-bit operations. Figure
\ref{insmix} shows the reduction in instruction counts achieved in
these two versions. For each workload, the base instruction count of
the Parabix binary compiled in 2-operand SSE-only mode is labeled ``sse,''
the version that takes advantage only of the AVX 3-operand mode is
labeled ``128-bit avx,'' and the version that uses the 256-bit
operations wherever possible is labeled ``256-bit avx.''  The
instruction counts are divided into three classes.  The ``non-SIMD''
class comprises the general-purpose instructions.  The ``bitwise SIMD''
class comprises the bitwise logic operations that are available in
both 128-bit and 256-bit form; it excludes bitwise shifts, which are
available only in 128-bit form.  The ``other SIMD'' class
comprises all other SIMD operations, primarily the integer
SIMD operations that are available only at 128-bit widths even under
AVX.
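
\begin{figure}
% Graphic not present in this copy of the source.
\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure}

\begin{figure}
% Graphic not present in this copy of the source.
\caption{Parabix Performance (y-axis: ns per kB)}
\label{avx}
\end{figure}
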
Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As expected,
the number of bitwise SIMD operations remains the same
for both SSE and 128-bit AVX while dropping dramatically when operating
on 256 bits at a time. The reduction was measured at 32\%--39\%, depending
on the markup density of the workload. The ``other SIMD'' class
shows a substantial 30\%--35\% reduction with 128-bit AVX technology
compared to SSE. This reduction is due to the elimination of register
unloading and reloading when SIMD operations are compiled using
the 3-operand AVX form rather than the 2-operand SSE form.  A further 10\%--20\%
reduction is observed when Parabix-XML uses the AVX runtime library.

%[AS] Check numbers.
The reductions in instruction counts with the AVX extensions in Parabix
are quite dramatic, demonstrating the ability of our runtime
framework to exploit the available hardware resources. As shown in
Figure \ref{avx}, the benefits of the reduced SIMD instruction count
are achieved only in the 128-bit AVX version.  In this case, the
benefits of the 3-operand form seem to translate fully into performance
benefits.  Based on the reduction in overall bitwise SIMD instructions,
we expected an 11\% improvement in performance.  Instead, perhaps
surprisingly, the performance of Parabix in the 256-bit AVX
implementation does not improve significantly and actually degrades
for files with higher markup density ($\sim$11\%). For dew.xml, on
which bitwise SIMD instructions were reduced by 39\%, performance
improved by 8\%.  We believe that this is primarily due to the
intricacies of the first-generation AVX implementation in \SB{}, in which
many of the 256-bit instructions have significantly higher latency
than their 128-bit counterparts. The 256-bit instructions also have
different scheduling constraints that seem to reduce overall
throughput.  If these latency issues can be addressed in future AVX
implementations, further performance and energy benefits
could be realized in Parabix-XML.
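
An expectation of this kind can be sketched with a first-order estimate. The model below is a simplifying assumption introduced here for illustration and is not the procedure used to obtain the figures above: it supposes that every instruction contributes equally to execution time. Let $f_b$ denote the fraction of all executed instructions that fall in the bitwise SIMD class, and $r_b$ the fractional reduction of that class in the 256-bit version. The expected overall improvement is then
\[
  \Delta_{\mathrm{expected}} \approx f_b \, r_b .
\]
For example, purely hypothetical values of $f_b = 0.30$ and $r_b = 0.35$ would give $\Delta_{\mathrm{expected}} \approx 0.105$, or roughly a 10\% improvement. Such an estimate ignores instruction latency and scheduling, which is precisely where the measured behaviour of the 256-bit version departs from it.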