% Source: docs/HPCA2011/07-avx.tex @ 1320 (last changed in revision 1302, checked in by lindanl)
\section{Scaling Parabix2 for AVX}

In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) port of Parabix2.
Parabix2 originally targeted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors, but
has recently been ported to AVX. AVX technology is commercially available on the
latest Intel processors based on the \SB\ microarchitecture.

\begin{figure*}
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix2 Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix2 Performance (y-axis: CPU Cycles per kB)}
\label{avx}
\end{figure}

\subsection{Three Operand Form}

In addition to widening 128-bit operations to 256 bits, AVX technology
uses a nondestructive 3-operand instruction format. Previous SSE implementations
used a destructive 2-operand instruction format, in which
a single register serves as both a source and
the destination. For example, $a = a~\texttt{[op]}~b$.
As such, code that still requires the original values of both $a$ and $b$ after a 2-operand instruction
must either copy a register value beforehand or reconstitute or reload it
afterwards.
With the 3-operand format, output may be directed to a third register independently
of the source operands. For example, $c = a~\texttt{[op]}~b$.
By avoiding the copying or reconstituting of operand values, a considerable
reduction in instruction count, in the form of fewer load and store instructions, is possible.
AVX technology makes the 3-operand form available both for the new 256-bit
operations and for the base 128-bit SSE operations.
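
As an illustration only (the registers $t$, $c$ and $d$ and the choice of operations are ours, not taken from the Parabix2 code), suppose that both $a \wedge b$ and $a \vee b$ are needed. A sketch of the two instruction sequences is:
\[
\begin{array}{ll}
\mbox{2-operand:} & t \leftarrow a; \quad t \leftarrow t \wedge b; \quad a \leftarrow a \vee b \qquad \mbox{(3 instructions)} \\
\mbox{3-operand:} & c \leftarrow a \wedge b; \quad d \leftarrow a \vee b \qquad \mbox{(2 instructions)}
\end{array}
\]
In the 2-operand sequence, the copy into $t$ exists only to preserve $a$ for the second operation; the 3-operand sequence needs no such copy.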
\subsection{256-bit AVX Operations}

With the introduction of 256-bit SIMD registers, and under ideal conditions, one would anticipate a corresponding
50\% reduction in the SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point operations
rather than integer operations.
256-bit SIMD is available for loads, stores, bitwise logic and
floating-point operations, whereas SIMD integer operations and shifts are
available only in 128-bit form.  Nevertheless, because loads, stores
and bitwise logic comprise a major portion of the Parabix2
SIMD instruction mix, a substantial reduction in instruction count
and a consequent performance improvement were anticipated. As discussed in the performance results,
the instruction count did fall, but the expected performance improvement was not achieved.
\subsection{Performance Results}

We implemented two versions of Parabix2 using AVX technology.  The first
was simply a recompilation of the existing Parabix2 source code
to take advantage of the 3-operand form of AVX instructions while retaining
a uniform 128-bit SIMD processing width.  The second involved rewriting the
core library functions of Parabix2 to leverage the 256-bit AVX operations wherever
possible and to simulate the remaining operations using pairs of 128-bit
operations.
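
The split between native and simulated operations in the second version can be sketched in portable C. This is a minimal model under our own assumptions, not the Parabix2 library: the types \texttt{v128} and \texttt{v256} and all function names are hypothetical, and plain 64-bit integers stand in for SIMD registers. Bitwise AND is modeled as a single full-width operation, while a shift of each 64-bit field, which has no 256-bit form on \SB{}, is built from a pair of 128-bit shifts.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical model types: these are NOT the Parabix2 library types. */
typedef struct { uint64_t w[2]; } v128;  /* stands in for a 128-bit register */
typedef struct { v128 lo, hi; } v256;    /* a 256-bit register as two halves */

/* Models one 128-bit shift of each 64-bit field (requires n < 64). */
static inline v128 v128_slli64(v128 x, unsigned n) {
    v128 r;
    r.w[0] = x.w[0] << n;
    r.w[1] = x.w[1] << n;
    return r;
}

/* Models bitwise AND, which has a native 256-bit form:
   conceptually ONE operation on the whole register. */
static inline v256 v256_and(v256 a, v256 b) {
    v256 r;
    r.lo.w[0] = a.lo.w[0] & b.lo.w[0];
    r.lo.w[1] = a.lo.w[1] & b.lo.w[1];
    r.hi.w[0] = a.hi.w[0] & b.hi.w[0];
    r.hi.w[1] = a.hi.w[1] & b.hi.w[1];
    return r;
}

/* Models a 256-bit shift, which has no native 256-bit form here:
   simulated with a PAIR of 128-bit shifts, one per half. */
static inline v256 v256_slli64(v256 x, unsigned n) {
    v256 r;
    r.lo = v128_slli64(x.lo, n);
    r.hi = v128_slli64(x.hi, n);
    return r;
}
\end{verbatim}

In this model, a call to \texttt{v256\_and} counts as one instruction, whereas a call to \texttt{v256\_slli64} counts as two, so only the operations with a native 256-bit form can approach a halved instruction count.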

Figure~\ref{insmix} shows the reduction in instruction
counts achieved in these two versions.   For each workload, the
base instruction count of the Parabix2 binary compiled in SSE-only
mode is labeled ``sse,'' the version obtained by
simple recompilation with AVX mode enabled is labeled ``128-bit avx,''
and the version reimplemented to use 256-bit operations wherever
possible is labeled ``256-bit avx.''    The instruction counts
are divided into three classes.  The ``non-SIMD'' operations
are the general-purpose instructions that use neither SSE nor
AVX technology.   The ``bitwise SIMD'' class comprises
the bitwise logic operations, which are available in both 128-bit
and 256-bit forms.  The ``other SIMD'' class comprises
all other SIMD operations, primarily the integer SIMD
operations that are available only at 128-bit widths even with
256-bit AVX technology.

Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As may be
expected, however, the number of ``bitwise SIMD'' operations
remains the same for both SSE and 128-bit AVX while dropping
dramatically when operating on 256 bits at a time.   Ideally,
one may expect up to a 50\% reduction in these instructions versus
the 128-bit AVX version.  The actual reduction measured was 32\%--39\%,
depending on workload.   Because some bitwise logic is needed
to implement the simulated 256-bit operations, the full 50\%
reduction in bitwise logic was not achieved.
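
To make the shortfall concrete, the following back-of-the-envelope sketch introduces two symbols of our own: $B$, the bitwise SIMD instruction count of the 128-bit AVX version, and $E$, the extra bitwise instructions spent simulating 256-bit operations. Assuming, as a simplification, that every original bitwise operation is halved and that $E$ is the only overhead, the 256-bit count is $B/2 + E$ and the fractional reduction $r$ is
\[
r = 1 - \frac{B/2 + E}{B} = \frac{1}{2} - \frac{E}{B}.
\]
Under this assumption, the measured $r$ of 0.32--0.39 would correspond to $E/B$ between roughly 0.11 and 0.18, that is, a simulation overhead of about 11\%--18\% of the original bitwise instruction count.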

The ``other SIMD'' class shows a substantial 30\%--35\% reduction
with 128-bit AVX technology compared to SSE.  This reduction is
due to the elimination of copies or reloads when SIMD operations
are compiled using the 3-operand AVX form rather than the 2-operand SSE form.
A further 10\%--20\% reduction is observed with the Parabix2 version
rewritten to use 256-bit operations.

While the successive reductions in SIMD instruction counts are quite
dramatic with the two AVX implementations of Parabix2, the performance
benefits are another story.   As shown in Figure~\ref{avx}, the
benefits of the reduced SIMD instruction count are achieved only
in the 128-bit AVX version.  In this case, the benefits of the 3-operand
form seem to translate fully into performance benefits.
Based on the reduction in overall bitwise SIMD instructions, we expected an 11\% improvement in performance.
Instead, perhaps bizarrely, the performance of Parabix2 in the 256-bit AVX implementation
does not improve significantly and actually degrades for files with
higher markup density (average 10\%). Dewiki.xml, on which bitwise SIMD instructions were reduced by 39\%, saw a performance improvement of 8\%.
We believe that this is primarily due to the intricacies of the first-generation AVX implementation in \SB{},
with significant latency in many of the 256-bit instructions in comparison to their
128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput.   If these latency issues can be addressed
in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix2.