% Source: docs/HPCA2012/07-avx.tex @ 1407 (last changed in revision 1407 by ashriram: minor bug fixes)
\section{Scaling Parabix for AVX}
\label{section:avx}
In this section, we discuss the scalability and performance advantages
of our 256-bit AVX (Advanced Vector Extensions) port of Parabix XML.  The
Parabix SIMD libraries originally targeted the 128-bit SSE2 SIMD
technology available on all modern 64-bit Intel and AMD processors, but
have recently been ported to AVX. AVX technology is commercially
available on the latest Intel processors based on the
\SB\ microarchitecture. While we had to port our runtime framework, the
application did not need to be modified.

\paragraph{3-Operand Form}
In addition to the widening of 128-bit operations to 256-bit
operations, AVX technology uses a nondestructive 3-operand instruction
format. Previous SSE implementations used a destructive 2-operand
instruction format, in which a single register serves as both a source
and the destination. For example, $a = a~\texttt{[op]}~b$.  As such,
code that still needs the values of both $a$ and $b$ after a 2-operand
instruction must either copy a register value beforehand, or
reconstitute or reload it afterwards.  With the 3-operand format, output
may be directed to a third register independently of the source
operands. For example, $c = a~\texttt{[op]}~b$.  Avoiding the
copying or reconstituting of operand values yields a considerable
reduction in the instructions required for unloading from and loading
into registers.  AVX technology makes the 3-operand form available for
both the new 256-bit operations and the base 128-bit SSE operations.
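
The effect of the two instruction formats on instruction counts can be illustrated with a small counting model. The sketch below is in Python and is not part of Parabix; the function name \texttt{instruction\_counts} and the tuple encoding are hypothetical. It assumes single-assignment operations of the form $\mathit{dst} = a~\texttt{[op]}~b$, charges one extra register copy in 2-operand form whenever the first source is read again later, and ignores commutativity, register pressure and values that are live after the sequence.

```python
def instruction_counts(ops):
    """Return (three_operand, two_operand) instruction counts.

    ops is a list of (dst, src_a, src_b) tuples in single-assignment
    form, each meaning dst = src_a [op] src_b. This is a simplified
    counting model, not a real register allocator.
    """
    three_operand = len(ops)  # one instruction per operation
    two_operand = 0
    for i, (dst, src_a, src_b) in enumerate(ops):
        # In destructive form the result overwrites src_a's register,
        # so src_a must be copied first if a later operation reads it.
        read_later = any(src_a in (a, b) for _, a, b in ops[i + 1:])
        if read_later:
            two_operand += 1  # extra register copy to preserve src_a
        two_operand += 1      # the operation itself
    return three_operand, two_operand


# Hypothetical sequence: t1 = a op b; t2 = a op c; t3 = t1 op t2
example = [("t1", "a", "b"), ("t2", "a", "c"), ("t3", "t1", "t2")]
counts = instruction_counts(example)
```

In this example only the first operation needs a copy, because $a$ is read again by the second operation; the 3-operand form needs no copies at all.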

\subsection{256-bit Operations}
With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations rather than integer operations.  256-bit SIMD
is available for loads, stores, bitwise logic and floating-point
operations, whereas SIMD integer operations and shifts are available
only in 128-bit form.
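
An operation that exists only in 128-bit form can still be applied to a 256-bit value by processing its two 128-bit halves separately, whereas bitwise logic needs only one full-width operation. The following Python sketch models this with arbitrary-precision integers. It is an illustration only, not Parabix library code: the function names and the choice of 64-bit lanes are assumptions, and inputs are assumed to be nonnegative integers below $2^{256}$.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def add_2x64(x, y):
    """Model of a 128-bit SIMD add on two independent 64-bit lanes."""
    lo = ((x & MASK64) + (y & MASK64)) & MASK64  # carry out is discarded
    hi = ((x >> 64) + (y >> 64)) & MASK64
    return (hi << 64) | lo


def and_256(x, y):
    """Bitwise logic has no lane structure: one 256-bit operation."""
    return x & y


def add_4x64(x, y):
    """Model of a 256-bit add on four 64-bit lanes, simulated with a
    pair of 128-bit adds (one per half)."""
    lo = add_2x64(x & MASK128, y & MASK128)  # lower 128-bit half
    hi = add_2x64(x >> 128, y >> 128)        # upper 128-bit half
    return (hi << 128) | lo
```

In this model the simulated integer add costs two 128-bit operations plus the work of splitting and recombining the halves, so widening the registers does not reduce the instruction count for such operations.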


\subsection{Performance Results}

We implemented two versions of Parabix using AVX technology.  The
first was simply a recompilation of the existing Parabix source code
to take advantage of the 3-operand form of AVX instructions
while retaining a uniform 128-bit SIMD processing width.  The second
involved rewriting the internal library functions of Parabix to
leverage the 256-bit AVX operations wherever possible and to simulate
the remaining operations using pairs of 128-bit operations. Figure
\ref{insmix} shows the reduction in instruction counts achieved in
these two versions.  For each workload, the base instruction count of
the Parabix binary compiled in SSE-only mode is indicated by ``sse,''
the version which only takes advantage of the AVX 3-operand mode is
labeled ``128-bit avx,'' and the version reimplemented to use 256-bit
operations wherever possible is labeled ``256-bit avx.''  The
instruction counts are divided into three classes.  The ``non-SIMD''
class comprises the general purpose instructions.  The ``bitwise SIMD''
class comprises the bitwise logic operations that are available in
both 128-bit and 256-bit form.  The ``other SIMD'' class
comprises all other SIMD operations, primarily the integer
SIMD operations that are available only at 128-bit widths even under
AVX.


\begin{figure*}[!h]
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}

\begin{figure}[!h]
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix Performance (y-axis: ns per kB)}
\label{avx}
\end{figure}

Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As may be
expected, the number of ``bitwise SIMD'' operations remains the same
for both SSE and 128-bit AVX, while dropping dramatically when operating
on 256 bits at a time.  The measured reduction was 32\%--39\%, depending
on workload, because some of the bitwise logic needed in the
implementation is still composed of 128-bit operations. This limits the
performance gains achieved when using the AVX instructions.  The
``other SIMD'' class shows a substantial 30\%--35\% reduction with AVX
128-bit technology compared to SSE.  This reduction is due to the
elimination of register unloading and reloading when SIMD operations
are compiled using the 3-operand AVX form rather than the 2-operand SSE
form.  A further 10\%--20\% reduction is observed with the Parabix
version rewritten to use 256-bit operations.

%[AS] Check numbers.
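
As a rough consistency check on the bitwise SIMD figures reported above, consider a simplified model (an assumption made here for illustration, not a measurement). Suppose a fraction $p$ of the $N$ 128-bit bitwise SIMD instructions can be replaced by half as many 256-bit instructions, while the remaining fraction $1-p$ is unchanged. The new count $N'$ and the relative reduction $r$ are then
\[
N' = \frac{p}{2}\,N + (1-p)\,N = \left(1 - \frac{p}{2}\right) N,
\qquad
r = \frac{N - N'}{N} = \frac{p}{2}.
\]
Under this model, the ideal case $p = 1$ gives the 50\% reduction anticipated for fully widened code, and a reduction $r$ of 32\%--39\% would correspond to $p \approx 0.64$--$0.78$.
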
The reductions in instruction counts with the AVX extensions in Parabix
are quite dramatic, demonstrating the ability of our runtime
framework to exploit the available hardware resources. As shown in
Figure \ref{avx}, however, the benefits of the reduced SIMD instruction
count are realized only in the AVX 128-bit version.  In this case, the
benefits of the 3-operand form seem to translate fully into performance
benefits.  Based on the reduction in overall bitwise SIMD instructions,
we expected an 11\% improvement in performance.  Instead, perhaps
bizarrely, the performance of Parabix in the 256-bit AVX
implementation does not improve significantly and actually degrades
for files with higher markup density (average 11\%). For dew.xml, on
which bitwise SIMD instructions were reduced by 39\%, performance
improved by 8\%.  We believe that this is primarily due to the
intricacies of the first generation AVX implementation in \SB{}, with
significant latency in many of the 256-bit instructions in comparison
to their 128-bit counterparts. The 256-bit instructions also have
different scheduling constraints that seem to reduce overall
throughput.  If these latency issues can be addressed in future AVX
implementations, further substantial performance and energy benefits
could be realized in XML parsing with Parabix.