\section{Scaling Parabix for AVX}
\label{section:avx}
In this section, we discuss the scalability and performance advantages
of our 256-bit AVX (Advanced Vector Extensions) Parabix XML port.  The
Parabix SIMD libraries originally targeted the 128-bit SSE2 SIMD
technology available on all modern 64-bit Intel and AMD processors but
have recently been ported to AVX.  AVX technology is commercially
available on the latest \SB\ microarchitecture Intel processors.
While we had to port our runtime framework, the application did not
need to be modified.

\begin{figure*}
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix Performance (y-axis: ns per kB)}
\label{avx}
\end{figure}

\paragraph{3-Operand Form}
In addition to the widening of 128-bit operations to 256-bit
operations, AVX technology uses a nondestructive 3-operand instruction
format.  Previous SSE implementations used a destructive 2-operand
instruction format, in which a single register serves as both a source
and the destination; for example, $a = a~\texttt{[op]}~b$.  As such,
2-operand code that still needs the original value of $a$ after the
operation must either copy it to an additional register beforehand,
or reconstitute or reload it afterwards.  With the 3-operand format,
output may be directed to a third register independently of the source
operands; for example, $c = a~\texttt{[op]}~b$.  Avoiding the
copying or reconstituting of operand values yields a considerable
reduction in the instructions required for unloading from and loading
into registers.  AVX technology makes the 3-operand form available for
both the new 256-bit operations and the base 128-bit SSE operations.
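
To make the difference concrete, consider a hypothetical fragment
(the temporaries $t$ and $c$ are illustrative only and are not taken
from the Parabix source) that computes $a~\texttt{[op]}~b$ while $a$
must remain live afterwards:
\[
\begin{array}{ll}
\mbox{2-operand:} & t \leftarrow a; \quad t \leftarrow t~\texttt{[op]}~b \\
\mbox{3-operand:} & c \leftarrow a~\texttt{[op]}~b
\end{array}
\]
In this sketch, the 2-operand sequence spends one extra register copy
for each operand that must be preserved, which the 3-operand form
avoids.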

\subsection{256-bit Operations}
With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations as opposed to integer operations.  256-bit SIMD is
available for loads, stores, bitwise logic and floating-point
operations, whereas SIMD integer operations and shifts are available
only in 128-bit form.
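
As a back-of-the-envelope illustration of the ideal case (a sketch
only, assuming 1~kB $= 1024$ bytes; the symbol $n_w$ is introduced
here purely for illustration), let $n_w$ be the number of
register-width operations needed to apply one bitwise operation
across 1~kB of bit stream data using $w$-bit registers:
\[
n_{128} = \frac{8 \times 1024}{128} = 64, \qquad
n_{256} = \frac{8 \times 1024}{256} = 32 .
\]
The halving applies only to operations that exist in 256-bit form;
integer operations and shifts would remain at $n_{128}$.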


\subsection{Performance Results}

We implemented two versions of Parabix using AVX technology.  The
first was simply a recompilation of the existing Parabix source code
to take advantage of the 3-operand form of AVX instructions
while retaining a uniform 128-bit SIMD processing width.  The second
involved rewriting the internal library functions of Parabix to
leverage the 256-bit AVX operations wherever possible and to simulate
the remaining operations using pairs of 128-bit operations.  Figure
\ref{insmix} shows the reduction in instruction counts achieved in
these two versions.  For each workload, the base instruction count of
the Parabix binary compiled in SSE-only mode is indicated by ``sse,''
the version which only takes advantage of the AVX 3-operand mode is
labeled ``128-bit avx,'' and the version reimplemented to use 256-bit
operations wherever possible is labeled ``256-bit avx.''  The
instruction counts are divided into three classes.  The ``non-SIMD''
class comprises the general purpose instructions.  The ``bitwise SIMD''
class comprises the bitwise logic operations, which are available in
both 128-bit and 256-bit form.  The ``other SIMD'' class
comprises all other SIMD operations, primarily the integer
SIMD operations that are available only at 128-bit widths even under
AVX.

Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As may be
expected, the number of ``bitwise SIMD'' operations remains the same
for both SSE and 128-bit AVX while dropping dramatically when operating
on 256 bits at a time.  The reduction measured was 32\%--39\%,
depending on workload, rather than the ideal 50\%, because some of the
bitwise logic needed in the implementation is still composed of 128-bit
operations.  This limits the performance gains achieved when using the
AVX instructions.  The ``other SIMD'' class shows a substantial
30\%--35\% reduction with AVX 128-bit technology compared to SSE.  This
reduction is due to the elimination of register unloading and reloading
when SIMD operations are compiled using the 3-operand AVX form rather
than the 2-operand SSE form.  A further 10\%--20\% reduction is
observed with the Parabix version rewritten to use 256-bit operations.
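
A simple model, offered only as a sketch, relates the 32\%--39\%
reduction in bitwise SIMD instructions to the ideal 50\%.  Assume
(this is our assumption, not a measured quantity) that a fraction $f$
of the $N$ 128-bit bitwise SIMD instructions can be paired into
256-bit instructions, while the remaining fraction $1-f$ stays at 128
bits.  The instruction count then becomes
\[
N' = (1-f)\,N + \frac{f}{2}\,N = \left(1 - \frac{f}{2}\right) N ,
\]
so the relative reduction is $r = f/2$.  Under this model, a measured
$r$ of 32\%--39\% would correspond to $f = 2r$ of roughly 64\%--78\%,
and $f = 1$ recovers the ideal 50\% reduction.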

%[AS] Check numbers.
The reductions in instruction counts with the AVX extensions in
Parabix are quite dramatic, demonstrating the ability of our runtime
framework to exploit the available hardware resources.  As shown in
Figure \ref{avx}, however, the benefits of the reduced SIMD instruction
count are achieved only in the AVX 128-bit version.  In this case, the
benefits of the 3-operand form seem to translate fully into performance
benefits.  Based on the reduction in overall bitwise SIMD instructions,
we expected an 11\% improvement in performance.  Instead, perhaps
bizarrely, the performance of Parabix in the 256-bit AVX
implementation does not improve significantly and actually degrades
for files with higher markup density (average 11\%).  Dewiki.xml, on
which bitwise SIMD instructions were reduced by 39\%, saw a performance
improvement of 8\%.  We believe that this is primarily due to the
intricacies of the first-generation AVX implementation in \SB{}, with
significant latency in many of the 256-bit instructions in comparison
to their 128-bit counterparts.  The 256-bit instructions also have
different scheduling constraints that seem to reduce overall
throughput.  If these latency issues can be addressed in future AVX
implementations, further substantial performance and energy benefits
could be realized in XML parsing with Parabix.