% Source: docs/HPCA2012/07-avx.tex, revision 1339 (checked in by cameron)

\section{Scaling Parabix2 for AVX}
\label{section:avx}
In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) Parabix2 port.
Parabix2 originally targeted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors but
has recently been ported to AVX. AVX technology is commercially available on the
latest Intel processors based on the \SB\ microarchitecture.

\begin{figure*}
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix2 Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix2 Performance (y-axis: CPU Cycles per kB)}
\label{avx}
\end{figure}

\subsection{Three Operand Form}

In addition to widening 128-bit operations to 256 bits, AVX technology
uses a nondestructive 3-operand instruction format. Previous SSE implementations
used a destructive 2-operand instruction format, in which
a single register serves as both a source and
the destination. For example, $a = a~\texttt{[op]}~b$.
As such, code that still requires the original values of both $a$ and $b$ after a 2-operand instruction
must either copy a register value beforehand or reconstitute or reload it
afterwards.
With the 3-operand format, output may instead be directed to a third register independently
of the source operands. For example, $c = a~\texttt{[op]}~b$.
By avoiding the copying or reconstituting of operand values, a considerable
reduction in instruction count is possible, in the form of fewer copy, load and store instructions.
AVX technology makes the 3-operand form available for both the new 256-bit
operations and the base 128-bit SSE operations.
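
As a small illustration in the notation above, consider computing the bitwise AND of $a$ and $b$ when both values are still needed afterwards. The register names $a$, $b$ and $t$ are ours, chosen only for this sketch. In 2-operand form a copy must precede the operation, whereas in 3-operand form a single instruction suffices:
\[
\begin{array}{lll}
\mbox{2-operand:} & t = a;\quad t = t \wedge b & \mbox{(two instructions)} \\
\mbox{3-operand:} & t = a \wedge b & \mbox{(one instruction)}
\end{array}
\]
On x86 this corresponds roughly to a \texttt{movdqa}/\texttt{pand} pair versus a single \texttt{vpand}, although the code a compiler actually emits depends on register allocation and on which values remain live.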

\subsection{256-bit AVX Operations}

With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations rather than integer operations.  256-bit SIMD
is available for loads, stores, bitwise logic and floating-point operations,
whereas SIMD integer operations and shifts are available only in
128-bit form.  Nevertheless, with loads, stores and bitwise logic
comprising a major portion of the Parabix2 SIMD instruction mix, a
substantial reduction in instruction count and a consequent performance
improvement were anticipated.  As the results in the next subsection show,
the instruction count did fall, but the anticipated performance improvement was largely not achieved.
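
The ideal-case expectation can be made explicit with a simple counting model. This is a sketch under our own assumptions, not a measured result. Let $N_{128}$ be the number of 128-bit SIMD instructions needed to process a given input, and let $f$ (a symbol introduced here) be the fraction of those instructions that have a 256-bit form. If each such instruction is replaced by half as many 256-bit instructions, and the remainder stay at 128-bit width with no added overhead, then
\[
N_{256} \approx \frac{f}{2}\,N_{128} + (1-f)\,N_{128} = \left(1 - \frac{f}{2}\right) N_{128},
\]
so the full 50\% reduction is reached only when $f = 1$.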

\subsection{Performance Results}

We implemented two versions of Parabix2 using AVX technology.  The first
was simply a recompilation of the existing Parabix2 source code
to take advantage of the 3-operand form of AVX instructions while retaining
a uniform 128-bit SIMD processing width.  The second involved rewriting the
core library functions of Parabix2 to leverage the 256-bit AVX operations wherever
possible and to simulate the remaining operations using pairs of 128-bit
operations.

Figure~\ref{insmix} shows the reduction in instruction
counts achieved in these two versions.  For each workload, the
base instruction count of the Parabix2 binary compiled in SSE-only
mode is labelled ``sse,'' the version obtained by
simple recompilation with AVX mode enabled is labelled ``128-bit avx,''
and the version reimplemented to use 256-bit operations wherever
possible is labelled ``256-bit avx.''  The instruction counts
are divided into three classes.  The ``non-SIMD'' operations
are the general-purpose instructions that use neither SSE nor
AVX technology.  The ``bitwise SIMD'' class comprises
the bitwise logic operations, which are available in both 128-bit
and 256-bit forms.  The ``other SIMD'' class comprises
all other SIMD operations, primarily the integer SIMD
operations that are available only at 128-bit width even with
256-bit AVX technology.

Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As may be expected,
the number of ``bitwise SIMD'' operations remains the same
for both SSE and 128-bit AVX while dropping dramatically when operating
on 256 bits at a time.  Ideally, one may expect up to a 50\% reduction
in these instructions versus 128-bit AVX.  The actual reduction
measured was 32\%--39\%, depending on workload.  Because some bitwise
logic is needed to implement the simulated 256-bit operations, the
full 50\% reduction in bitwise logic was not achieved.
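
The gap between the ideal and measured reductions gives a rough bound on this overhead. As a back-of-the-envelope sketch (the symbols $B_{128}$, $B_{256}$ and $E$ are ours, and we assume that every bitwise operation of the 128-bit version halves exactly), write the 256-bit bitwise count as
\[
B_{256} = \frac{1}{2}\,B_{128} + E,
\]
where $E$ counts the extra bitwise instructions introduced by simulating 256-bit operations. The relative reduction is then
\[
\frac{B_{128} - B_{256}}{B_{128}} = \frac{1}{2} - \frac{E}{B_{128}},
\]
so the measured reductions of 32\%--39\% would correspond to $E/B_{128}$ between roughly $0.11$ and $0.18$ under this assumption.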

The ``other SIMD'' class shows a substantial 30\%--35\% reduction
with 128-bit AVX technology compared to SSE.  This reduction is
due to the elimination of copies and reloads when SIMD operations
are compiled using the 3-operand AVX form rather than the 2-operand SSE form.
A further 10\%--20\% reduction is observed with the Parabix2 version
rewritten to use 256-bit operations.

While the successive reductions in SIMD instruction counts are quite
dramatic with the two AVX implementations of Parabix2, the performance
benefits are another story.  As shown in Figure~\ref{avx}, the
benefits of the reduced SIMD instruction count are achieved only in
the 128-bit AVX version.  In this case, the benefits of the 3-operand form
seem to translate fully into performance benefits.  Based on the
reduction in overall bitwise SIMD instructions, we expected an 11\%
improvement in performance.  Instead, perhaps bizarrely, the
performance of Parabix2 in the 256-bit AVX implementation does not
improve significantly and actually degrades for files with higher
markup density (by 10\% on average). Dewiki.xml, on which bitwise SIMD
instructions were reduced by 39\%, saw a performance improvement of 8\%.
We believe that this is primarily due to the intricacies of the first-generation
AVX implementation in \SB{}, with significant latency in many
of the 256-bit instructions in comparison to their 128-bit
counterparts. The 256-bit instructions also have different scheduling
constraints that seem to reduce overall SIMD throughput.  If these
latency issues can be addressed in future AVX implementations, further
substantial performance and energy benefits could be realized in XML
parsing with Parabix2.
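
One way to read the expected 11\% figure above is as a first-order estimate in which run time scales with instruction count. This is our own sketch, with symbols introduced here rather than taken from the measurements. If bitwise SIMD instructions make up a fraction $s$ of all instructions executed and their count falls by a fraction $r$, then the total instruction count falls by approximately
\[
\Delta \approx s \cdot r .
\]
Under this model, an expected improvement of $\Delta = 0.11$ together with the measured $r$ of $0.32$ to $0.39$ would imply $s$ between roughly $0.28$ and $0.34$. The model ignores instruction latency and scheduling, which is where the 256-bit results depart from it.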