% Source: docs/HPCA2012/final_ieee/07-avx.tex @ 1751 (last changed in revision 1751 by ashriram)
\section{Parabix on AVX}
\label{section:avx}
In this section, we discuss the scalability and performance advantages
of our 256-bit AVX (Advanced Vector Extensions) port of Parabix-XML.  The
Parabix runtime libraries originally targeted the 128-bit SSE2 SIMD
technology, available on all modern 64-bit Intel and AMD processors.
They have recently been ported to AVX, which is commercially
available on Intel processors based on the latest \SB\ microarchitecture.
Although the runtime had to be ported to
the new ISA, no modifications were made to the application.

\subsection{3-Operand Form}
In addition to widening the 128-bit operations to 256 bits,
AVX technology uses a nondestructive 3-operand instruction
format. Previous SSE implementations used a destructive 2-operand
instruction format, in which a single register serves
as both a source and the destination. As a result, 2-operand code that still needs the
values of both source operands $a$ and $b$ after an operation must either copy a register
value beforehand or reconstitute or reload it
afterwards.  With the 3-operand format, output
may be directed to a third register independently of the source
operands. Avoiding the need to
copy or reconstitute operand values considerably reduces
the number of instructions required for unloading from and loading into
registers.  AVX technology makes the 3-operand form available for both
the new 256-bit AVX operations and the 128-bit SSE operations.
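
As an illustrative sketch (not taken from the Parabix sources), consider
computing $c = a \wedge b$ while keeping both $a$ and $b$ live.  The
register assignment is an assumption made for this example: $a$ in
\texttt{xmm0}, $b$ in \texttt{xmm1}, and $c$ in \texttt{xmm2}.  Operands
are written in Intel syntax, destination first.
\begin{verbatim}
; 2-operand SSE2: the destination doubles as a source,
; so a must be copied first to survive the operation
movdqa xmm2, xmm0        ; c = a
pand   xmm2, xmm1        ; c = c AND b

; 3-operand AVX (VEX-encoded, 128-bit): no copy needed
vpand  xmm2, xmm0, xmm1  ; c = a AND b
\end{verbatim}
In this small case the 3-operand form needs one instruction where the
2-operand form needs two; the actual savings in Parabix depend on how
often the compiler must preserve an operand.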

\subsection{256-bit Operations}
With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations.  256-bit SIMD is available for loads, stores, bitwise
logic, and floating-point operations, whereas SIMD integer operations and
shifts are available only in 128-bit form.
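
A simple counting model, offered as a back-of-the-envelope sketch rather
than a measured result, shows why the ideal 50\% is out of reach.  The
symbols are assumptions introduced for this illustration: $N_b$ is the
number of 128-bit SIMD instructions that have 256-bit forms (loads,
stores, and bitwise logic), $N_o$ is the number of 128-bit integer and
shift instructions, $N_{128} = N_b + N_o$, and $f = N_b / N_{128}$.
Processing 256 bits at a time halves the number of blocks, but each
operation without a 256-bit form must be applied to both 128-bit halves:
\[
N_{256} = \frac{N_b}{2} + 2 \cdot \frac{N_o}{2} = \frac{N_b}{2} + N_o ,
\qquad
\frac{N_{128} - N_{256}}{N_{128}} = \frac{N_b}{2\,(N_b + N_o)} = \frac{f}{2} .
\]
The full 50\% reduction is reached only when $f = 1$.  The model ignores
any extra instructions needed to split or recombine register halves, so
it should be read as an optimistic estimate.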

\subsection{Performance Results}
We implemented two versions of Parabix-XML using AVX technology.  The
first simply recompiled the existing Parabix-XML source code
to take advantage of the 3-operand form of AVX instructions
while retaining a uniform 128-bit SIMD processing width.  The second
involved rewriting the Parabix runtime library to
use the 256-bit AVX instructions wherever possible and to simulate
the remaining operations using pairs of 128-bit operations. Figure
\ref{insmix} shows the reduction in instruction counts achieved in
these two versions. For each workload, the base instruction count of
the Parabix binary compiled in 2-operand SSE-only mode is labeled ``sse,''
the version that only takes advantage of the AVX 3-operand mode is
labeled ``128-bit avx,'' and the version that uses the 256-bit
operations wherever possible is labeled ``256-bit avx.''  The
instruction counts are divided into three classes.  The ``non-SIMD''
class comprises the general-purpose instructions.  The ``bitwise SIMD''
class comprises the bitwise logic operations, which are available in
both 128-bit and 256-bit forms; it excludes bitwise shifts, which are
available only in 128-bit form.  The ``other SIMD'' class
comprises all remaining SIMD operations, primarily the integer
SIMD operations that are available only at 128-bit widths even under
AVX.

Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As expected,
the number of bitwise SIMD operations remains the same
for both SSE and 128-bit AVX, while dropping dramatically when operating
on 256 bits at a time. The reduction was measured at 32\%--39\%, depending
on the markup density of the workload. The ``other SIMD'' class
shows a substantial 30\%--35\% reduction with 128-bit AVX technology
compared to SSE. This reduction is due to the elimination of register
unloading and reloading when SIMD operations are compiled using the
3-operand AVX form rather than the 2-operand SSE form.  A further 10\%--20\%
reduction is observed when Parabix-XML uses the rewritten AVX runtime
library.

%[AS] Check numbers.
The reductions in instruction counts with the AVX
extensions are quite dramatic, demonstrating the ability of our runtime
framework to exploit the available hardware resources. As shown in
Figure \ref{avx}, however, the benefits of the reduced SIMD instruction count
are achieved only in the 128-bit AVX version.  In this case, the
benefits of the 3-operand form seem to translate fully into performance
benefits.  Based on the reduction in overall bitwise SIMD instructions,
we expected an 11\% improvement in performance.  Instead, perhaps
surprisingly, the performance of Parabix in the 256-bit AVX
implementation does not improve significantly and actually degrades
for files with higher markup density ($\sim$11\%). The file dew.xml, on
which bitwise SIMD instructions were reduced by 39\%, saw a performance
improvement of 8\%.  We believe that this is primarily due to the
intricacies of the first-generation AVX implementation in \SB{}, in which
many of the 256-bit instructions have significantly higher latency than
their 128-bit counterparts. The 256-bit instructions also have
different scheduling constraints that seem to reduce overall
throughput.  If these latency issues are addressed in future AVX
implementations, further performance and energy benefits
could be realized in Parabix-XML.
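
The gap between instruction counts and run time can be made concrete
with the standard processor performance equation, used here as a
simplified model rather than as an analysis of the measurements.  The
symbols are introduced only for this illustration: $N$ is the dynamic
instruction count, $\mathit{CPI}$ the average number of cycles per
instruction, and $t_c$ the cycle time, so that
\[
T = N \cdot \mathit{CPI} \cdot t_c .
\]
If the 256-bit version reduces the instruction count of the 128-bit
version by a fraction $\delta$ at the same clock rate, then
\[
\frac{T_{256}}{T_{128}} = (1 - \delta)\,
\frac{\mathit{CPI}_{256}}{\mathit{CPI}_{128}} ,
\]
so run time improves only when
$\mathit{CPI}_{256} / \mathit{CPI}_{128} < 1/(1-\delta)$.  Reading the
expected 11\% improvement as $\delta \approx 0.11$ (an assumption), the
bound is $1/0.89 \approx 1.12$: an increase in average cycles per
instruction of roughly 12\% or more, for example from longer-latency
256-bit instructions, would cancel the gain.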

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix Performance (y-axis: ns per kB)}
\label{avx}
\end{figure}