% Source: docs/HPCA2012/final_ieee/07-avx.tex, revision 1783 (checked in by ashriram, commit message: Final pass)
\section{Parabix on AVX}
\label{section:avx}
In this section, we discuss the scalability and performance advantages
of our 256-bit AVX (Advanced Vector Extensions) Parabix-XML port.  The
Parabix runtime libraries originally targeted the 128-bit SSE2 SIMD
technology, available on all modern 64-bit Intel and AMD processors.
The runtime has recently been ported to AVX, which is commercially
available on the latest Intel processors based on the \SB\
microarchitecture. Although the runtime had to be ported to
the new ISA, no modifications were made to the application.

\subsection{3-Operand Form}
In addition to widening the 128-bit operations to 256-bit operations,
AVX technology uses a nondestructive 3-operand instruction
format. Previous SSE implementations used a destructive 2-operand
instruction format. In the 2-operand format, a single register serves
as both a source and the destination, so the instruction overwrites one
of its inputs. As such, code that still requires the values of both
source operands $a$ and $b$ after a 2-operand instruction must either
copy a register value beforehand or reconstitute it
afterwards.  With the 3-operand format, output
may now be directed to a third register independently of the source
operands. Avoiding the need to
copy or reconstitute operand values considerably reduces
the number of instructions required for unloading from and loading into
registers.  AVX technology makes the 3-operand form available for both
the new 256-bit AVX operations and the 128-bit SSE operations.
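
To make the difference concrete, the following C sketch uses standard
SSE2 intrinsics for a bitwise select in which the mask is read by two
operations. The function name \texttt{select\_bits} is hypothetical and
is not taken from the Parabix runtime, and the instruction sequences in
the comments (including the register assignments) are only what a
compiler might typically emit, not measured output.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (128-bit integer SIMD) */

/*
 * Bitwise select: take bits of x where mask is 1, bits of y where mask is 0.
 * select_bits is a hypothetical name used for illustration only.
 *
 * The mask is an input to two operations, so it must survive the first one.
 *
 * Illustrative 2-operand SSE encoding (destination is also a source):
 *     movdqa xmm3, xmm0        ; extra copy so the mask is not destroyed
 *     pand   xmm3, xmm1        ; xmm3 = mask & x
 *     pandn  xmm0, xmm2        ; xmm0 = ~mask & y (mask overwritten here)
 *     por    xmm0, xmm3
 *
 * Illustrative 3-operand AVX (VEX) encoding of the same 128-bit operations:
 *     vpand  xmm3, xmm0, xmm1  ; no copy needed, xmm0 is left intact
 *     vpandn xmm0, xmm0, xmm2
 *     vpor   xmm0, xmm3, xmm0
 */
static __m128i select_bits(__m128i mask, __m128i x, __m128i y)
{
    __m128i from_x = _mm_and_si128(mask, x);     /* mask & x  */
    __m128i from_y = _mm_andnot_si128(mask, y);  /* ~mask & y */
    return _mm_or_si128(from_x, from_y);
}
```

The C source is the same in both cases. Only the encoding selected by
the compiler changes, for example when AVX code generation is enabled
with a flag such as \texttt{-mavx} in GCC.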

\subsection{256-bit Operations}
With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations.  256-bit SIMD is available for loads, stores, bitwise
logic and floating-point operations, whereas SIMD integer operations and
shifts are only available in the 128-bit form.
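
One way to provide a uniform 256-bit interface despite this restriction
is to apply a 128-bit operation to each half of a 256-bit value. The
sketch below illustrates the idea with SSE2 intrinsics only. The type
\texttt{simd256\_pair} and the function names are hypothetical and do not
reproduce the Parabix runtime, and with real AVX registers the two halves
would additionally have to be extracted and recombined.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (128-bit integer SIMD) */

/* Hypothetical 256-bit value held as two 128-bit halves. */
typedef struct {
    __m128i lo;  /* bits   0..127 */
    __m128i hi;  /* bits 128..255 */
} simd256_pair;

/* Add the four 64-bit lanes of a and b using a pair of 128-bit adds. */
static simd256_pair add64_256(simd256_pair a, simd256_pair b)
{
    simd256_pair r;
    r.lo = _mm_add_epi64(a.lo, b.lo);
    r.hi = _mm_add_epi64(a.hi, b.hi);
    return r;
}

/* Shift each of the four 64-bit lanes left by n bits,
 * again with one 128-bit shift per half. */
static simd256_pair sll64_256(simd256_pair a, int n)
{
    __m128i count = _mm_cvtsi32_si128(n);  /* shift count in the low lane */
    simd256_pair r;
    r.lo = _mm_sll_epi64(a.lo, count);
    r.hi = _mm_sll_epi64(a.hi, count);
    return r;
}
```

Each simulated 256-bit integer operation therefore costs two 128-bit
instructions, so operations of this kind gain nothing in instruction
count from the wider registers.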

\subsection{Performance Results}

We implemented two versions of Parabix-XML using AVX technology.  The
first was simply a recompilation of the existing Parabix-XML source code
to take advantage of the 3-operand form of AVX instructions
while retaining a uniform 128-bit SIMD processing width.  The second
involved rewriting the Parabix runtime library to
leverage the 256-bit AVX instructions wherever possible and to simulate
the remaining operations using pairs of 128-bit operations. Figure
\ref{insmix} shows the reduction in instruction count achieved in
each version. For each workload, the base instruction count of
the Parabix binary compiled in 2-operand SSE-only mode is labeled ``sse'',
the version that only takes advantage of the AVX 3-operand mode is
labeled ``128-bit avx'', and the version that uses the 256-bit
operations wherever possible is labeled ``256-bit avx''.  The
instruction counts are divided into three classes.  The ``non-SIMD''
class comprises the general purpose instructions.  The ``bitwise SIMD''
class comprises the bitwise logic operations, which are available in
both 128-bit and 256-bit forms; it excludes bitwise shifts, which are
only available in 128-bit form.  The ``other SIMD'' class
comprises all other SIMD operations, primarily the integer
SIMD operations that are available only at 128-bit widths even under
AVX.

The number of non-SIMD instructions remains relatively constant
across implementations.  The number of bitwise SIMD
operations is the same for SSE and 128-bit AVX, but
drops dramatically when operating on 256 bits at a time. The reduction
was measured at 32\%--39\%, depending on the markup density of the
workload. The ``other SIMD'' class shows a substantial 30\%--35\%
reduction with 128-bit AVX technology compared to SSE. This reduction
is due to the elimination of register unloading and reloading when SIMD
operations are compiled using the 3-operand AVX form rather than the
2-operand SSE form.  A further 10\%--20\% reduction is observed when
Parabix-XML uses the AVX runtime library.

%[AS] Check numbers.
The reductions in instruction counts with the AVX extensions are
significant, demonstrating the ability of Parabix to
exploit wider SIMD extensions. However, Figure
\ref{avx} shows that the benefits of the reduced SIMD instruction count are
achieved only in the 128-bit AVX version, where the 3-operand form seems to translate fully into performance benefits.
Based on the reduction in overall bitwise SIMD instructions, we
expected an 11\% improvement in performance.  Surprisingly, the
performance of Parabix in the 256-bit AVX implementation does not
improve significantly and actually degrades for files with higher
markup density ($\sim$11\%). For dew.xml, on which bitwise SIMD
instructions were reduced by 39\%, we saw a performance improvement of
8\%.  We believe that this is primarily due to the intricacies of the
first-generation AVX implementation in \SB{}, with significant latency
in many of the 256-bit instructions in comparison to their 128-bit
counterparts. The 256-bit instructions also have different scheduling
constraints that seem to reduce overall throughput.  If these latency
issues can be addressed in future AVX implementations, further
performance and energy benefits could be realized by Parabix.
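
An expectation of this kind can be read as a first-order estimate.
Suppose, as a simplifying assumption rather than a measured model, that
run time is proportional to the total instruction count and that only
the bitwise SIMD class changes between the two AVX versions. Let $f$ be
the fraction of instructions in the bitwise SIMD class before the change
and $r$ the fractional reduction within that class; both symbols are
introduced here for illustration only. The expected relative improvement
is then
\[
  \frac{T_{128} - T_{256}}{T_{128}} \approx f\,r ,
\]
where $T_{128}$ and $T_{256}$ denote the run times of the 128-bit and
256-bit AVX versions. For example, $r = 0.39$ together with a
hypothetical $f = 0.28$ (a value chosen for illustration, not reported
in this section) gives $f\,r \approx 0.11$. The estimate ignores
per-instruction latency and scheduling effects, which is precisely where
the measured 256-bit results depart from it.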

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix Performance (y-axis: ns per kB)}
\label{avx}
\end{figure}