\section{Parabix on AVX}
\label{section:avx}
In this section, we discuss the scalability and performance advantages
of our 256-bit AVX (Advanced Vector Extensions) Parabix-XML port.  The
Parabix runtime libraries originally targeted the 128-bit SSE2 SIMD
technology, available on all modern 64-bit Intel and AMD processors.
The runtime has recently been ported to AVX, which is commercially
available on the latest Intel processors, which implement the \SB\
microarchitecture. Although the runtime had to be ported to
the new ISA, no modifications were made to the application.

\subsection{3-Operand Form}
In addition to widening the 128-bit operations to 256 bits,
AVX technology uses a nondestructive 3-operand instruction
format. Previous SSE implementations used a destructive 2-operand
instruction format, in which a single register serves
as both a source and the destination. As such, code that still needs the
values of both source operands $a$ and $b$ after a 2-operand instruction
must either copy one of them to an additional register
beforehand, or reconstitute or reload the overwritten value
afterwards.  With the 3-operand format, output
may now be directed to a third register independently of the source
operands. Avoiding the need to
copy or reconstitute operand values yields a considerable reduction
in the instructions required for unloading from and loading into
registers.  AVX technology makes available the 3-operand form for both
the new 256-bit operations and the base 128-bit SSE operations.

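As a minimal illustration (not taken from Parabix compiler output), suppose
a bitwise AND of registers $a$ and $b$ is needed while both inputs must
remain live, and let $t$ be a hypothetical scratch register.  In
register-transfer notation:
\begin{displaymath}
\begin{array}{lll}
\mbox{2-operand (SSE):} & t \leftarrow a;\quad t \leftarrow t \wedge b & \mbox{(2 instructions)} \\
\mbox{3-operand (AVX):} & t \leftarrow a \wedge b & \mbox{(1 instruction)}
\end{array}
\end{displaymath}
On x86 this would correspond to, for example, a \texttt{movdqa} register
copy followed by \texttt{pand}, versus a single \texttt{vpand}.
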
\subsection{256-bit Operations}
With the introduction of 256-bit SIMD registers, and under ideal
conditions, one would anticipate a corresponding 50\% reduction in the
SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
implementation, Intel has focused primarily on floating-point
operations as opposed to integer-based operations.  256-bit SIMD
is available for loads, stores, bitwise logic and floating-point operations,
whereas SIMD integer operations and shifts are only available in
128-bit form.

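The effect of this partial 256-bit support on the ideal estimate can be
sketched with a simple model.  The symbols are introduced here for
illustration only and are not measured quantities: let $n_{128}$ be the
SIMD instruction count at 128-bit width and $f$ the fraction of those
instructions that have a 256-bit form.  Each such instruction then covers
twice as many bits, while every remaining 256-bit operation must be
emulated by a pair of 128-bit operations, so the count becomes at best
\begin{displaymath}
n_{256} = \frac{f}{2}\,n_{128} + (1-f)\,n_{128}
        = \left(1 - \frac{f}{2}\right) n_{128}.
\end{displaymath}
The ideal 50\% reduction is reached only when $f = 1$.  This model ignores
any extra instructions needed to split or recombine 256-bit values.
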
\subsection{Performance Results}

We implemented two versions of Parabix-XML using AVX technology.  The
first was simply a recompilation of the existing Parabix-XML source code
to take advantage of the 3-operand form of AVX instructions
while retaining a uniform 128-bit SIMD processing width.  The second
involved rewriting the Parabix runtime library to
leverage the 256-bit AVX instructions wherever possible and to simulate
the remaining operations using pairs of 128-bit operations. Figure
\ref{insmix} shows the reduction in instruction counts achieved in
these two versions. For each workload, the base instruction count of
the Parabix binary compiled in 2-operand SSE-only mode is labeled ``sse,''
the version that only takes advantage of the AVX 3-operand mode is
labeled ``128-bit avx,'' and the version that uses the 256-bit
operations wherever possible is labeled ``256-bit avx.''  The
instruction counts are divided into three classes.  The ``non-SIMD''
class comprises the general-purpose instructions.  The ``bitwise SIMD''
class comprises the bitwise logic operations, which are available in
both 128-bit and 256-bit forms; it excludes bitwise shifts, which are
only available in 128-bit form.  The ``other SIMD'' class
comprises all other SIMD operations, primarily the integer
SIMD operations that are available only at 128-bit widths even under
AVX.

\begin{figure*}[htbp]
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}

\begin{figure}[!h]
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Parabix Performance (y-axis: ns per kB)}
\label{avx}
\end{figure}

Note that, for each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As expected,
the number of bitwise SIMD operations remains the same
for both SSE and 128-bit AVX, while dropping dramatically when operating
on 256 bits at a time. The reduction was measured at 32\%--39\%, depending
on the markup density of the workload. The ``other SIMD'' class
shows a substantial 30\%--35\% reduction with 128-bit AVX technology
compared to SSE. This reduction is due to the elimination of register
unloading and reloading when SIMD operations are compiled using the
3-operand AVX form rather than the 2-operand SSE form.  A further 10\%--20\%
reduction is observed when Parabix-XML uses the AVX runtime
library.

%[AS] Check numbers.
The reductions in instruction counts with the AVX
extensions are quite dramatic, demonstrating the ability of our runtime
framework to exploit the available hardware resources. As shown in
Figure \ref{avx}, however, the benefits of the reduced SIMD instruction count
are realized only in the 128-bit AVX version.  In this case, the
benefits of the 3-operand form seem to translate fully into performance
benefits.  Based on the reduction in overall bitwise SIMD instructions,
we expected an 11\% improvement in performance.  Instead, perhaps
bizarrely, the performance of Parabix in the 256-bit AVX
implementation does not improve significantly and actually degrades
for files with higher markup density ($\sim11\%$). For dew.xml, on
which bitwise SIMD instructions were reduced by 39\%, we saw a performance
improvement of 8\%.  We believe that this is primarily due to the
intricacies of the first-generation AVX implementation in \SB{}, with
significant latency in many of the 256-bit instructions in comparison
to their 128-bit counterparts. The 256-bit instructions also have
different scheduling constraints that seem to reduce overall
throughput.  If these latency issues can be addressed in future AVX
implementations, further performance and energy benefits
could be realized in Parabix-XML.
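
A rough way to relate instruction-count reductions to an expected speedup
is given here only as a back-of-the-envelope sketch.  Let $s$ denote the
share of all executed instructions that are bitwise SIMD and $r$ the
fractional reduction within that class; both symbols are introduced for
illustration.  If the other classes are unchanged and the average number of
cycles per instruction stays constant, the expected reduction in run time
$T$ is
\begin{displaymath}
\frac{\Delta T}{T} \approx s \cdot r .
\end{displaymath}
For a hypothetical share $s = 0.3$ and $r = 0.35$ (within the measured
32\%--39\% range), this gives about 10.5\%.  The 256-bit results above
indicate that the constant cycles-per-instruction assumption does not hold
on \SB{}.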