% Source: docs/PACT2011/07-avx.tex @ 1094

\section{Scaling Parabix2 for AVX Technology}

Parabix2 was originally developed for 128-bit SSE2 technology, which
is available on all 64-bit Intel and AMD processors.  In this section,
we discuss how well Parabix2 scales, and how it performs, when adapted
to take advantage of the new 256-bit AVX (Advanced Vector Extensions)
technology that has just become commercially available in the
latest Intel processors based on the \SB\ microarchitecture.
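
\begin{figure}
% Plot body not recoverable from the source; caption and label retained.
\caption{Parabix2 Instruction Counts (y-axis: Instructions per KB)}
\label{insmix}
\end{figure}

\begin{figure}
% Plot body not recoverable from the source; caption and label retained.
\caption{Parabix2 Performance (y-axis: CPU cycles per KB)}
\label{avx}
\end{figure}
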
\subsection{Three Operand Form}

In addition to introducing 256-bit operations, AVX technology
also changes the structure of the base SSE instructions,
moving from the destructive 2-operand form long used with SSE technologies
to a nondestructive 3-operand form.   In the 2-operand form,
one register serves as both a source and the
destination, equivalent to the assignment $a = a~\texttt{[op]}~b$.
Thus, whenever subsequent instructions need the values of both $a$ and $b$,
the overwritten operand has to be copied beforehand, or reconstituted or
reloaded afterwards, in order to recover its value.
With the 3-operand form, output may be directed to a third register independent
of the source operands, as reflected by the assignment $c = a~\texttt{[op]}~b$.
By avoiding the copying or reconstituting of operand values, a considerable
reduction in instruction count may be possible.
AVX technology makes the 3-operand form available both with the new 256-bit
operations and with the base 128-bit operations of SSE.

\subsection{256-bit Operations}

With the introduction of 256-bit SIMD registers with AVX technology,
one might ideally expect up to a 50\% reduction in the instruction
count for the SIMD workload of Parabix2.   However, in the \SB\
implementation, Intel has focused on floating-point
operations rather than integer-based operations.  That is,
256-bit SIMD is available for loads, stores, bitwise logic and
floating-point operations, while SIMD integer operations and shifts are
only available in 128-bit form.   Nevertheless, with loads, stores
and bitwise logic comprising a major portion of the Parabix2
SIMD instruction mix, a substantial reduction in instruction count
and a consequent performance improvement were anticipated.

\subsection{Performance Results}

We implemented two versions of Parabix2 using AVX technology.   The first
was simply a recompilation of the existing Parabix2 source code
to take advantage of the 3-operand form of AVX instructions while retaining
a uniform 128-bit SIMD processing width.  The second involved rewriting
the core library functions of Parabix2 to use 256-bit AVX operations wherever
possible and to simulate the remaining operations using pairs of 128-bit
operations.

Figure \ref{insmix} shows the reduction in instruction
counts achieved in these two versions.   For each workload, the
base instruction count of the Parabix2 binary compiled in SSE-only
mode is labeled ``sse,'' the version obtained by
simple recompilation with AVX mode enabled is labeled ``avx 128-bit,''
and the version reimplemented to use 256-bit operations wherever
possible is labeled ``avx 256-bit.''    The instruction counts
are divided into three classes.  The ``non-SIMD'' operations
are the general purpose instructions that use neither SSE nor
AVX technology.   The ``bitwise SIMD'' class comprises
the bitwise logic operations, which are available in both 128-bit
and 256-bit form.  The ``other SIMD'' class comprises
all other SIMD operations, primarily the integer SIMD
operations that are available only at 128-bit width even with
256-bit AVX technology.

Note that, in each workload, the number of non-SIMD instructions
remains relatively constant across the three versions.  As may be
expected, however, the number of ``bitwise SIMD'' operations
remains the same for both SSE and 128-bit AVX while dropping
dramatically when operating on 256 bits at a time.   Ideally,
one may expect up to a 50\% reduction in these instructions versus
the 128-bit AVX version.  The actual reduction measured was 32\%--39\%,
depending on workload.   Because some bitwise logic is needed
to implement the simulated 256-bit operations, the full 50\%
reduction in bitwise logic was not achieved.
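
As a rough illustration of these figures, let $N$ denote the number of
bitwise SIMD instructions in the 128-bit AVX version and $r$ the measured
reduction (both symbols are introduced here for the example only).
Perfect halving would leave $N/2$ instructions, whereas
\[
  N_{\mathrm{measured}} = (1 - r)\,N, \qquad 0.32 \le r \le 0.39,
\]
so that
\[
  0.61\,N \le N_{\mathrm{measured}} \le 0.68\,N, \qquad
  0.11\,N \le N_{\mathrm{measured}} - \frac{N}{2} \le 0.18\,N.
\]
Under the assumption that the entire shortfall is attributable to
simulation overhead, the additional bitwise logic would amount to
roughly 11\%--18\% of the 128-bit bitwise instruction count.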

The ``other SIMD'' class shows a substantial 30\%--35\% reduction
with AVX 128-bit technology compared to SSE.  This reduction is
due to the elimination of copies or reloads when SIMD operations
are compiled using the 3-operand AVX form rather than the 2-operand SSE form.
A further 10\%--20\% reduction is observed with the Parabix2 version
rewritten to use 256-bit operations.

While the successive reductions in SIMD instruction counts are quite
dramatic with the two AVX implementations of Parabix2, the performance
benefits are another story.   As shown in Figure \ref{avx}, the
benefits of the reduced SIMD instruction count are achieved only
in the AVX 128-bit version.  In this case, the benefits of the 3-operand
form seem to translate fully into performance benefits.
For the 256-bit version, based on the overall reduction in ``bitwise SIMD''
instructions, we expected an 11\% improvement in performance.
Instead, perhaps bizarrely, the performance of Parabix2 in the 256-bit AVX
implementation does not improve significantly and actually degrades for
files with higher markup density (by 10\% on average).  Dewiki.xml, for which
``bitwise SIMD'' instructions were reduced by 39\%, saw a performance
improvement of 8\%.
We believe that this is primarily due to the intricacies of the
first-generation AVX implementation in \SB{}, with significant latency in
many of the 256-bit instructions in comparison to their
128-bit counterparts.  The 256-bit instructions also have different
scheduling constraints that seem to reduce overall SIMD throughput.
If these latency issues can be addressed
in future AVX implementations, further substantial performance and energy
benefits could be realized in XML parsing with Parabix2.