Ignore:
Timestamp:
Aug 24, 2011, 1:55:27 PM (8 years ago)
Author:
ashriram
Message:

Fixed methodology

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2012/07-avx.tex

    r1361 r1365  
    1 \section{Scaling Parabix2 for AVX}
     1\section{Scaling Parabix for AVX}
    22\label{section:avx}
    3 In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) Parabix2 port.
    4 Parabix2 originally targetted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors but
    5 has recently been ported to AVX. AVX technology is commercially available on the
    6 latest the \SB\ microarchitecture Intel processors.
     3In this section, we discuss the scalability and performance advantages
     4of our 256-bit AVX (Advanced Vector Extensions) Parabix XML port.  The
     5Parabix SIMD libraries originally targetted the 128-bit SSE2 SIMD
     6technology available on all modern 64-bit Intel and AMD processors but
     7has recently been ported to AVX. AVX technology is commercially
     8available on the latest the \SB\ microarchitecture Intel
     9processors. While we have to port our runtime framework the
     10application didn't need to be modified.
    711
    812\begin{figure*}
     
    1014\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
    1115\end{center}
    12 \caption{Parabix2 Instruction Counts (y-axis: Instructions per kB)}
     16\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
    1317\label{insmix}
    1418\end{figure*}
     
    1822\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
    1923\end{center}
    20 \caption{Parabix2 Performance (y-axis: ns per kB)}
     24\caption{Parabix Performance (y-axis: ns per kB)}
    2125\label{avx}
    2226\end{figure}
    2327
    24 \subsection{Three Operand Form}
     28\paragraph{3-Operand Form}
     29In addition to the widening of 128-bit operations to 256-bit
     30operations, AVX technology uses a nondestructive 3-operand instruction
     31format. Previous SSE implementations used a destructive 2-operand
     32instruction format. In the 2-operand format a single register is used
     33as both a source and destination register. For example, $a =
     34a~\texttt{[op]}~b$.  As such, 2-operand instructions that require the
     35value of both $a$ and $b$, must either copy an additional register
     36value beforehand, or reconstitute or reload a register value
     37afterwards to recover the value.  With the 3-operand format, output
     38may now be directed to the third register independently of the source
     39operands. For example, $c = a~\texttt{[op]}~b$.  By avoiding the
     40copying or reconstituting of operand values, a considerable reduction
     41in instructions required for unloading from and loading into
     42registers.  AVX technology makes available the 3-operand form for both
     43the new 256-bit operations as well as the base 128-bit SSE operations.
    2544
    26 In addition to the widening of 128-bit operations to 256-bit operations, AVX technology
    27 uses a nondestructive 3-operand instruction format. Previous SSE implementations
    28 used a destructive 2-operand instruction format. In the 2-operand format
    29 a single register is used as both a source and
    30 destination register. For example, $a = a~\texttt{[op]}~b$.
    31 As such, 2-operand instructions that require the value of both $a$ and $b$,
    32 must either copy an additional register value beforehand, or reconstitute or reload a register value
    33 afterwards to recover the value.
    34 With the 3-operand format, output may now be directed to the third register independently
    35 of the source operands. For example, $c = a~\texttt{[op]}~b$.
    36 By avoiding the copying or reconstituting of operand values, a considerable
    37 reduction in instruction count in the form of reduced load and store instructions is possible.
    38 AVX technology makes available the 3-operand form for both the new 256-bit
    39 operations as well as the base 128-bit SSE operations.
    40 
    41 \subsection{256-bit AVX Operations}
    42 
     45\subsection{256-bit Operations}
    4346With the introduction of 256-bit SIMD registers, and under ideal
    4447conditions, one would anticipate a corresponding 50\% reduction in the
    45 SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX
     48SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
    4649implementation, Intel has focused primarily on floating point
    4750operations as opposed to the integer based operations.  256-bit SIMD
    4851is available for loads, stores, bitwise logic and floating operations,
    4952whereas SIMD integer operations and shifts are only available in the
    50 128-bit form.  Nevertheless, with loads, stores and bitwise logic
    51 comprising a major portion of the Parabix2 SIMD instruction mix, a
    52 substantial reduction in instruction count and consequent performance
    53 improvement was anticipated but not achieved.
     53128-bit form.
     54
    5455
    5556\subsection{Performance Results}
    5657
    57 We implemented two versions of Parabix2 using AVX technology.  The first
    58 was simply the recompilation of the existing Parabix2 source code
    59 written to take advantage of the 3-operand form of AVX instructions while retaining
    60 a uniform 128-bit SIMD processing width.  The second involved rewriting the
    61 core library functions of Parabix2 to leverage the 256-bit AVX operations wherever
    62 possible and to simulate the remaining operations using pairs of 128-bit
    63 operations.   
    64 
    65 Figure \ref{insmix} shows the reduction in instruction
    66 counts achieved in these two versions.   For each workload, the
    67 base instruction count of the Parabix2 binary compiled in SSE-only
    68 mode is shown with the caption ``sse,'' the version obtained by
    69 simple recompilation with AVX-mode enabled is labeled ``128-bit avx,''
    70 and the version reimplemented to use 256-bit operations wherever
    71 possible is labelled ``256-bit avx.''    The instruction counts
    72 are divided into three classes.  The ``non-SIMD'' operations
    73 are the general purpose instructions that use neither SSE nor
    74 AVX technology.   The ``bitwise SIMD'' class comprises
    75 the bitwise logic operations, that are available in both 128-bit
    76 form and 256-bit form.  The ``other SIMD'' class comprises
    77 all other SIMD operations, primarily comprising the integer SIMD
    78 operations that are available only at 128-bit widths even with
    79 256-bit AVX technology.
     58We implemented two versions of Parabix using AVX technology.  The
     59first was simply the recompilation of the existing Parabix source code
     60written to take advantage of the 3-operand form of AVX instructions
     61while retaining a uniform 128-bit SIMD processing width.  The second
     62involved rewriting the internal library functions of Parabix to
     63leverage the 256-bit AVX operations wherever possible and to simulate
     64the remaining operations using pairs of 128-bit operations.Figure
     65\ref{insmix} shows the reduction in instruction counts achieved in
     66these two versions.  For each workload, the base instruction count of
     67the Parabix binary compiled in SSE-only mode is indicated by ``sse,''
     68the version which only takes advantage of the AVX 3-operand mode is
     69labeled ``128-bit avx,'' and the version reimplemented to use 256-bit
     70operations wherever possible is labelled ``256-bit avx.''  The
     71instruction counts are divided into three classes: ``non-SIMD''
     72operations are the general purpose instructions.  The ``bitwise SIMD''
     73class comprises the bitwise logic operations, that are available in
     74both 128-bit form and 256-bit form.  The ``other SIMD'' class
     75comprises all other SIMD operations, primarily comprising the integer
     76SIMD operations that are available only at 128-bit widths even under
     77AVX.
    8078
    8179Note that, in each workload, the number of non-SIMD instructions
    82 remains relatively constant with each workload.  As may be expected,
    83 however, the number of ``bitwise SIMD'' operations remains the same
     80remains relatively constant with each workload.  As may be expected
     81the number of \textit{bit-parallel SIMD} operations remains the same
    8482for both SSE and 128-bit while dropping dramatically when operating
    85 256-bits at a time.  Ideally one one may expect up to a 50\% reduction
    86 in these instructions versus the 128-bit AVX.  The actual reduction
    87 measured was 32\%--39\% depending on workload.  Because some bitwise
    88 logic is needed in implementation of simulated 256-bit operations, the
    89 full 50\% reduction in bitwise logic was not achieved.
     83256-bits at a time.  The reduction measured was 32\%--39\% depending
     84on workload because some bitwise logic needed in implementation is
     85composed of 128-bit operations. The limits the performance gains
     86achieved when using the AVX instructions.  The ``other SIMD'' class
     87shows a substantial 30\%-35\% reduction with AVX 128-bit technology
     88compared to SSE.  This reduction is due to elimination of register
     89unloading and reloading when SIMD operations are compiled using
     903-operand AVX form versus 2-operand SSE form.  A further 10\%--20\%
     91reduction is observed with Parabix version rewritten to use 256-bit
     92operations.
    9093
    91 The ``other SIMD'' class shows a substantial 30\%-35\% reduction
    92 with AVX 128-bit technology compared to SSE.  This reduction is
    93 due to eliminated copies or reloads when SIMD operations
    94 are compiled using 3-operand AVX form versus 2-operand SSE form.
    95 A further 10\%--20\% reduction is observed with Parabix2 version
    96 rewritten to use 256-bit operations. 
    97 
    98 While the successive reductions in SIMD instruction counts are quite
    99 dramatic with the two AVX implementations of Parabix2, the performance
    100 benefits are another story.  As shown in Figure \ref{avx}, the
    101 benefits of the reduced SIMD instruction count are achieved only in
    102 the AVX 128-bit version.  In this case, the benefits of 3-operand form
    103 seem to fully translate to performance benefits.  Based on the
    104 reduction of overall Bitwise-SIMD instructions we expected a 11\%
    105 improvement in performance.  Instead, perhaps bizzarely, the
    106 performance of Parabix2 in the 256-bit AVX implementation does not
    107 improve significantly and actually degrades for files with higher
    108 markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD
    109 instructions reduced by 39\%, saw a performance improvement of 8\%.
    110 We believe that this is primarily due to the intricacies of the first
    111 generation AVX implemention in \SB{}, with significant latency in many
    112 of the 256-bit instructions in comparison to their 128-bit
    113 counterparts. The 256-bit instructions also have different scheduling
    114 constraints that seem to reduce overall SIMD throughput.  If these
    115 latency issues can be addressed in future AVX implementations, further
    116 substantial performance and energy benefits could be realized in XML
    117 parsing with Parabix2.
     94%[AS] Check numbers.
     95The reductions in instruction counts are quite dramatic with the AVX
     96extensions in Parabix demonstrating the ability of our runtime
     97framework to exploit the available hardware resources. As shown in
     98Figure \ref{avx}, the benefits of the reduced SIMD instruction count
     99are achieved only in the AVX 128-bit version.  In this case, the
     100benefits of 3-operand form seem to fully translate to performance
     101benefits.  Based on the reduction of overall Bitwise-SIMD instructions
     102we expected a 11\% improvement in performance.  Instead, perhaps
     103bizzarely, the performance of Parabix in the 256-bit AVX
     104implementation does not improve significantly and actually degrades
     105for files with higher markup density (average 11\%). Dewiki.xml, on
     106which bitwise-SIMD instructions reduced by 39\%, saw a performance
     107improvement of 8\%.  We believe that this is primarily due to the
     108intricacies of the first generation AVX implemention in \SB{}, with
     109significant latency in many of the 256-bit instructions in comparison
     110to their 128-bit counterparts. The 256-bit instructions also have
     111different scheduling constraints that seem to reduce overall
     112throughput.  If these latency issues can be addressed in future AVX
     113implementations, further substantial performance and energy benefits
     114could be realized in XML parsing with Parabix.
Note: See TracChangeset for help on using the changeset viewer.