Timestamp:
Mar 25, 2011, 3:16:33 PM (8 years ago)
Author:
cameron
Message:

Section 7

File:
1 edited

  • docs/PACT2011/07-avx.tex

    r993 r999  
    1 \section{AVX}
     1\section{Scaling Parabix2 for AVX Technology}
    22
    3 In this section, we briefly highlight the improvements made in the Advanced Vector Extensions (AVX) extension to the x86 instruction set architecture
    4 and discuss the impact of these improvements on Parabix2. As neither Expat nor Xerces-C benefit from AVX, we do not discuss them in this section.
    5 %The results of our experiments with the AVX and Sandy Bridge architecture can be seen in Figure \ref{avx}.
     3Parabix2 was originally developed for the 128-bit SSE2 technology
     4available on all 64-bit Intel and AMD processors.  In this section,
     5we discuss how Parabix2 scales to take advantage of the new 256-bit
     6AVX (Advanced Vector Extensions) technology and how it performs on
     7that technology, which has just become commercially available in the
     8latest Intel processors based on the Sandy Bridge microarchitecture.
    69
    7 % Following AMD's announcement of their SSE5 architecture, Intel announced their intention to develop the AVX
     10\begin{figure*}
     11\begin{center}
     12\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
     13\end{center}
     14\caption{Parabix2 Instruction Counts (y-axis: Instructions per Byte)}
     15\label{insmix}
     16\end{figure*}
    817
    918\begin{figure}
    1019\begin{center}
    11 \includegraphics[width=85mm]{plots/avx.pdf}
     20\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
    1221\end{center}
    13 \caption{Total CPU cycles /KB on AVX}
     22\caption{Parabix2 Performance (y-axis: CPU cycles per KB)}
    1423\label{avx}
    15 \end{figure}
    16 
    17 \begin{figure}
    18 \begin{center}
    19 \includegraphics[width=85mm]{plots/InsMix.pdf}
    20 \end{center}
    21 \caption{Instructions per byte on Sandybridge}
    22 \label{insmix}
    2324\end{figure}
    2425
    2526\subsection{Three Operand Form}
    2627
    27 Originally, SIMD SSE instructions operated using a two-operand form.
    28 This meant that given any SIMD instruction $a~\texttt{[op]}~b$ the result of that instruction would replace the value of $a$ or $b$ with the result.
    29 Thus whenever the subsequent instructions used the value of both $a$ and $b$, one of them had to be either reconstructed,
    30 or an additional store and load operation was required to recover that value.
    31 Utilizing the new VEX instruction coding scheme \textbf{[citation needed]},
    32 Intel now allows the use of non-destructive three-operand operations in their SSE and AVX instruction sets.
    33 As shown in Figure \ref{insmix}, the total number of non-bitwise logic SIMD operations, which involve many memory movements is 32\% to 34\% less.
    34 Simply enabling three-operand form on the existing 128-bit SSE instructions reduced the overall cycle count by between 11.7\% and 13.5\%, which is shown in Figure \ref{avx}.
    35 While this is a one-time savings, it provided a significant performance improvement that traditional parsers cannot leverage since they cannot benefit from the three-operand form designed for SIMD instruction set and as shown in Figure \ref{insmix}, the total number of non-vector instructions does not change.
     28In addition to the introduction of 256-bit operations, AVX technology
     29also makes a change in the structure of the base SSE instructions,
     30moving from a destructive 2-operand form long used with SSE technologies
     31to a nondestructive 3-operand form.   In the 2-operand form,
     32one register is used as both a source and
     33destination register, equivalent to the assignment $a = a~\texttt{[op]}~b$.
     34Thus, whenever subsequent instructions use the values of both $a$ and $b$,
     35one of them must be copied beforehand, or reconstituted or reloaded
     36afterwards, in order to recover the value.
     37With 3-operand form, output may be directed to a third register independent
     38of the source operands, as reflected by the assignment $c = a~\texttt{[op]}~b$.
     39By avoiding the copying or reconstituting of operand values, a considerable
     40reduction in instruction count may be possible.
     41AVX technology makes the 3-operand form available both with the new 256-bit
     42operations and with the base 128-bit operations of SSE.
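
The copy overhead described above can be illustrated with a small counting model. This is a minimal sketch, not the authors' measurement method: it assumes each name is assigned only once, that the destructive form always overwrites the first source operand, and that every operation and every register copy costs one instruction. The function names and the example sequence are hypothetical.

```python
def two_operand_count(ops, live_out=()):
    """Count instructions when each (dst, a, b) is lowered to a = a [op] b.

    Assumption: names are assigned once, and the first source is overwritten.
    A register copy is needed when a must survive the operation, that is,
    when dst differs from a and a is read later or is live at the end.
    """
    count = 0
    for i, (dst, a, b) in enumerate(ops):
        later_reads = {s for (_, x, y) in ops[i + 1:] for s in (x, y)}
        a_still_needed = a in later_reads or a in live_out
        if dst != a and a_still_needed:
            count += 1  # extra copy to preserve the value of a
        count += 1      # the operation itself
    return count


def three_operand_count(ops):
    """With c = a [op] b, no source is overwritten, so no copies are needed."""
    return len(ops)


# Hypothetical sequence: t1 = a op b; t2 = a op c; t3 = t1 op t2
example_ops = [("t1", "a", "b"), ("t2", "a", "c"), ("t3", "t1", "t2")]
```

In this model the first operation needs a copy because $a$ is read again by the second operation, so the 2-operand lowering takes four instructions where the 3-operand form takes three.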
    3643
    3744\subsection{256-bit Operations}
    3845
    39 The AVX instruction set provided on the Sandy Bridge allows the use of 256-bit SIMD registers.
    40 Ideally, we only need half of the SIMD instructions compared with the version that uses SSE instruction set (three-operand form).
    41 Therefore, Parabix2 should be able to achieve 50\% performance improvement on SIMD operations, which means 26\% to 38\% improvement of total processing time simply by using AVX intruction set instead of SSE instruction set.
    42 However, Intel focused on implementing floating point operations as opposed to the integer based operations, we only gain from bitwise logic operations and SIMD loading operations.
    43 As shown in Figure \ref{insmix}, the total number of SIMD instructions executed with AVX instruction set is 71\% to 79\% of the SIMD instructions with SSE instruction set.
    44 The number of bitwise logic operations, which is expected to be 50\% less, only goes down by 33\% to 39\% because they are used to simulate some other 256-bit operations that exsit on SSE but is not provided by AVX instruction set.
    45 As the total number of instructions goes down by 11\% to 23\%, we should be able to see less processing time and better performance.
    46 However, as shown in Figure \ref{avx}, the processing time is longer except the one with 23\% less instructions.
    47 The reason is that AVX instruction has longer latency. (cite Agner Fog?)
     46With the introduction of 256-bit SIMD registers with AVX technology,
     47one might ideally expect up to a 50\% reduction in the instruction
     48count for the SIMD workload of Parabix2.   However, in the Sandy Bridge
     49implementation, Intel has focused on implementing floating point
     50operations as opposed to the integer-based operations.  That is,
     51256-bit SIMD is available for loads, stores, bitwise logic and
     52floating point operations, while SIMD integer operations and shifts are
     53only available in 128-bit form.   Nevertheless, with loads, stores
     54and bitwise logic comprising a major portion of the Parabix2
     55SIMD instruction mix, a substantial reduction in instruction count
     56and consequent performance improvement was anticipated.
    4857
     58\subsection{Performance Results}
     59
     60We implemented two versions of Parabix2 using AVX technology.   The first
     61was simply the recompilation of the existing Parabix2 source code
     62to take advantage of the 3-operand form of AVX instructions while retaining
     63a uniform 128-bit SIMD processing width.  The second involved rewriting
     64core library functions for Parabix2 to use 256-bit AVX operations wherever
     65possible and to simulate the remaining operations using pairs of 128-bit
     66operations.   
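
The second version's strategy can be sketched with Python integers standing in for SIMD registers. This is only an illustration under stated assumptions: the Parabix2 library functions themselves are not shown here, so all names below are hypothetical, and a field-wise 64-bit addition is used as a representative integer operation that is available only in 128-bit form.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def pack(lanes):
    """Pack a list of 64-bit lane values (lane 0 lowest) into one integer."""
    return sum((v & MASK64) << (64 * i) for i, v in enumerate(lanes))


def add64_128(x, y):
    """Model of a 128-bit SIMD add on two independent 64-bit fields."""
    lo = ((x & MASK64) + (y & MASK64)) & MASK64
    hi = ((x >> 64) + (y >> 64)) & MASK64
    return (hi << 64) | lo


def add64_256_simulated(x, y):
    """Simulate the 256-bit add with a pair of 128-bit adds on the halves."""
    x_lo, x_hi = x & MASK128, x >> 128
    y_lo, y_hi = y & MASK128, y >> 128
    return (add64_128(x_hi, y_hi) << 128) | add64_128(x_lo, y_lo)


def and_256(x, y):
    """Bitwise logic needs no splitting: one operation covers all 256 bits."""
    return x & y


x = pack([MASK64, 1, 2, 3])
y = pack([1, 1, 1, 1])
result = add64_256_simulated(x, y)
```

In this model the simulated operation costs two 128-bit operations plus the work of splitting and recombining the halves, whereas the bitwise operation is a single 256-bit operation.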
     67
     68Figure \ref{insmix} shows the reduction in instruction
     69counts achieved in these two versions.   For each workload, the
     70base instruction count of the Parabix2 binary compiled in SSE-only
     71mode is shown with the label ``sse,'' the version obtained by
     72simple recompilation with AVX-mode enabled is labeled ``avx 128-bit,''
     73and the version reimplemented to use 256-bit operations wherever
     74possible is labeled ``avx 256-bit.''    The instruction counts
     75are divided into three classes.  The ``non-SIMD'' operations
     76are the general purpose instructions that use neither SSE nor
     77AVX technology.   The ``bitwise SIMD'' class comprises
     78the bitwise logic operations, which are available in both 128-bit
     79form and 256-bit form.  The ``other SIMD'' class comprises
     80all other SIMD operations, consisting primarily of the integer SIMD
     81operations that are available only at 128-bit widths even with
     82256-bit AVX technology.
     83
     84Note that, in each workload, the number of non-SIMD instructions
     85remains relatively constant across the three versions.  As may be
     86expected, however, the number of ``bitwise SIMD'' operations
     87remains the same for both SSE and AVX 128-bit while dropping
     88dramatically when operating on 256 bits at a time.   Ideally,
     89one may expect up to a 50\% reduction in these instructions versus
     90the 128-bit AVX.  The actual reduction measured was 32\%--39\%,
     91depending on workload.   Because some bitwise logic is needed
     92in the implementation of simulated 256-bit operations, the full 50\%
     93reduction in bitwise logic was not achieved.
     94
     95The ``other SIMD'' class shows a substantial 30\%--35\% reduction
     96with AVX 128-bit technology compared to SSE.  This reduction is
     97due to the elimination of copies or reloads when SIMD operations
     98are compiled using 3-operand AVX form versus 2-operand SSE form.
     99A further 10\%--20\% reduction is observed with the Parabix2 version
     100rewritten to use 256-bit operations.
     101
     102While the successive reductions in SIMD instruction counts are quite
     103dramatic with the two AVX implementations of Parabix2, the performance
     104benefits are another story.   As shown in Figure \ref{avx}, the
     105benefits of the reduced SIMD instruction count are achieved only
     106in the AVX 128-bit version.  In this case, the benefits of 3-operand
     107form seem to fully translate to performance benefits.   Surprisingly,
     108however, the performance of Parabix2 in the 256-bit AVX implementation
     109does not improve significantly and actually degrades for files with
     110higher markup density.  We believe that this is primarily due to
     111the current AVX implementation in Sandy Bridge, with significant
     112latency in many of the 256-bit instructions in comparison to their
     113128-bit counterparts.   If these latency issues can be addressed
     114in future AVX implementations, further substantial performance
     115and energy benefits could be realized in XML parsing with Parabix2.