Changeset 1410 for docs


Timestamp: Aug 31, 2011, 5:57:04 PM
Author: ksherdy
Message: edits
Files: 1 edited

--- docs/HPCA2012/07-avx.tex (r1407)
+++ docs/HPCA2012/07-avx.tex (r1410)
@@ -2,26 +2,24 @@
 \label{section:avx}
 In this section, we discuss the scalability and performance advantages
-of our 256-bit AVX (Advanced Vector Extensions) Parabix XML port.  The
-Parabix SIMD libraries originally targeted the 128-bit SSE2 SIMD
-technology available on all modern 64-bit Intel and AMD processors but
-has recently been ported to AVX. AVX technology is commercially
+of our 256-bit AVX (Advanced Vector Extensions) Parabix-XML port.  The
+Parabix runtime libraries originally targeted the 128-bit SSE2 SIMD
+technology, available on all modern 64-bit Intel and AMD processors.
+It was recently been ported to AVX, which is commercially
 available on the latest the \SB\ microarchitecture Intel
-processors. While we have to port our runtime framework the
-application didn't need to be modified.
+processors. Although the Parabix runtime framework had to be ported to
+the new ISA, no modifications to Parabix-XML itself were needed.
 
-
-\paragraph{3-Operand Form}
-In addition to the widening of 128-bit operations to 256-bit
-operations, AVX technology uses a nondestructive 3-operand instruction
+\subsection{3-Operand Form}
+In addition to widening the 128-bit operations to 256-bit,
+ AVX technology uses a nondestructive 3-operand instruction
 format. Previous SSE implementations used a destructive 2-operand
 instruction format. In the 2-operand format a single register is used
-as both a source and destination register. For example, $a =
-a~\texttt{[op]}~b$.  As such, 2-operand instructions that require the
+as both a source and destination register. As such, 2-operand instructions that require the
 value of both $a$ and $b$, must either copy an additional register
 value beforehand, or reconstitute or reload a register value
 afterwards to recover the value.  With the 3-operand format, output
 may now be directed to the third register independently of the source
-operands. For example, $c = a~\texttt{[op]}~b$.  By avoiding the
-copying or reconstituting of operand values, a considerable reduction
+operands. By avoiding the need to
+copy or reconstitute operand values, a considerable reduction
 in instructions required for unloading from and loading into
 registers.  AVX technology makes available the 3-operand form for both
     
@@ -41,21 +39,22 @@
 \subsection{Performance Results}
 
-We implemented two versions of Parabix using AVX technology.  The
-first was simply the recompilation of the existing Parabix source code
-written to take advantage of the 3-operand form of AVX instructions
+We implemented two versions of Parabix-XML using AVX technology.  The
+first was simply the recompilation of the existing Parabix-XML source code
+to take advantage of the 3-operand form of AVX instructions
 while retaining a uniform 128-bit SIMD processing width.  The second
-involved rewriting the internal library functions of Parabix to
-leverage the 256-bit AVX operations wherever possible and to simulate
-the remaining operations using pairs of 128-bit operations.Figure
+involved rewriting the Parabix runtime library to
+leverage the 256-bit AVX instructions wherever possible and to simulate
+the remaining operations using pairs of 128-bit operations. Figure
 \ref{insmix} shows the reduction in instruction counts achieved in
-these two versions.  For each workload, the base instruction count of
-the Parabix binary compiled in SSE-only mode is indicated by ``sse,''
-the version which only takes advantage of the AVX 3-operand mode is
-labeled ``128-bit avx,'' and the version reimplemented to use 256-bit
+these two versions. For each workload, the base instruction count of
+the Parabix binary compiled in 2-operand SSE-only mode is indicated by ``sse;''
+the version that only takes advantage of the AVX 3-operand mode is
+labeled ``128-bit avx,'' and the version uses the 256-bit
 operations wherever possible is labelled ``256-bit avx.''  The
 instruction counts are divided into three classes: ``non-SIMD''
 operations are the general purpose instructions.  The ``bitwise SIMD''
 class comprises the bitwise logic operations, that are available in
-both 128-bit form and 256-bit form.  The ``other SIMD'' class
+both 128-bit form and 256-bit form --- excluding bitwise shifts which are
+only available in 128-bit form.  The ``other SIMD'' class
 comprises all other SIMD operations, primarily comprising the integer
 SIMD operations that are available only at 128-bit widths even under
     
@@ -80,17 +79,15 @@
 
 Note that, in each workload, the number of non-SIMD instructions
-remains relatively constant with each workload.  As may be expected
-the number of \textit{bit-parallel SIMD} operations remains the same
+remains relatively constant with each workload.  As expected,
+the number of bitwise SIMD operations remains the same
 for both SSE and 128-bit while dropping dramatically when operating
-256-bits at a time.  The reduction measured was 32\%--39\% depending
-on workload because some bitwise logic needed in implementation is
-composed of 128-bit operations. The limits the performance gains
-achieved when using the AVX instructions.  The ``other SIMD'' class
-shows a substantial 30\%-35\% reduction with AVX 128-bit technology
-compared to SSE.  This reduction is due to elimination of register
+256-bits at a time. The reduction was measured at 32\%--39\% depending
+on markup density of the workload. The ``other SIMD'' class
+shows a substantial 30\%--35\% reduction with AVX 128-bit technology
+compared to SSE. This reduction is due to elimination of register
 unloading and reloading when SIMD operations are compiled using
 3-operand AVX form versus 2-operand SSE form.  A further 10\%--20\%
-reduction is observed with Parabix version rewritten to use 256-bit
-operations.
+reduction is also observed when Parabix-XML utilized the AVX runtime
+library.
 
 %[AS] Check numbers.
     
@@ -105,5 +102,5 @@
 bizarrely, the performance of Parabix in the 256-bit AVX
 implementation does not improve significantly and actually degrades
-for files with higher markup density (average 11\%). dew.xml, on
+for files with higher markup density ($\sim11\%$). dew.xml, on
 which bitwise-SIMD instructions reduced by 39\%, saw a performance
 improvement of 8\%.  We believe that this is primarily due to the
     
@@ -113,4 +110,4 @@
 different scheduling constraints that seem to reduce overall
 throughput.  If these latency issues can be addressed in future AVX
-implementations, further substantial performance and energy benefits
-could be realized in XML parsing with Parabix.
+implementations, further performance and energy benefits
+could be realized in Parabix-XML.
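
Illustration (not part of the changeset): the edited text contrasts the destructive 2-operand form with the nondestructive 3-operand form, and describes a 256-bit library that uses 256-bit instructions where they exist and simulates the rest with pairs of 128-bit operations. A minimal portable C sketch of both ideas follows. Everything in it is an assumption made for illustration: the types `v128` and `v256` and the functions `and256_2op`, `and256`, `add64x2`, and `add64x4_sim` are hypothetical names that do not come from the Parabix sources, and plain 64-bit integers stand in for SIMD registers so the code runs without AVX hardware.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SIMD registers (not Parabix types). */
typedef struct { uint64_t w[2]; } v128;   /* models a 128-bit register */
typedef struct { v128 lo, hi; } v256;     /* models a 256-bit register */

/* 2-operand, destructive style: a = a & b. The old value of *a is lost,
 * so a caller that still needs it must copy it beforehand. */
static void and256_2op(v256 *a, const v256 *b) {
    a->lo.w[0] &= b->lo.w[0];
    a->lo.w[1] &= b->lo.w[1];
    a->hi.w[0] &= b->hi.w[0];
    a->hi.w[1] &= b->hi.w[1];
}

/* 3-operand, nondestructive style: c = a & b. Both sources survive.
 * Bitwise logic is modelled here as a single full-width operation. */
static v256 and256(v256 a, v256 b) {
    v256 c;
    c.lo.w[0] = a.lo.w[0] & b.lo.w[0];
    c.lo.w[1] = a.lo.w[1] & b.lo.w[1];
    c.hi.w[0] = a.hi.w[0] & b.hi.w[0];
    c.hi.w[1] = a.hi.w[1] & b.hi.w[1];
    return c;
}

/* A 128-bit integer operation: lane-wise add of two 64-bit fields,
 * each wrapping modulo 2^64. */
static v128 add64x2(v128 a, v128 b) {
    v128 c;
    c.w[0] = a.w[0] + b.w[0];
    c.w[1] = a.w[1] + b.w[1];
    return c;
}

/* In this model integer adds exist only at 128-bit width, so the
 * 256-bit add is simulated with a pair of 128-bit operations. */
static v256 add64x4_sim(v256 a, v256 b) {
    v256 c;
    c.lo = add64x2(a.lo, b.lo);
    c.hi = add64x2(a.hi, b.hi);
    return c;
}
```

With the destructive form, a caller that needs `a` after the operation has to write `v256 t = a; and256_2op(&t, &b);`, which is the extra register copy the 3-operand form avoids. The simulated add issues two half-width operations per call, so in this model it would yield no instruction-count saving over 128-bit code.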