Changeset 3500

Sep 15, 2013, 3:16:08 PM

AVX section written up

1 edited


  • docs/Working/re/avx2.tex

\section{SIMD Scalability}\label{sec:AVX2}

Although commodity processors have provided 128-bit SIMD operations
for more than a decade, the extension to 256-bit integer SIMD operations
has only recently taken place, with the availability of AVX2
instructions in Intel Haswell architecture chips as of mid-2013.
This provides an excellent opportunity to assess the scalability
of the bitwise data-parallel approach to regular expression matching.

For the most part, adapting the Parabix tool chain to the new AVX2
instructions was straightforward.  This mostly involved regenerating
library functions using the new AVX2 intrinsics.  There were minor
issues in the core transposition algorithm because the doublebyte-to-byte
pack instructions are confined to independent operation within two
128-bit lanes.
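The lane-confinement issue can be illustrated with a scalar model; the code below is an illustration only, not the Parabix transposition code.  It mimics the behaviour of the AVX2 pack operation ({\tt \_mm256\_packus\_epi16}), which packs within each 128-bit lane independently, together with the cross-lane permutation needed to restore natural byte order.

```cpp
#include <array>
#include <cstdint>
#include <cassert>
#include <utility>

// Scalar model (an illustration, not the actual intrinsic) of the AVX2
// doublebyte-to-byte pack (_mm256_packus_epi16).  The instruction operates
// on two independent 128-bit lanes, so the output bytes come out
// lane-interleaved: a[0..7], b[0..7], a[8..15], b[8..15].
std::array<uint8_t, 32> model_pack(const std::array<uint16_t, 16> &a,
                                   const std::array<uint16_t, 16> &b) {
    auto sat = [](uint16_t v) -> uint8_t { return v > 255 ? 255 : uint8_t(v); };
    std::array<uint8_t, 32> out{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 8; ++i) {
            out[lane * 16 + i]     = sat(a[lane * 8 + i]); // low half of lane
            out[lane * 16 + 8 + i] = sat(b[lane * 8 + i]); // high half of lane
        }
    return out;
}

// Restoring the natural order a[0..15], b[0..15] takes a cross-lane shuffle
// of the 64-bit quarters into order (0,2,1,3), as _mm256_permute4x64_epi64
// with control 0xD8 would perform.
std::array<uint8_t, 32> fixup(std::array<uint8_t, 32> v) {
    for (int i = 0; i < 8; ++i)
        std::swap(v[8 + i], v[16 + i]); // swap quarters 1 and 2
    return v;
}
```

Under SSE2, a single 128-bit pack already produces the bytes of its two sources in natural order; under AVX2, the extra cross-lane permutation is the minor fix-up referred to above.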

\subsection{AVX Stream Addition}
\begin{figure*}[tbh]
\begin{center}
\begin{verbatim}
void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {
  bitblock256_t all_ones = simd256<1>::constant<1>();
  bitblock256_t gen = simd_and(x, y);
  bitblock256_t prop = simd_xor(x, y);
  bitblock256_t partial_sum = simd256<64>::add(x, y);
  bitblock256_t carry = simd_or(gen, simd_andc(prop, partial_sum));
  bitblock256_t bubble = simd256<64>::eq(partial_sum, all_ones);
  uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in);
  uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
  uint64_t carry_scan_thru_bubbles = (carry_mask + bubble_mask) &~ bubble_mask;
  uint64_t increments = carry_scan_thru_bubbles | (carry_scan_thru_bubbles - carry_mask);
  carry_out = convert(increments >> 4);
  uint64_t spread = 0x0000200040008001 * increments & 0x0001000100010001;
  sum = simd256<64>::add(partial_sum, _mm256_cvtepu16_epi64(avx_select_lo128(convert(spread))));
}
\end{verbatim}
\end{center}
\caption{AVX2 256-bit Addition}
\label{fig:AVX2add}
\end{figure*}

Bitstream addition at the 256-bit block size was implemented using the
long-stream addition technique.  Figure \ref{fig:AVX2add} shows our
implementation.  Spreading bits from the calculated increments mask
was achieved somewhat awkwardly, using a 64-bit multiply to spread the bits
into 16-bit fields, followed by a SIMD zero extension of the 16-bit fields
to 64 bits each.
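The carry-scan logic can be cross-checked with a portable scalar sketch (ours, not the Parabix library code), in which four {\tt uint64\_t} limbs stand in for the 256-bit register:

```cpp
#include <cstdint>
#include <cassert>

// Scalar sketch of the long-stream addition algorithm: a limb whose partial
// sum is all ones is a "bubble" through which an incoming carry propagates,
// and a single 64-bit add-and-mask scans every carry through its run of
// bubbles at once.
void add_ci_co_model(const uint64_t x[4], const uint64_t y[4],
                     unsigned carry_in, unsigned &carry_out, uint64_t sum[4]) {
    uint64_t partial[4];
    uint64_t carry_mask = carry_in; // bit 0: carry in; bit i+1: carry out of limb i
    uint64_t bubble_mask = 0;       // bit i set when limb i's partial sum is all ones
    for (int i = 0; i < 4; ++i) {
        partial[i] = x[i] + y[i];
        uint64_t gen  = x[i] & y[i];                 // positions generating a carry
        uint64_t prop = x[i] ^ y[i];                 // positions propagating a carry
        uint64_t carry = gen | (prop & ~partial[i]); // MSB = carry out of this limb
        carry_mask  |= (carry >> 63) << (i + 1);
        bubble_mask |= (uint64_t)(partial[i] == ~0ULL) << i;
    }
    // Scan each carry through any run of bubble limbs in one 64-bit addition,
    // then mark every limb the carry passed through or landed in.
    uint64_t thru = (carry_mask + bubble_mask) & ~bubble_mask;
    uint64_t increments = thru | (thru - carry_mask);
    carry_out = (increments >> 4) & 1;
    // The multiply trick from the figure: spread bits 0..3 of increments
    // into the low bit of each 16-bit field, giving a per-limb +1.
    uint64_t spread = (0x0000200040008001ULL * increments) & 0x0001000100010001ULL;
    for (int i = 0; i < 4; ++i)
        sum[i] = partial[i] + ((spread >> (16 * i)) & 1);
}
```

The multiplier places bit $i$ of {\tt increments} at bit $16i$ of {\tt spread}; in the SIMD version those four 16-bit fields are then zero-extended to 64 bits and added to the partial sums.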

We also compiled new versions of the {\tt grep} and {\tt nrgrep} programs
using the {\tt -march=core-avx2} flag, in case the compiler was able
to vectorize some of the code.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=AVX Instruction Reduction,
xticklabels={@,Date,Email,URIorEmail,xquote},
tick label style={font=\tiny},
enlarge x limits=0.15,
enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
\addplot[fill=black]
file {data/avxinstructions1.dat};
\addplot[fill=gray]
file {data/avxinstructions2.dat};
\addplot[fill=white]
file {data/avxinstructions3.dat};
\legend{Bitstreams,NRGrep,Grep}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Instruction Reduction}\label{fig:AVXInstrReduction}
\end{figure}

Figure \ref{fig:AVXInstrReduction} shows the reduction in instruction
count achieved for each of the applications.  Working at a block
size of 256 bytes at a time rather than 128 bytes at a time,
the bitstreams implementation scaled dramatically well, with reductions in
instruction count of over a factor of two in each case.  Although a factor
of two would seem to be the outside limit, we attribute the additional
reduction to greater instruction efficiency.
AVX2 instructions use a nondestructive three-operand
form instead of the destructive two-operand form of SSE2.
In the two-operand form, binary instructions must always use
one of the source registers as a destination register.  As a
result, the SSE2 object code includes many data movement operations
that are unnecessary with the AVX2 set.

As expected, there was no observable reduction in instruction
count with the recompiled {\tt grep} and {\tt nrgrep} applications.

\caption{AVX Speedup}\label{fig:AVXSpeedup}

As shown in Figure \ref{fig:AVXSpeedup}, the reduction in
instruction count was reflected in a considerable speed-up
in the bitstreams implementation.  However, the speed-up was
considerably less than expected.  As shown in Figure \ref{fig:AVXIPC},
the AVX2 version has lost some of the superscalar efficiency
of the SSE2 code.  This is a performance debugging issue
that we have yet to resolve.
\caption{Change in Instructions Per Cycle With AVX}\label{fig:AVXIPC}

Overall, the results on our AVX2 machine were quite good,
demonstrating strong scalability of the bitwise data-parallel approach.