Ignore:
Timestamp:
Feb 18, 2014, 4:45:09 AM (6 years ago)
Author:
cameron
Message:

Updates for stream addition

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/avx2.tex

    r3617 r3625  
    1717
    1818
    19 \paragraph*{AVX2 Stream Addition}
    20  \begin{figure*}[tbh]
     19\paragraph*{AVX2 256-Bit Addition}
     20 \begin{figure}[tbh]
    2121
    2222\begin{center} \small
    2323\begin{verbatim}
    24 void add_ci_co(bitblock_t x, bitblock_t y, carry_t carry_in, carry_t & carry_out, bitblock_t & sum) {
    25   bitblock_t all_ones = simd256<1>::constant<1>();
    26   bitblock_t gen = simd_and(x, y);
    27   bitblock_t prop = simd_xor(x, y);
    28   bitblock_t partial_sum = simd256<64>::add(x, y);
    29   bitblock_t carry = simd_or(gen, simd_andc(prop, partial_sum));
    30   bitblock_t bubble = simd256<64>::eq(partial_sum, all_ones);
    31   uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in);
    32   uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
    33   uint64_t carry_scan_thru_bubbles = (carry_mask + bubble_mask) &~ bubble_mask;
    34   uint64_t increments = carry_scan_thru_bubbles | (carry_scan_thru_bubbles - carry_mask);
    35   carry_out = convert(increments >> 4);
    36   uint64_t spread = 0x0000200040008001 * increments & 0x0001000100010001;
    37   sum = simd256<64>::add(partial_sum, _mm256_cvtepu16_epi64(avx_select_lo128(convert(spread))));
     24bitblock_t spread(uint64_t bits) {
     25  uint64_t s = 0x0000200040008001 * bits;
     26  uint64_t t = s & 0x0001000100010001;
     27  return _mm256_cvtepu16_epi64(t);
    3828}
    3929\end{verbatim}
    4030\end{center}
    41 \caption{AVX2 256-bit Addition}
    42 \label{fig:AVX2add}
     31\caption{AVX2 256-bit Spread}
     32\label{fig:AVX2spread}
    4333
    44 \end{figure*}
     34\end{figure}
    4535
    4636Bitstream addition at the 256-bit block size was implemented using the
    47 long-stream addition technique.   Figure \ref{fig:AVX2add} shows our
    48 implementation.  Spreading bits from the calculated increments mask
    49 was achieved somewhat awkwardly with a 64-bit multiply to spread
    50 into 16-bit fields followed by SIMD zero extend of the 16-bit fields
    51 to 64-bits each.
     37long-stream addition technique.   The AVX2 instruction set directly
     38supports the \verb#hsimd<64>::mask(X)# operation using
     39the \verb#_mm256_movemask_pd#  intrinsic, extracting
     40the required 4-bit mask directly from the 256-bit vector.
     41The \verb#hsimd<64>::spread(X)# is slightly more
     42problematic, requiring a short sequence of instructions
     43to convert the computed 4-bit increment mask back
     44into a vector of 4 64-bit values.   One method is to
     45use the AVX2 broadcast instruction to make 4 copies
     46of the mask to be spread, followed by appropriate
     47bit manipulation.   Another uses multiplication to
     48first spread to 16-bit fields as shown in Figure \ref{fig:AVX2spread}.
    5249
    53 We also compiled new versions of the {\tt grep} and {\tt nrgrep} programs
     50We also compiled new versions of the {\tt egrep} and {\tt nrgrep} programs
    5451using the {\tt -march=core-avx2} flag in case the compiler is able
    5552to vectorize some of the code.
Note: See TracChangeset for help on using the changeset viewer.