Changeset 3625 for docs


Ignore:
Timestamp:
Feb 18, 2014, 4:45:09 AM (5 years ago)
Author:
cameron
Message:

Updates for stream addition

Location:
docs/Working/re
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/avx2.tex

    r3617 r3625  
    1717
    1818
    19 \paragraph*{AVX2 Stream Addition}
    20  \begin{figure*}[tbh]
     19\paragraph*{AVX2 256-Bit Addition}
     20 \begin{figure}[tbh]
    2121
    2222\begin{center} \small
    2323\begin{verbatim}
    24 void add_ci_co(bitblock_t x, bitblock_t y, carry_t carry_in, carry_t & carry_out, bitblock_t & sum) {
    25   bitblock_t all_ones = simd256<1>::constant<1>();
    26   bitblock_t gen = simd_and(x, y);
    27   bitblock_t prop = simd_xor(x, y);
    28   bitblock_t partial_sum = simd256<64>::add(x, y);
    29   bitblock_t carry = simd_or(gen, simd_andc(prop, partial_sum));
    30   bitblock_t bubble = simd256<64>::eq(partial_sum, all_ones);
    31   uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in);
    32   uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
    33   uint64_t carry_scan_thru_bubbles = (carry_mask + bubble_mask) &~ bubble_mask;
    34   uint64_t increments = carry_scan_thru_bubbles | (carry_scan_thru_bubbles - carry_mask);
    35   carry_out = convert(increments >> 4);
    36   uint64_t spread = 0x0000200040008001 * increments & 0x0001000100010001;
    37   sum = simd256<64>::add(partial_sum, _mm256_cvtepu16_epi64(avx_select_lo128(convert(spread))));
     24bitblock_t spread(uint64_t bits) {
     25  uint64_t s = 0x0000200040008001 * bits;
     26  uint64_t t = s & 0x0001000100010001;
     27  return _mm256_cvtepu16_epi64(t);
    3828}
    3929\end{verbatim}
    4030\end{center}
    41 \caption{AVX2 256-bit Addition}
    42 \label{fig:AVX2add}
     31\caption{AVX2 256-bit Spread}
     32\label{fig:AVX2spread}
    4333
    44 \end{figure*}
     34\end{figure}
    4535
    4636Bitstream addition at the 256-bit block size was implemented using the
    47 long-stream addition technique.   Figure \ref{fig:AVX2add} shows our
    48 implementation.  Spreading bits from the calculated increments mask
    49 was achieved somewhat awkwardly with a 64-bit multiply to spread
    50 into 16-bit fields followed by SIMD zero extend of the 16-bit fields
    51 to 64-bits each.
     37long-stream addition technique.   The AVX2 instruction set directly
     38supports the \verb#hsimd<64>::mask(X)# operation using
     39the \verb#_mm256_movemask_pd#  intrinsic, extracting
     40the required 4-bit mask directly from the 256-bit vector.
     41The \verb#hsimd<64>::spread(X)# is slightly more
     42problematic, requiring a short sequence of instructions
     43to convert the computed 4-bit increment mask back
     44into a vector of 4 64-bit values.   One method is to
     45use the AVX2 broadcast instruction to make 4 copies
     46of the mask to be spread, followed by appropriate
     47bit manipulation.   Another uses multiplication to
     48first spread to 16-bit fields as shown in Figure \ref{fig:AVX2spread}.
    5249
    53 We also compiled new versions of the {\tt grep} and {\tt nrgrep} programs
     50We also compiled new versions of the {\tt egrep} and {\tt nrgrep} programs
    5451using the {\tt -march=core-avx2} flag in case the compiler is able
    5552to vectorize some of the code.
  • docs/Working/re/re-main.tex

    r3624 r3625  
    433433is far from ideal.
    434434
    435 We have developed a technique using SIMD methods for constant-time
     435We have developed a general model using SIMD methods for constant-time
    436436long-stream addition up to 4096 bits.   
    437437We assume the availability of the following SIMD/SIMT operations
     
    450450\end{itemize}
    451451
     452In this model, the \verb#hsimd<64>::mask(X)# and
     453\verb#simd<64>::spread(x)# model the minimum
     454communication requirements between the parallel processing units
     455(SIMD lanes or SIMT processors).    In essence, we just need
     456the ability to quickly send and receive 1 bit of information
     457per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
     458gathers 1 bit from each of the processors to a central resource.
     459After calculations on the gather bits are performed, we then
     460just need an operation to invert the communication, i.e.,
     461sending 1 bit each from the central processor to each of
     462the parallel units.   There are a variety of ways in which
     463these facilities may be implemented depending on the
     464underlying architecture; details of our AVX2 and GPU implementations
     465are presented later.   
     466
    452467Given these operations, our method for long stream addition of
    453468two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
     
    471486\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
    472487incremented to produce the final sum.  Here we find a new application of
    473 MatchStar!
     488MatchStar.
    474489\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
    475490
     
    481496will generate another carry.  In fact, if there is a sequence of
    482497digits that are all ones, then the carry must bubble through
    483 each of them.   This is just MatchStar!
     498each of them.   This is just MatchStar.
    484499
    485500\item Compute the final result {\tt Z}.
     
    538553of addition as a primitive and its particular application to regular
    539554expression matching as shown herein, it seems reasonable to expect
    540 such instructions to become available.
    541 
     555such instructions to become available.    Alternatively, it may
     556be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)#
     557\verb#simd<64>::spread(X)# operations are efficiently supported.
    542558
    543559
  • docs/Working/re/reference.bib

    r3522 r3625  
    255255        year={2009},
    256256        title={Is Larrabee For the Rest of Us?},
    257         journal={Dr.Dobb’s J}
     257        journal={Dr.Dobb's Journal}
    258258}
    259259
Note: See TracChangeset for help on using the changeset viewer.