Changeset 3475


Ignore:
Timestamp:
Sep 14, 2013, 7:50:36 AM (5 years ago)
Author:
cameron
Message:

Long stream addition figure

Location:
docs/Working/re
Files:
1 added
1 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/ppopp-re.tex

    r3473 r3475  
    163163This approach relies on a bitwise data parallel view of text
    164164streams as well as a surprising use of addition to match
    165 runs of characters in a single step.  The closest previous
     165runs of characters in a sin`gle step.  The closest previous
    166166work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
    167167technology together with a parallel scanning primitive also
     
    317317\section{Block-at-a-Time Processing}\label{sec:blockwise}
    318318
    319 
     319The unbounded stream model of the previous section must of course
     320be translated an implementation that proceeds block-at-a-time for
     321realistic application.  In this, we primarily rely on the Pablo
     322compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
     323statements expressed as arbitrary-length bitstream equations, Pablo
     324produces block-at-a-time C++ code that initializes and maintains all the necessary
     325carry bits for each of the additions and shifts involved in the
     326bitstream calculations.   
     327
     328In the present work, our principal contribution to the block-at-a-time
     329model is the technique of long-stream addition described below.
     330Otherwise, we were able to use Pablo directly in compiling our
     331SSE2 and AVX2 implementations.   Our GPU implementation required
     332some scripting to modify the output of the Pablo compiler for our
     333purpose.
    320334
    321335\paragraph*{Long-Stream Addition.}  The maximum word size for
     
    336350\item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields
    337351of \verb:x: each with the constant value -1 (all bits 1), producing
    338 an $f$-bit mask value
     352an $f$-bit mask value,
    339353\item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit
    340354field into a single compressed $f$-bit mask value, and
     
    360374
    361375\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
    362 an incoming carry bit. 
     376an incoming carry bit.  This is the {\em bubble mask}.
    363377\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
    364378
    365379\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
    366 incremented to produce the final sum.  This is just MatchStar.
     380incremented to produce the final sum.  Here we find a new application of
     381MatchStar!
    367382\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
    368383
    369 \item Computed the final result {\tt Z}.
     384This is the key step.  The mask {\tt c} of outgoing carries must be
     385shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
     386with the next digit.  At the incoming position, the carry will
     387increment the 64-bit digit.   However, if this digit is all ones (as
     388signalled by the corresponding bit of bubble mask {\tt b}, then the addition
     389will generate another carry.  In fact, if there is a sequence of
     390digits that are all ones, then the carry must bubble through
     391each of them.   This is just MatchStar!
     392
     393\item Compute the final result {\tt Z}.
    370394\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
    371395
    372396\end{enumerate}
    373 
    374 Step 5 is the key step to determine which field-values of {\tt R}
    375 must be incremented due to incoming carries.  Note that {\tt c}
    376 is the bit mask of outgoing carries, so the multiplication by 2
    377 is necessary to move each carry to its incoming position.
    378 At the incoming position, the carry will require an increment
    379 to the corresponding field in {\tt R}, as well as all further
    380 fields that are reachable by carry propagation through a
    381 consecutive run of ones.   
    382 
    383 Figure \ref{fig:longadd} shows the operation of long-stream addition
    384 through an example.
    385 
    386 
    387 
    388 
    389 
     397\begin{figure}
     398\begin{center}
     399\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
     400{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
     401{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
     402{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
     403{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
     404{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
     405{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
     406{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
     407{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
     408{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
     409{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
     410{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
     411\end{tabular}
     412\end{center}
     413\caption{Long Stream Addition}\label{fig:longadd}
     414\end{figure}
     415
     416Figure \ref{fig:longadd} illustrates the process.  In the figure,
     417we illustrate the process with 8-bit fields rather than 64-bit fields
     418and show all field values in hexadecimal notation.  Note that
     419two of the individual 8-bit additions produce carries, while two
     420others produce {\tt FF} values that generate bubble bits.  The
     421net result is that four of the original 8-bit sums must be
     422incremented to produce the long stream result.
     423
     424A slight extension to the process produces a long-stream full adder
     425that can be used in chained addition.  In this case, the
     426adder must take an additional carry-in bit
     427{\tt p} and produce a carry-out bit {\tt q}.
     428This may be accomplished by incorporating {\tt p}
     429in calculating the increment mask in the low bit position,
     430and then extracting the carry-out {\tt q} from the high bit position.
     431\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
     432\[\text{\tt q} = \text{\tt i >> f}\]
     433\raggedbottom
    390434\section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis}
    391435
     
    411455 \begin{figure*}[tbh]
    412456\begin{center}
    413 \begin{code}
    414 static IDISA_ALWAYS_INLINE void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {
     457\begin{verbatim}
     458void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {
    415459  bitblock256_t all_ones = simd256<1>::constant<1>();
    416460  bitblock256_t gen = simd_and(x, y);
     
    428472}
    429473
    430 \end{code}
     474\end{verbatim}
    431475
    432476\end{center}
Note: See TracChangeset for help on using the changeset viewer.