Changeset 3472

Sep 13, 2013, 4:49:59 PM (6 years ago)

Long-stream addition; clean-out old version.

6 deleted
1 edited


  • docs/Working/re/ppopp-re.tex

    r3471 r3472  
    174174addition technique involving a further application of MatchStar
    175175that enables us to scale the technique to $n$-bit addition
    176 in $log_{64} n$ steps.   We ultimately apply this technique,
     176in $\lceil\lg_{64}{n}\rceil)$ steps.   We ultimately apply this technique,
    177177for example, to perform
    178178synchronized 4096-bit addition on GPU warps of 64 threads.
    230230case, we rely on the Parabix tool chain \cite{lin2012parabix}
    231231to handle the details of compilation to block-by-block processing.
    232 As weh show later, however, we have also adapted Parabix technology to processing
    233 blocks of 4K bytes at time in our GPGPU implementation, relying
    234 on our implementation of 4096-bit addition implemented using
    235 SIMT processing of 64 threads.
     232As we show later, however, we have also adapted Parabix technology to processing
     233blocks of 4K bytes at time in our GPGPU implementation,
     234relying on the application of our long-stream addition technique
     235to perform 4096-bit additions using 64 threads working in lock-step
     236SIMT fashion each on 64-bit processors.
    237238A key concept in this streaming approach is the derivation of bit streams
    249250streams such that each bit $j$ of the stream specifies
    250251whether character $j$ of the input stream is in the class
    251 or not.   {\bf perhaps a figure here.}
    254 \section{Block-at-a-Time Processing}\label{sec:blockwise}
     252or not.  Figure \ref{fig:streams} shows an example of an
     253input byte stream in ASCII, the eight basis bit streams of the
     254transposed representation, and several character class bit streams
     255that may be computed from the basis bit streams using bitwise logic.
     256Transposition and character class construction are straightforward
     257using the Parabix tool chain \cite{lin2012parabix}.
     259\paragraph*{Marker Streams.}  Now consider how bit-parallel data
     260streams can be used in regular expression matching.   Consider
     261the problem of searching the input stream of Figure \ref{fig:streams}
     262to finding occurrence of strings matching
     263the regular expression \verb:a[0-9]*z:.
     264The matching process involves the concept of {\em marker streams}, that
     265is streams that mark the positions of current matches during the
     266overall process.  In this case there are three marker streams computed
     267during the match process, namely,
     268$M_1$ representing match positions after an initial \verb:a:
     269character has been found, $M_2$ representing positions
     270reachable from positions marked by $M_1$ by further matching zero or
     271more digits (\verb:[0-9]*:) and finally $M_3$ the stream
     272marking positions after a final \verb:z: has been found.
     273Without describing the details of how these streams are computed
     274for the time being, Figure \ref{fig:streams} shows what each
     275of these streams should be for our example matching problem.
     276Note our convention that a marker stream contains a 1 bit
     277at the next character position to be matched, that is,
     278immediately past the last position that was matched.
     282MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
     284Figure \ref{fig:matchstar} illustrates the MatchStar method.  The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data.
     286In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.  $T_1$ has 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
     287output marker stream is obtained by ORing $T_2$ with the initial marker stream.
     292source data & \verb`--142578---125-231-----127--5394---94761205-`\\
     293$M_0$ & \verb`.......1......1..1..1...1.............1..1..`\\
     294$D = $\verb:[0-9]: & \verb`..111111...111.111.....111..1111...11111111.`\\
     295$T_0 = M_0 \wedge D$ & \verb`.......1.........1......1.............1..1..`\\
     296$T_1 = T_0 + D$ & \verb`.1.........1111.......1..1..1111..1...1...1.`\\
     297$T_2 = T_1 \oplus D$ & \verb`.1111111......1111....111.........1111.111..`\\
     298$M_1 = T_2 \, | \, M_0$ & \verb`.1111111......1111..1.111.........11111111..`
     302\caption{Match Star}
     307In general, given a marker stream $M$ and a character class stream $C$,
     308the operation of MatchStar is defined by the following equation. 
     309\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
     310Given a set of initial marker positions, the result stream marks
     311all possible positions that can be reached by 0 or more occurrences
     312of characters in class $C$ from each position in $M$. 
     317\section{Block-at-a-Time Processing}\label{sec:blockwise}
     321\paragraph*{Long-Stream Addition.}  The maximum word size for
     322addition on commodity processors is typically 64 bits.  In order
     323to implement long-stream addition for block sizes of 256 or larger,
     324a method for propagating carries through the individual stages of
     32564-bit addition is required.  However, the normal technique of
     326sequential addition using add-with-carry instructions, for example,
     327is far from ideal.
     329We have developed a technique using SIMD or SIMT methods for constant-time
     330long-stream addition up to 4096 bits.   
     331We assume the availability of the following SIMD/SIMT operations
     332operating on vectors of $f$ 64-bit fields.
     334\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
     335in two vectors to produce a result vector of $f$ 64-bit fields.
     336\item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields
     337of \verb:x: each with the constant value -1 (all bits 1), producing
     338an $f$-bit mask value
     339\item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit
     340field into a single compressed $f$-bit mask value, and
     341\item  \verb#simd<64>::spread(x)# : distributing the bits of
     342an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector, and
     343\item normal bitwise logic operations $f$ bit masks.
     346Given these operations, our method for long stream addition of
     347two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
     349\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
     350\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
     352\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
     353\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
     354\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
     355\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
     357\item Compute an $f$-bit mask of carries generated for each of the
     35864-bit additions of \verb:X: and \verb:Y:.
     359\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
     361\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
     362an incoming carry bit. 
     363\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
     365\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
     366incremented to produce the final sum.  This is just MatchStar.
     367\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
     369\item Computed the final result {\tt Z}.
     370\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
     374Step 5 is the key step to determine which field-values of {\tt R}
     375must be incremented due to incoming carries.  Note that {\tt c}
     376is the bit mask of outgoing carries, so the multiplication by 2
     377is necessary to move each carry to its incoming position.
     378At the incoming position, the carry will require an increment
     379to the corresponding field in {\tt R}, as well as all further
     380fields that are reachable by carry propagation through a
     381consecutive run of ones.   
     383Figure \ref{fig:longadd} shows the operation of long-stream addition
     384through an example.
Note: See TracChangeset for help on using the changeset viewer.