# Changeset 3472 for docs/Working/re/ppopp-re.tex

Ignore:
Timestamp:
Sep 13, 2013, 4:49:59 PM (6 years ago)
Message:

 r3471 addition technique involving a further application of MatchStar that enables us to scale the technique to $n$-bit addition in $log_{64} n$ steps.   We ultimately apply this technique, in $\lceil\lg_{64}{n}\rceil)$ steps.   We ultimately apply this technique, for example, to perform synchronized 4096-bit addition on GPU warps of 64 threads. case, we rely on the Parabix tool chain \cite{lin2012parabix} to handle the details of compilation to block-by-block processing. As weh show later, however, we have also adapted Parabix technology to processing blocks of 4K bytes at time in our GPGPU implementation, relying on our implementation of 4096-bit addition implemented using SIMT processing of 64 threads. As we show later, however, we have also adapted Parabix technology to processing blocks of 4K bytes at time in our GPGPU implementation, relying on the application of our long-stream addition technique to perform 4096-bit additions using 64 threads working in lock-step SIMT fashion each on 64-bit processors. A key concept in this streaming approach is the derivation of bit streams streams such that each bit $j$ of the stream specifies whether character $j$ of the input stream is in the class or not.   {\bf perhaps a figure here.} \section{Block-at-a-Time Processing}\label{sec:blockwise} or not.  Figure \ref{fig:streams} shows an example of an input byte stream in ASCII, the eight basis bit streams of the transposed representation, and several character class bit streams that may be computed from the basis bit streams using bitwise logic. Transposition and character class construction are straightforward using the Parabix tool chain \cite{lin2012parabix}. \paragraph*{Marker Streams.}  Now consider how bit-parallel data streams can be used in regular expression matching.   Consider the problem of searching the input stream of Figure \ref{fig:streams} to finding occurrence of strings matching the regular expression \verb:a[0-9]*z:. The matching process involves the concept of {\em marker streams}, that is streams that mark the positions of current matches during the overall process.  In this case there are three marker streams computed during the match process, namely, $M_1$ representing match positions after an initial \verb:a: character has been found, $M_2$ representing positions reachable from positions marked by $M_1$ by further matching zero or more digits (\verb:[0-9]*:) and finally $M_3$ the stream marking positions after a final \verb:z: has been found. Without describing the details of how these streams are computed for the time being, Figure \ref{fig:streams} shows what each of these streams should be for our example matching problem. Note our convention that a marker stream contains a 1 bit at the next character position to be matched, that is, immediately past the last position that was matched. \paragraph*{MatchStar.} MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream. Figure \ref{fig:matchstar} illustrates the MatchStar method.  The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data. In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.  $T_1$ has 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The output marker stream is obtained by ORing $T_2$ with the initial marker stream. \begin{figure*}[tbh] \begin{center} \begin{tabular}{cr}\\ source data & \verb--142578---125-231-----127--5394---94761205-\\ $M_0$ & \verb.......1......1..1..1...1.............1..1..\\ $D =$\verb:[0-9]: & \verb..111111...111.111.....111..1111...11111111.\\ $T_0 = M_0 \wedge D$ & \verb.......1.........1......1.............1..1..\\ $T_1 = T_0 + D$ & \verb.1.........1111.......1..1..1111..1...1...1.\\ $T_2 = T_1 \oplus D$ & \verb.1111111......1111....111.........1111.111..\\ $M_1 = T_2 \, | \, M_0$ & \verb.1111111......1111..1.111.........11111111.. \end{tabular} \end{center} \caption{Match Star} \label{fig:matchstar1} \end{figure*} In general, given a marker stream $M$ and a character class stream $C$, the operation of MatchStar is defined by the following equation. $\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) | M$ Given a set of initial marker positions, the result stream marks all possible positions that can be reached by 0 or more occurrences of characters in class $C$ from each position in $M$. \section{Block-at-a-Time Processing}\label{sec:blockwise} \paragraph*{Long-Stream Addition.}  The maximum word size for addition on commodity processors is typically 64 bits.  In order to implement long-stream addition for block sizes of 256 or larger, a method for propagating carries through the individual stages of 64-bit addition is required.  However, the normal technique of sequential addition using add-with-carry instructions, for example, is far from ideal. We have developed a technique using SIMD or SIMT methods for constant-time long-stream addition up to 4096 bits. We assume the availability of the following SIMD/SIMT operations operating on vectors of $f$ 64-bit fields. \begin{itemize} \item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields in two vectors to produce a result vector of $f$ 64-bit fields. \item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields of \verb:x: each with the constant value -1 (all bits 1), producing an $f$-bit mask value \item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit field into a single compressed $f$-bit mask value, and \item  \verb#simd<64>::spread(x)# : distributing the bits of an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector, and \item normal bitwise logic operations $f$ bit masks. \end{itemize} Given these operations, our method for long stream addition of two $f \times 64$ bit values \verb:X: and \verb:Y: is the following. \begin{enumerate} \item Form the vector of 64-bit sums of \verb:x: and \verb:y:. $\text{\tt R} = \text{\tt simd<64>::add(X, Y)}$ \item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:. $\text{\tt x} = \text{\tt hsimd<64>::mask(X)}$ $\text{\tt y} = \text{\tt hsimd<64>::mask(Y)}$ $\text{\tt r} = \text{\tt hsimd<64>::mask(R)}$ \item Compute an $f$-bit mask of carries generated for each of the 64-bit additions of \verb:X: and \verb:Y:. $\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})$ \item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with an incoming carry bit. $\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$ \item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be incremented to produce the final sum.  This is just MatchStar. $\text{\tt i} = \text{\tt MatchStar(c*2, b)}$ \item Computed the final result {\tt Z}. $\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}$ \end{enumerate} Step 5 is the key step to determine which field-values of {\tt R} must be incremented due to incoming carries.  Note that {\tt c} is the bit mask of outgoing carries, so the multiplication by 2 is necessary to move each carry to its incoming position. At the incoming position, the carry will require an increment to the corresponding field in {\tt R}, as well as all further fields that are reachable by carry propagation through a consecutive run of ones. Figure \ref{fig:longadd} shows the operation of long-stream addition through an example.