# Changeset 1377 for docs/HPCA2012/03-research.tex

Ignore:
Timestamp:
Aug 25, 2011, 11:48:22 AM (8 years ago)
Message:

minor edits; added logic notation definition back in

File:
1 edited

### Legend:

Unmodified
 r1374 \section{The Parabix Framework} \label{section:parabix} %Describe key technology behind Parabix %Introduce SIMD; %Talk about SSE %Highlight which SSE instructions are important %TAlk about each pass in the parser; How SSE is used in every phase... %Benefits of SSE in each phase. % Extract section 2.2 and merge into 3.   Add a new subsection % in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical % operations and then mention the pack operations that allow % us to implement transposition efficiently in parallel. % Also note that the SIMD registers support bitwise logic across % their full width and that this is extensively used in our work. % % Also, it could be good to have a small excerpt of a byte-at-a-time % scanning loop for XML, e.g., extracted from Xerces in section 2.1. % Just a few lines showing the while loop - Linda can tell you the file. % % This section focuses on the % With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position % within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, % 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and % shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel % \cite{CameronLin2009}. % The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0. % It is our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte. % This expectation is based on the relatively-branchless code composition of Parabix2 and the more-efficient utilization of last-level cache resources. % The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption. % Hence, one of the foci in our study is the manner in which straight line SIMD code influences energy usage. This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}. first transposes the text data to a transform domain consisting of 8 parallel bit streams, known as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream of $k$th bit of each byte in the source text.  That is, the $k$th bit of $i$th byte in the source text is in the $i$th (bit) position of the $k$th basis bit stream, $b_{k}$. of $k$-th bit of each byte in the source text.  That is, the $k$-th bit of $i$-th byte in the source text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$. For example, in Figure \ref{fig:BitstreamsExample}, we show how the ASCII string {\ttfamily b7\verb<A}'' is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example. Character-class bit streams allow us to perform up to 128 comparisons'' in parallel with a single operation by using a series of boolean-logic operations to merge multiple basis bit streams into by using a series of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.} to merge multiple basis bit streams into a single character-class stream that marks the positions of key characters with a $1$. For example, a character is an \verb<' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb<[a-zA-Z]+> and wish to find all instances of it in the source text. %We know from this statement that only strings starting with \verb< that contain at least one letter and end with a \verb> are matches. We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all \verb>'s, and $C_{2}$, which marks all \verb<'s. In $L_0$ the position of every \verb<' is advanced by one to locate the first character of each Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each each parsing position are mostly eliminated, which, as Section XXX shows, minimizes branch misprediction penalties. each parsing position are mostly eliminated, which, as Section \ref{section:XML-branches} shows, minimizes branch misprediction penalties. Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading. \begin{tabular}{r l} \ttfamily{\bfseries INPUT:} & \verbdigit = [0-9] \\ \ttfamily{\bfseries INPUT:} & \verbdigit = [0-9] \\ \\ \ttfamily{\bfseries OUTPUT:} & \verbtemp1 = (basis_bits.bit_0 | basis_bits.bit_1) \\ & \verbtemp2 = (basis_bits.bit_2 & basis_bits.bit_3) \\ \end{figure} Pablo is a compiler that abstracts away the details of The Pablo compiler abstracts away the details of programming parallel bit stream code in terms of finite SIMD register widths and application buffer sizes.  Input to Pablo is a language for expressing bitstream operations on unbounded bitstreams.  The operations include bitwise SIMD register widths and application buffer sizes. Input to Pablo is a language for expressing bitstream operations on unbounded bitstreams. The operations include bitwise logic, the {\tt Advance} and {\tt ScanThru} operations described in he previous subsection as well as if and while control structures. \begin{tabular}{r l} \ttfamily{\bfseries INPUT:} & \verbdef parse_tags(classes, errors): \\ & \verb        classes.C0 = Alpha \\ & \verb        classes.C1 = Rangle \\ & \verb        classes.C2 = Langle \\ & \verb        L0 = bitutil.Advance(C2) \\ & \verb        errors.E0 = L0 &~ C0 \\ & \verb        L1 = bitutil.ScanThru(L0, C0) \\ & \verb        errors.E1 = L1 &~ C1 \\ \\ \ttfamily{\bfseries OUTPUT:} & \verbstruct Parse_tags { \\ & \verb  Parse_tags() {  \\ & \verb  CarryInit(carryQ, 2); } \\ \ttfamily{\bfseries INPUT:} & \verbdef parse_tags(classes, errors): \\ & \verb  classes.C0 = Alpha \\ & \verb  classes.C1 = Rangle \\ & \verb  classes.C2 = Langle \\ & \verb  L0 = bitutil.Advance(C2) \\ & \verb  errors.E0 = L0 &~ C0 \\ & \verb  L1 = bitutil.ScanThru(L0, C0) \\ & \verb  errors.E1 = L1 &~ C1 \\ \\ \ttfamily{\bfseries OUTPUT:} & \verbstruct Parse_tags { \\ & \verb  Parse_tags() { CarryInit(carryQ, 2); } \\ & \verb  void do_block(Classes & classes, Errors & errors) { \\ & \verb                BitBlock L0, L1; \\ & \verb        classes.C0 = Alpha; \\ & \verb        classes.C1 = Rangle; \\ & \verb        classes.C2 = Langle; \\ & \verb        L0 = BitBlock_advance_ci_co(C2, carryQ, 0); \\ & \verb        errors.E0 = simd_andc(L0, C0); \\ & \verb        L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1); \\ & \verb        errors.E1 = simd_andc(L1, C1); \\ & \verb        CarryQ_Adjust(carryQ, 2); \\ & \verb    BitBlock L0, L1; \\ & \verb    classes.C0 = Alpha; \\ & \verb    classes.C1 = Rangle; \\ & \verb    classes.C2 = Langle; \\ & \verb    L0 = BitBlock_advance_ci_co(C2, carryQ, 0); \\ & \verb    errors.E0 = simd_andc(L0, C0); \\ & \verb    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1); \\ & \verb    errors.E1 = simd_andc(L1, C1); \\ & \verb    CarryQ_Adjust(carryQ, 2); \\ & \verb  } \\ & \verb  CarryDeclare(carryQ, 2); \\ & \verb  }; \\ & \verb}; \\ \end{tabular}