Changeset 1377 for docs


Ignore:
Timestamp:
Aug 25, 2011, 11:48:22 AM (8 years ago)
Author:
ksherdy
Message:

minor edits; added logic notation definition back in

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2012/03-research.tex

    r1374 r1377  
    11\section{The Parabix Framework}
    22\label{section:parabix}
    3 
    4 %Describe key technology behind Parabix
    5 %Introduce SIMD;
    6 %Talk about SSE
    7 %Highlight which SSE instructions are important
    8 %TAlk about each pass in the parser; How SSE is used in every phase...
    9 %Benefits of SSE in each phase.
    10 
    11 
    12 % Extract section 2.2 and merge into 3.   Add a new subsection
    13 % in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
    14 % operations and then mention the pack operations that allow
    15 % us to implement transposition efficiently in parallel. 
    16 % Also note that the SIMD registers support bitwise logic across
    17 % their full width and that this is extensively used in our work.
    18 %
    19 % Also, it could be good to have a small excerpt of a byte-at-a-time
    20 % scanning loop for XML, e.g., extracted from Xerces in section 2.1. 
    21 % Just a few lines showing the while loop - Linda can tell you the file.
    22 %
    23 
    24 % This section focuses on the
    25 
    26 % With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position
    27 % within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit,
    28 % 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and
    29 % shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel
    30 % \cite{CameronLin2009}.
    31 
    32 % The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0.
    33 % It is our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte.
    34 % This expectation is based on the relatively-branchless code composition of Parabix2 and the more-efficient utilization of last-level cache resources.
    35 % The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption.
    36 % Hence, one of the foci in our study is the manner in which straight line SIMD code influences energy usage.
    373
    384This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}.
     
    5016first transposes the text data to a transform domain consisting of 8 parallel bit streams, known
    5117as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream
    52 of $k$th bit of each byte in the source text.  That is, the $k$th bit of $i$th byte in the source
    53 text is in the $i$th (bit) position of the $k$th basis bit stream, $b_{k}$.
     18of $k$-th bit of each byte in the source text.  That is, the $k$-th bit of $i$-th byte in the source
     19text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$.
    5420For example, in Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
    5521streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
     
    13298
    13399Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation
    134 by using a series of boolean-logic operations to merge multiple basis bit streams into
     100by using a series of boolean-logic operations
     101\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.}
     102to merge multiple basis bit streams into
    135103a single character-class stream that marks the positions of key characters with a $1$.
    136104For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
     
    163131For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
    164132instances of it in the source text.
    165 %We know from this statement that only strings starting with \verb`<` that contain at least one letter and end with a \verb`>` are matches.
    166133We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
    167134$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
     
    205172
    206173Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
    207 each parsing position are mostly eliminated, which, as Section XXX shows, minimizes branch misprediction penalties.
     174each parsing position are mostly eliminated, which, as Section \ref{section:XML-branches} shows, minimizes branch misprediction penalties.
    208175Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading.
    209176
     
    234201
    235202\begin{tabular}{r l}
    236 \ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\
     203\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
     204
    237205\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
    238206& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
     
    249217\end{figure}
    250218
    251 Pablo is a compiler that abstracts away the details of
     219The Pablo compiler abstracts away the details of
    252220programming parallel bit stream code in terms of finite
    253 SIMD register widths and application buffer sizes.  Input
    254 to Pablo is a language for expressing bitstream operations
    255 on unbounded bitstreams.  The operations include bitwise
     221SIMD register widths and application buffer sizes.
     222Input to Pablo is a language for expressing bitstream operations on unbounded bitstreams. 
     223The operations include bitwise
    256224logic, the {\tt Advance} and {\tt ScanThru} operations described in he
    257225previous subsection as well as if and while control structures.
     
    278246
    279247\begin{tabular}{r l}
    280 \ttfamily{\bfseries INPUT:} & \verb`def parse_tags(classes, errors):` \\
    281 & \verb`        classes.C0 = Alpha` \\
    282 & \verb`        classes.C1 = Rangle` \\
    283 & \verb`        classes.C2 = Langle` \\
    284 & \verb`        L0 = bitutil.Advance(C2)` \\
    285 & \verb`        errors.E0 = L0 &~ C0` \\
    286 & \verb`        L1 = bitutil.ScanThru(L0, C0)` \\
    287 & \verb`        errors.E1 = L1 &~ C1` \\ \\
    288 
    289 \ttfamily{\bfseries OUTPUT:} & \verb`struct Parse_tags {` \\
    290 & \verb`  Parse_tags() { ` \\
    291 & \verb`  CarryInit(carryQ, 2); }` \\
     248\ttfamily{\bfseries INPUT:}
     249& \verb`def parse_tags(classes, errors):` \\
     250& \verb`  classes.C0 = Alpha` \\
     251& \verb`  classes.C1 = Rangle` \\
     252& \verb`  classes.C2 = Langle` \\
     253& \verb`  L0 = bitutil.Advance(C2)` \\
     254& \verb`  errors.E0 = L0 &~ C0` \\
     255& \verb`  L1 = bitutil.ScanThru(L0, C0)` \\
     256& \verb`  errors.E1 = L1 &~ C1` \\ \\
     257
     258\ttfamily{\bfseries OUTPUT:}
     259& \verb`struct Parse_tags {` \\
     260& \verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
    292261& \verb`  void do_block(Classes & classes, Errors & errors) {` \\
    293 & \verb`                BitBlock L0, L1;` \\
    294 & \verb`        classes.C0 = Alpha;` \\
    295 & \verb`        classes.C1 = Rangle;` \\
    296 & \verb`        classes.C2 = Langle;` \\
    297 & \verb`        L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
    298 & \verb`        errors.E0 = simd_andc(L0, C0);` \\
    299 & \verb`        L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
    300 & \verb`        errors.E1 = simd_andc(L1, C1);` \\
    301 & \verb`        CarryQ_Adjust(carryQ, 2);` \\
     262& \verb`    BitBlock L0, L1;` \\
     263& \verb`    classes.C0 = Alpha;` \\
     264& \verb`    classes.C1 = Rangle;` \\
     265& \verb`    classes.C2 = Langle;` \\
     266& \verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
     267& \verb`    errors.E0 = simd_andc(L0, C0);` \\
     268& \verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
     269& \verb`    errors.E1 = simd_andc(L1, C1);` \\
     270& \verb`    CarryQ_Adjust(carryQ, 2);` \\
    302271& \verb`  }` \\
    303272& \verb`  CarryDeclare(carryQ, 2);` \\
    304 & \verb`  };` \\
     273& \verb`};` \\
    305274\end{tabular}
    306275
Note: See TracChangeset for help on using the changeset viewer.