Ignore:
Timestamp:
Aug 30, 2011, 10:47:59 AM (8 years ago)
Author:
ashriram
Message:

Minor bug fixes up to 04

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2012/03-research.tex

    r1384 r1393  
    22\label{section:parabix}
    33
    4 This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}.
    5 The framework has three components: a unifying architectural view of text processing in terms
    6 of parallel bit streams, a tool chain for automating the generation of parallel bit stream
    7 code from higher-level specifications, and a run-time environment providing a portable
    8 SIMD programming abstraction, independent of the specific facilities available
    9 on particular target architectures.
     4This section presents an overview of the SIMD-based parallel bit
     5stream text processing framework, \emph{Parabix}.  The framework has
     6three components: a unifying architectural view of text processing in
     7terms of parallel bit streams, a tool chain for automating the
     8generation of parallel bit stream code from higher-level
     9specifications, and a run-time environment providing a portable SIMD
     10programming abstraction, independent of the specific facilities
     11available on particular target architectures.
    1012
    1113
    1214\subsection{Parallel Bit Streams}
    1315
    14 The fundamental difference between the Parabix framework and traditional text processing models is in how
    15 Parabix represents the source data.   Given a traditional byte-oriented text stream, Parabix
    16 first transposes the text data to a transform domain consisting of 8 parallel bit streams, known
    17 as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream
    18 of $k$-th bit of each byte in the source text.  That is, the $k$-th bit of $i$-th byte in the source
    19 text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$.
    20 For example, in Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
    21 streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
     16The fundamental difference between the Parabix framework and
     17traditional text processing models is in how Parabix represents the
     18source data.  Given a traditional byte-oriented text stream, Parabix
     19first transposes the text data to a transform domain consisting of 8
     20parallel bit streams, known as {\em basis bit streams}.  In essence,
     21each basis bit stream $b_{k}$ represents the stream of $k$-th bit of
     22each byte in the source text.  That is, the $k$-th bit of $i$-th byte
     23in the source text is in the $i$-th (bit) position of the $k$-th basis
     24bit stream, $b_{k}$.  For example, in Figure\ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily
     25  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
     26  \ldots 7}$. The bits used to construct $\tt b_7$ have been
     27highlighted in this example.
    2228
    2329\begin{figure}[h]
     
    4652
    4753The advantage of the parallel bit stream representation is that we can
    48 use the 128-bit SIMD registers commonly found on commodity processors (e.g. SSE on
    49 Intel) to process 128 byte positions at a time using bitwise logic, shifting and
    50 other operations.
     54use the 128-bit SIMD registers commonly found on commodity processors
     55(e.g. SSE on Intel) to process 128 byte positions at a time using
     56bitwise logic, shifting and other operations.
    5157
    5258Just as forward and inverse Fourier transforms are used to transform
    53 between the time and frequency domains in signal processing, bit stream
    54 transposition and inverse transposition provides ``byte space'' and ``bit space''
    55 views of text.  The goal of the Parabix framework is to support efficient
    56 text processing using these two equivalent representations in the same
    57 way that efficient signal processing benefits from the use of the frequency
    58 domain in some cases and the time domain in others.
     59between the time and frequency domains in signal processing, bit
     60stream transposition and inverse transposition provides ``byte space''
     61and ``bit space'' views of text.  The goal of the Parabix framework is
     62to support efficient text processing using these two equivalent
     63representations in the same way that efficient signal processing
     64benefits from the use of the frequency domain in some cases and the
     65time domain in others.
    5966
    6067In the Parabix framework, basis bit streams are used as the starting
    61 point to determine other bit streams.   In particular, Parabix uses the basis bit streams to
    62 construct \emph{character-class bit streams} in which each $\tt 1$ bit indicates the presense
    63 of a significant character (or class of characters) in the parsing process.
    64 Character-class bit streams may then be used to compute \emph{lexical bit streams} and \emph{error bit streams}, which
    65 Parabix uses to process and validate the source document. The remainder of this section will discuss each type of bit stream.
     68point to determine other bit streams.  In particular, Parabix uses the
     69basis bit streams to construct \emph{character-class bit streams} in
     70which each $\tt 1$ bit indicates the presense of a significant
     71character (or class of characters) in the parsing process.
     72Character-class bit streams may then be used to compute \emph{lexical
     73  bit streams} and \emph{error bit streams}, which Parabix uses to
     74process and validate the source document. The remainder of this
     75section will discuss each type of bit stream.
    6676
    6777{\bf Basis Bit Streams:}
     
    7080so that Parabix can efficiently produce the character-class bit streams.
    7181Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
    72 of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
     82of approximately 1 cycle per byte.
    7383%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
    7484%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
     
    8797
    8898{\bf Character-class Bit Streams:}
    89 Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between
    90 data and metadata parsing.
     99Typically, as text parsers process input data, they locate specific
     100characters to determine if and when to transition between data and
     101metadata parsing.
    91102% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
    92 For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
    93 Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set
    94 of known code points and branching appropriately when one is found, typically using an if or switch statement.
    95 Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
    96 to do so correctly.
     103For example, in XML, any opening angle bracket character, `\verb`<`',
     104may indicate that we are starting a new markup tag.  Traditional
     105byte-at-a-time parsers find these characters by comparing the value of
     106each code point with a set of known code points and branching
     107appropriately when one is found, typically using an if or switch
     108statement.  Using this method to perform multiple transitions in
     109parallel is non-trivial and may require fairly sophisticated
     110algorithms to do so correctly.
    97111% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
    98112
    99 Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation
    100 by using a series of boolean-logic operations
    101 \footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.}
    102 to merge multiple basis bit streams into
    103 a single character-class stream that marks the positions of key characters with a $1$.
    104 For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
    105 Classes of characters can be found with similar formulas.
    106 For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
    107 An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute
    108 than individual characters.
    109 Finding optimal solutions to all character-classes is non-trivial and goes beyond the scope of this paper.
     113Character-class bit streams allow us to perform up to 128
     114``comparisons'' in parallel with a single operation by using a series
     115of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
     116  denote the boolean AND, OR and NOT operations.}  to merge multiple
     117basis bit streams into a single character-class stream that marks the
     118positions of key characters with a $1$.  For example, a character is
     119an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
     120b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Classes of
     121characters can be found with similar formulas.  For example, a
     122character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
     123\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
     124important observation here is that a range of characters can sometimes
     125take fewer operations and require fewer basis bit streams to compute
     126than individual characters.  Finding optimal solutions to all
     127character-classes is non-trivial and goes beyond the scope of this
     128paper.
    110129
    111130% The advantage of character-class bit streams over traditional ``byte streams'' is that
     
    115134
    116135
    117 {\bf Lexical and Error Bit Streams:}
    118 To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
    119 a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
    120 Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
    121 Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
    122 issues found during the parsing process. The presense of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be
    123 trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity
    124 of the error.
    125 
    126 To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.
    127 The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
    128 and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
    129 $\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
    130 $\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$. 
    131 For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
    132 instances of it in the source text.
    133 We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
    134 $C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
    135 token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
    136 the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
    137 token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
    138 With additional post bit-stream processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can be removed or ignored; the details
    139 of which go beyond the scope of this paper.
     136{\bf Lexical and Error Bit Streams:} To perform lexical analysis on
     137the input data, Parabix computes lexical and error bit streams from
     138the character-class bit streams using a mixture of both boolean logic
     139and integer math. Lexical bit streams typically mark multiple current
     140parsing positions.  Unlike the single-cursor approach of traditional
     141text parsers, these allow Parabix to process multiple cursors in
     142parallel.  Error bit streams are derived from the lexical bit streams
     143and can be used to identify any well-formedness issues found during
     144the parsing process. The presense of a $\tt 1$ bit in an error stream
     145tends to mean that the lexical stream cannot be trusted to be
     146completely accurate and Parabix may need to perform some sequential
     147parsing on that section to determine the cause and severity of the
     148error.
     149
     150To form lexical bit streams, we have to introduce a few new
     151operations: Advance and ScanThru.  The $\texttt{Advance}$ operator
     152accepts one input parameter, $c$, which is typically viewed as a bit
     153stream containing multiple cursor bits, and advances each cursor one
     154position forward.  On little-endian architectures, shifting forward
     155means shifting to the right.  $\texttt{ScanThru}$ accepts two input
     156parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved
     157to first subsequent $\tt 0$-bit in $m$ by calculating $(c + m) \land
     158\lnot m$.  For example, in Figure \ref{fig:ParabixParsingExample}
     159suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to
     160find all instances of it in the source text.  We begin by constructing
     161the character classes $C_{0}$, which consists of all letters, $C_{1}$,
     162which contains all `\verb`>`'s, and $C_{2}$, which marks all
     163`\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by
     164one to locate the first character of each token. By computing $E_{0}$,
     165the parser notes that ``\verb`<>`'' does not match the expected
     166pattern. To find the end positions of each token, the parser
     167calculates $L_{1}$ by moving the cursors in $L_0$ through the letter
     168bits in $C_0$. $L_1$ is then validated to ensure that each token ends
     169with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to
     170match the expected pattern.  With additional post bit stream
     171processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can
     172be removed or ignored; the details of which go beyond the scope of
     173this paper.
    140174
    141175\begin{figure}[h]
     
    171205% details, refer to the technical report \cite{Cameron2010}.
    172206
    173 Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
    174 each parsing position are mostly eliminated, which, as Section \ref{section:XML-branches} shows, minimizes branch misprediction penalties.
    175 Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading.
    176 
    177 
    178 \subsection{Parabix Tool Chain}
     207Using this parallel bit stream approach, conditional branch statements
     208used to identify key positions and/or syntax errors at each each
     209parsing position are mostly eliminated, which, as Section
     210\ref{section:XML-branches} shows, minimizes branch misprediction
     211penalties.  Accurate parsing and parallel lexical analysis is done
     212through processor-friendly equations and requires neither speculation
     213nor multithreading.
     214
     215
     216\subsection{Parabix Compilers}
    179217\label{parabix tool chain}
    180218
    181 To support the Parabix framework, we are developing tool technology to automate various aspects
    182 of parallel bit stream programming. At present, our tool chain consists of two compilers: a character class
    183 compiler (ccc) and a unbounded bit stream to C/C++ block-at-a-time processing compiler (Pablo).
    184 
    185 The character class compiler is used to automatically produce bit stream logic for all the individual characters (e.g.,
    186 delimiters) and character classes (e.g., digits, letters) used in a particular application.
    187 Input is specified using a character class syntax adapted from the standard regular expression notations.
    188 Output is a minimized set of three-address bitwise operations, such as $a = b~\&~c$,
    189 to compute each of the character classes from the basis bit streams.
    190 
    191 For example, Figure \ref{fig:CCC} shows the input and output produced by the
    192 character class compiler for the example of \verb`[0-9]`
    193 discussed in the previous section.  The output operations
    194 may be viewed as operations on a single block of input
    195 at a time, or may be viewed as operations on unbounded bitstreams
    196 as supported by the Pablo compiler.
     219To support the Parabix execution framework, we have developed a tool
     220chain to the automate various aspects of parallel bit stream
     221programming. Our tool chain consists of two compilers: a
     222character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
     223block-at-a-time processing compiler (\textit{Pablo}).
     224
     225The character class compiler is used to automatically produce bit
     226stream logic for all the individual characters (e.g., delimiters) and
     227character classes (e.g., digits, letters) used in a particular
     228application.  Input is specified using a character class syntax
     229adapted from the standard regular expression notations.  Output is a
     230minimized set of three-address bitwise operations, such as $a =
     231b~\&~c$, to compute each of the character classes from the basis bit
     232streams.
     233
     234For example, Figure \ref{fig:CCC} shows the input and output produced
     235by the character class compiler for the example of \verb`[0-9]`
     236discussed in the previous section.  The output operations may be
     237viewed as operations on a single block of input at a time, or may be
     238viewed as operations on unbounded bit streams as supported by the Pablo
     239compiler.
    197240
    198241\begin{figure}[tbh]
     
    256299\end{figure}
    257300
    258 The Pablo compiler abstracts away the details of
    259 programming parallel bit stream code in terms of finite
    260 SIMD register widths and application buffer sizes.
    261 Input to Pablo is a language for expressing bitstream operations on unbounded bitstreams. 
    262 The operations include bitwise
    263 logic, the {\tt Advance} and {\tt ScanThru} operations described in he
    264 previous subsection as well as if and while control structures.
    265 Pablo translates these operations to block-at-a-time
    266 code in C/C++.
     301The Pablo compiler abstracts away the details of programming parallel
     302bit stream code in terms of finite SIMD register widths and
     303application buffer sizes.  Input to Pablo is a language for expressing
     304bitstream operations on unbounded bitstreams.  The operations include
     305bitwise logic, the {\tt Advance} and {\tt ScanThru} operations
     306described in the previous subsection as well as ``if-then'' and
     307``while {}'' control structures.  Pablo translates these operations to
     308block-at-a-time code in C/C++.
    267309%, where the block size is the register width for SIMD operations on the selected target architecture.
    268310The key functionality of Pablo is to arrange for block-to-block
     
    271313{\tt Advance} and {\tt ScanThru}.
    272314
    273 For example, we can translate the simple parsing example
    274 of \ref{fig:ParabixParsingExample} above into Pablo code
    275 to produce the output as shown in Figure \ref{fig:Pablo}.
    276 In this example, Pablo has the primary responsibility of inserting
    277 carry variable declarations that allow the results of
    278 Advance and ScanThru operations to be carried over from
    279 block to block.  A separate carry variable is required for every
    280 {\tt Advance} or {\tt ScanThru} operation.  A function containing
    281 such operations is translated into a public C++ class (struct),
    282 which includes a Carry Queue to hold all the carry
    283 variables from iteration to iteration, together with the
    284 a method {\tt do\_block} to implement the processing
    285 for a single block (based on the SIMD register width).
    286 Macros {\tt CarryDeclare} and {\tt CarryInit} declare and
    287 initialize the Carry Queue structure depending on the
    288 specific architecture and Carry Queue representation.
    289 The unbounded bitstream {\tt Advance} and {\tt ScanThru}
    290 operations are translated into block-by-block equivalents
    291 with explicit carry-in and carry-out processing. 
    292 At the end of each block, the {\tt CarryQ\_Adjust}
    293 operation implements any necessary adjustment of the
    294 Carry Queue to prepare for the next iteration.
    295 The Pablo language and compiler also support conditional
    296 and iterative bitstream logic on unbounded streams
    297 (if and while constructs) which involves additional
    298 carry-test insertion in control branches.   Explaining the
    299 full details of the translation
    300 is beyond the scope of this paper, however.
     315For example, we can translate the simple parsing example of
     316\ref{fig:ParabixParsingExample} above into Pablo code to produce the
     317output as shown in Figure \ref{fig:Pablo}.  In this example, Pablo has
     318the primary responsibility of inserting carry variable declarations
     319that allow the results of Advance and ScanThru operations to be
     320carried over from block to block.  A separate carry variable is
     321required for every {\tt Advance} or {\tt ScanThru} operation.  A
     322function containing such operations is translated into a public C++
     323class (struct), which includes a carry queue to hold all the carry
     324variables from iteration to iteration, together with the a method {\tt
     325  do\_block} to implement the processing for a single block (based on
     326the SIMD register width).  Macros {\tt CarryDeclare} and {\tt
     327  CarryInit} declare and initialize the carry queue structure
     328depending on the specific architecture and carry queue representation.
     329The unbounded bitstream {\tt Advance} and {\tt ScanThru} operations
     330are translated into block-by-block equivalents with explicit carry-in
     331and carry-out processing.  At the end of each block, the {\tt
     332  CarryQ\_Adjust} operation implements any necessary adjustment of the
     333carry queue to prepare for the next iteration.  The Pablo language and
     334compiler also support conditional and iterative bitstream logic on
     335unbounded streams (if and while constructs) which involves additional
     336carry-test insertion in control branches.  Explaining the full details
     337of the translation is beyond the scope of this paper, however.
    301338
    302339\subsection{Parabix Run-Time Libraries}
    303340
    304 The Parabix architecture also includes run-time libraries
    305 that support a machine-independent view of basic SIMD
    306 operations, as well as a set of core function libraries.
    307 For machine-independence, we program all operations using
    308 an abstract SIMD machine based on the Inductive Doubling
    309 Instruction Set Architecture (IDISA) \cite{CameronLin2009}.
    310 The abstract machine supports all power-of-2 field widths up to the full
    311 SIMD register width on a target machine.   
    312 Let $w = 2k$ be the field width in bits. Let $f$ be a basic binary operation defined on $w$-bit quantities
    313 producing an $w$-bit result. Let $W$ be the SIMD vector size in bits where $W = 2K$.
    314 Then the C++ template notation \verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical SIMD operation yielding an output SIMD vector $v$,
    315 given two input SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value computed is $f(a_i, b_i)$.
    316 For example, given 128-bit SIMD vectors, \verb`simd<8>::add(a,b)` represents the simultaneous addition of sixteen 8-bit fields.
    317 
    318 These operations were originally developed for 128-bit Altivec operations on Power PC
    319 as well as 64-bit MMX and 128-bit SSE operations on Intel
    320 but have recently extended to support
    321 the new 256-bit AVX operations on Intel as well as the 128-bit
    322 NEON operations on the ARM architecture.
     341The Parabix architecture also includes run-time libraries that support
     342a machine-independent view of basic SIMD operations, as well as a set
     343of core function libraries.  For machine-independence, we program all
     344operations using an abstract SIMD machine.  The abstract machine
     345supports all power-of-2 field widths up to the full SIMD register
     346width on a target machine.  Let $w = 2k$ be the field width in
     347bits. Let $f$ be a basic binary operation defined on $w$-bit
     348quantities producing an $w$-bit result. Let $W$ be the SIMD vector
     349size in bits where $W = 2K$.  Then the C++ template notation
     350\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
     351SIMD operation yielding an output SIMD vector $v$, given two input
     352SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
     353computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
     354\verb`simd<8>::add(a,b)` represents the simultaneous addition of
     355sixteen 8-bit fields.
     356
     357These operations were originally developed for 128-bit Altivec
     358operations on Power PC as well as 64-bit MMX and 128-bit SSE
     359operations on Intel but have recently extended to support the new
     360256-bit AVX operations on Intel as well as the 128-bit NEON operations
     361on the ARM architecture.
    323362 
Note: See TracChangeset for help on using the changeset viewer.