source: docs/HPCA2012/03-research.tex @ 1393

Last change on this file since 1393 was 1393, checked in by ashriram, 8 years ago

Minor bug fixes up to 04

File size: 18.9 KB
1\section{The Parabix Framework}
4This section presents an overview of the SIMD-based parallel bit
5stream text processing framework, \emph{Parabix}.  The framework has
6three components: a unifying architectural view of text processing in
7terms of parallel bit streams, a tool chain for automating the
8generation of parallel bit stream code from higher-level
9specifications, and a run-time environment providing a portable SIMD
10programming abstraction, independent of the specific facilities
11available on particular target architectures.
14\subsection{Parallel Bit Streams}
16The fundamental difference between the Parabix framework and
17traditional text processing models is in how Parabix represents the
18source data.  Given a traditional byte-oriented text stream, Parabix
19first transposes the text data to a transform domain consisting of 8
20parallel bit streams, known as {\em basis bit streams}.  In essence,
21each basis bit stream $b_{k}$ represents the stream of $k$-th bit of
22each byte in the source text.  That is, the $k$-th bit of $i$-th byte
23in the source text is in the $i$-th (bit) position of the $k$-th basis
24bit stream, $b_{k}$.  For example, in Figure\ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily
25  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
26  \ldots 7}$. The bits used to construct $\tt b_7$ have been
27highlighted in this example.
32\begin{tabular}{r c c c c }
33STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
34ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
39\begin{tabular}{r |c |c |c |c |c |c |c |c |}
40 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
41 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
42 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
43 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
44 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
49\caption{Example 7-bit ASCII Basis Bit Streams}
53The advantage of the parallel bit stream representation is that we can
54use the 128-bit SIMD registers commonly found on commodity processors
55(e.g. SSE on Intel) to process 128 byte positions at a time using
56bitwise logic, shifting and other operations.
58Just as forward and inverse Fourier transforms are used to transform
59between the time and frequency domains in signal processing, bit
60stream transposition and inverse transposition provides ``byte space''
61and ``bit space'' views of text.  The goal of the Parabix framework is
62to support efficient text processing using these two equivalent
63representations in the same way that efficient signal processing
64benefits from the use of the frequency domain in some cases and the
65time domain in others.
67In the Parabix framework, basis bit streams are used as the starting
68point to determine other bit streams.  In particular, Parabix uses the
69basis bit streams to construct \emph{character-class bit streams} in
70which each $\tt 1$ bit indicates the presense of a significant
71character (or class of characters) in the parsing process.
72Character-class bit streams may then be used to compute \emph{lexical
73  bit streams} and \emph{error bit streams}, which Parabix uses to
74process and validate the source document. The remainder of this
75section will discuss each type of bit stream.
77{\bf Basis Bit Streams:}
78To construct the basis bit streams, the source data is first loaded in
79sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
80so that Parabix can efficiently produce the character-class bit streams.
81Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
82of approximately 1 cycle per byte.
83%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
84%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
85%as UNICODE, may use a multiple code units to express all of its possible code points.
86%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
87%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
88%128 code points within it.
90% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
91% Basis bit streams
92% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
93% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
94% Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit
95% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
96% by periods as to emphasize the $1$ bits.
98{\bf Character-class Bit Streams:}
99Typically, as text parsers process input data, they locate specific
100characters to determine if and when to transition between data and
101metadata parsing.
102% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
103For example, in XML, any opening angle bracket character, `\verb`<`',
104may indicate that we are starting a new markup tag.  Traditional
105byte-at-a-time parsers find these characters by comparing the value of
106each code point with a set of known code points and branching
107appropriately when one is found, typically using an if or switch
108statement.  Using this method to perform multiple transitions in
109parallel is non-trivial and may require fairly sophisticated
110algorithms to do so correctly.
111% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
113Character-class bit streams allow us to perform up to 128
114``comparisons'' in parallel with a single operation by using a series
115of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
116  denote the boolean AND, OR and NOT operations.}  to merge multiple
117basis bit streams into a single character-class stream that marks the
118positions of key characters with a $1$.  For example, a character is
119an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
120b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Classes of
121characters can be found with similar formulas.  For example, a
122character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
123\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
124important observation here is that a range of characters can sometimes
125take fewer operations and require fewer basis bit streams to compute
126than individual characters.  Finding optimal solutions to all
127character-classes is non-trivial and goes beyond the scope of this
130% The advantage of character-class bit streams over traditional ``byte streams'' is that
132% Rather than testing each byte individually,
133% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
136{\bf Lexical and Error Bit Streams:} To perform lexical analysis on
137the input data, Parabix computes lexical and error bit streams from
138the character-class bit streams using a mixture of both boolean logic
139and integer math. Lexical bit streams typically mark multiple current
140parsing positions.  Unlike the single-cursor approach of traditional
141text parsers, these allow Parabix to process multiple cursors in
142parallel.  Error bit streams are derived from the lexical bit streams
143and can be used to identify any well-formedness issues found during
144the parsing process. The presense of a $\tt 1$ bit in an error stream
145tends to mean that the lexical stream cannot be trusted to be
146completely accurate and Parabix may need to perform some sequential
147parsing on that section to determine the cause and severity of the
150To form lexical bit streams, we have to introduce a few new
151operations: Advance and ScanThru.  The $\texttt{Advance}$ operator
152accepts one input parameter, $c$, which is typically viewed as a bit
153stream containing multiple cursor bits, and advances each cursor one
154position forward.  On little-endian architectures, shifting forward
155means shifting to the right.  $\texttt{ScanThru}$ accepts two input
156parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved
157to first subsequent $\tt 0$-bit in $m$ by calculating $(c + m) \land
158\lnot m$.  For example, in Figure \ref{fig:ParabixParsingExample}
159suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to
160find all instances of it in the source text.  We begin by constructing
161the character classes $C_{0}$, which consists of all letters, $C_{1}$,
162which contains all `\verb`>`'s, and $C_{2}$, which marks all
163`\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by
164one to locate the first character of each token. By computing $E_{0}$,
165the parser notes that ``\verb`<>`'' does not match the expected
166pattern. To find the end positions of each token, the parser
167calculates $L_{1}$ by moving the cursors in $L_0$ through the letter
168bits in $C_0$. $L_1$ is then validated to ensure that each token ends
169with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to
170match the expected pattern.  With additional post bit stream
171processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can
172be removed or ignored; the details of which go beyond the scope of
173this paper.
179\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
180$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
181$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
182$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
183$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
184$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
185$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
186$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
191\caption{Lexical Parsing in Parabix}
195% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
197% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
198% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
199% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
200% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
201% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
202% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
203% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
204% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
205% details, refer to the technical report \cite{Cameron2010}.
207Using this parallel bit stream approach, conditional branch statements
208used to identify key positions and/or syntax errors at each each
209parsing position are mostly eliminated, which, as Section
210\ref{section:XML-branches} shows, minimizes branch misprediction
211penalties.  Accurate parsing and parallel lexical analysis is done
212through processor-friendly equations and requires neither speculation
213nor multithreading.
216\subsection{Parabix Compilers}
217\label{parabix tool chain}
219To support the Parabix execution framework, we have developed a tool
220chain to the automate various aspects of parallel bit stream
221programming. Our tool chain consists of two compilers: a
222character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
223block-at-a-time processing compiler (\textit{Pablo}).
225The character class compiler is used to automatically produce bit
226stream logic for all the individual characters (e.g., delimiters) and
227character classes (e.g., digits, letters) used in a particular
228application.  Input is specified using a character class syntax
229adapted from the standard regular expression notations.  Output is a
230minimized set of three-address bitwise operations, such as $a =
231b~\&~c$, to compute each of the character classes from the basis bit
234For example, Figure \ref{fig:CCC} shows the input and output produced
235by the character class compiler for the example of \verb`[0-9]`
236discussed in the previous section.  The output operations may be
237viewed as operations on a single block of input at a time, or may be
238viewed as operations on unbounded bit streams as supported by the Pablo
245\begin{tabular}{r l}
246\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
248\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
249& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
250& \verb`temp3 = (temp2 &~ temp1)` \\
251& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
252& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
253& \verb`digit = (temp3 &~ temp5)`
258\caption{Character Class Compiler Input/Output}
266\begin{tabular}{r l}
267\ttfamily{\bfseries INPUT:} 
268& \verb`def parse_tags(classes, errors):` \\
269& \verb`  classes.C0 = Alpha` \\
270& \verb`  classes.C1 = Rangle` \\
271& \verb`  classes.C2 = Langle` \\
272& \verb`  L0 = bitutil.Advance(C2)` \\
273& \verb`  errors.E0 = L0 &~ C0` \\
274& \verb`  L1 = bitutil.ScanThru(L0, C0)` \\
275& \verb`  errors.E1 = L1 &~ C1` \\ \\
277\ttfamily{\bfseries OUTPUT:} 
278& \verb`struct Parse_tags {` \\
279& \verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
280& \verb`  void do_block(Classes & classes, Errors & errors) {` \\
281& \verb`    BitBlock L0, L1;` \\
282& \verb`    classes.C0 = Alpha;` \\
283& \verb`    classes.C1 = Rangle;` \\
284& \verb`    classes.C2 = Langle;` \\
285& \verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
286& \verb`    errors.E0 = simd_andc(L0, C0);` \\
287& \verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
288& \verb`    errors.E1 = simd_andc(L1, C1);` \\
289& \verb`    CarryQ_Adjust(carryQ, 2);` \\
290& \verb}` \\
291& \verb`  CarryDeclare(carryQ, 2);` \\
292& \verb`};` \\
297\caption{Parallel Block Compiler (Pablo) Input/Output}
301The Pablo compiler abstracts away the details of programming parallel
302bit stream code in terms of finite SIMD register widths and
303application buffer sizes.  Input to Pablo is a language for expressing
304bitstream operations on unbounded bitstreams.  The operations include
305bitwise logic, the {\tt Advance} and {\tt ScanThru} operations
306described in the previous subsection as well as ``if-then'' and
307``while {}'' control structures.  Pablo translates these operations to
308block-at-a-time code in C/C++.
309%, where the block size is the register width for SIMD operations on the selected target architecture.
310The key functionality of Pablo is to arrange for block-to-block
311carry bit propagation to implement the long bitstream shift
312and addition operations required by
313{\tt Advance} and {\tt ScanThru}.
315For example, we can translate the simple parsing example of
316\ref{fig:ParabixParsingExample} above into Pablo code to produce the
317output as shown in Figure \ref{fig:Pablo}.  In this example, Pablo has
318the primary responsibility of inserting carry variable declarations
319that allow the results of Advance and ScanThru operations to be
320carried over from block to block.  A separate carry variable is
321required for every {\tt Advance} or {\tt ScanThru} operation.  A
322function containing such operations is translated into a public C++
323class (struct), which includes a carry queue to hold all the carry
324variables from iteration to iteration, together with the a method {\tt
325  do\_block} to implement the processing for a single block (based on
326the SIMD register width).  Macros {\tt CarryDeclare} and {\tt
327  CarryInit} declare and initialize the carry queue structure
328depending on the specific architecture and carry queue representation.
329The unbounded bitstream {\tt Advance} and {\tt ScanThru} operations
330are translated into block-by-block equivalents with explicit carry-in
331and carry-out processing.  At the end of each block, the {\tt
332  CarryQ\_Adjust} operation implements any necessary adjustment of the
333carry queue to prepare for the next iteration.  The Pablo language and
334compiler also support conditional and iterative bitstream logic on
335unbounded streams (if and while constructs) which involves additional
336carry-test insertion in control branches.  Explaining the full details
337of the translation is beyond the scope of this paper, however.
339\subsection{Parabix Run-Time Libraries}
341The Parabix architecture also includes run-time libraries that support
342a machine-independent view of basic SIMD operations, as well as a set
343of core function libraries.  For machine-independence, we program all
344operations using an abstract SIMD machine.  The abstract machine
345supports all power-of-2 field widths up to the full SIMD register
346width on a target machine.  Let $w = 2k$ be the field width in
347bits. Let $f$ be a basic binary operation defined on $w$-bit
348quantities producing an $w$-bit result. Let $W$ be the SIMD vector
349size in bits where $W = 2K$.  Then the C++ template notation
350\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
351SIMD operation yielding an output SIMD vector $v$, given two input
352SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
353computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
354\verb`simd<8>::add(a,b)` represents the simultaneous addition of
355sixteen 8-bit fields.
357These operations were originally developed for 128-bit Altivec
358operations on Power PC as well as 64-bit MMX and 128-bit SSE
359operations on Intel but have recently extended to support the new
360256-bit AVX operations on Intel as well as the 128-bit NEON operations
361on the ARM architecture.
Note: See TracBrowser for help on using the repository browser.