source: docs/HPCA2012/final_ieee/03-research.tex @ 1783

Last change on this file since 1783 was 1783, checked in by ashriram, 7 years ago

Final pass

File size: 19.0 KB
Line 
1\section{The Parabix Framework}
2\label{section:parabix}
3
4This section presents an overview of the SIMD-based parallel bit
5stream text processing framework, \emph{Parabix}.  The framework has
6three components: (1) a unifying architectural view of text processing in
7terms of parallel bit streams, (2) a tool chain for automating the
8generation of parallel bit stream code from higher-level
9specifications, and (3) a runtime environment, which provides a portable SIMD
10programming abstraction that is independent of the specific facilities
11available on particular target architectures.
12
13
14\subsection{Parallel Bit Streams}
15
16
17The fundamental difference between the Parabix framework and
18traditional text processing models is in how Parabix represents the
19source data.  Given a traditional byte-oriented text stream, Parabix
20first transposes the text data into a transform representation consisting of 8
21parallel bit streams, known as {\em basis bit streams} wherein
22basis bit stream $b_{k}$ represents the stream of $k$-th bit of
23each byte in the source text.  That is, the $k$-th bit of $i$-th byte
24in the source text is in one-to-one correspondence with the $i$-th bit of the $k$-th basis
25bit stream, $b_{k}$.  For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
26  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
27  \ldots 7}$. The bits used to construct $\tt b_7$ are
28highlighted in this example.
29
30\begin{figure}[h]
31\begin{center}
32\begin{tabular}{r c c c c }
33STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
34ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
35\hline
36\end{tabular}
37\end{center}
38\begin{center}
39\begin{tabular}{r |c |c |c |c |c |c |c |c |}
40 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
41 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
42 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
43 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
44 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
45\end{tabular}
46\end{center}
47
48\caption{Example 7-bit ASCII Basis Bit Streams}
49\label{fig:BitStreamsExample}
50\end{figure}
51\pagebreak
52
53The advantage of the parallel bit stream representation is that we can
54use the 128-bit SIMD registers commonly found on commodity processors
55(e.g. SSE on Intel) to process 128 byte positions at a time using
56bitwise logical, shift and arithmetic operations. Just as forward and
57inverse Fourier transforms are used to transform between the time and
58frequency domains in signal processing, bit stream transposition and
59inverse transposition provides ``byte space'' and ``bit space''
60domains for text.  The Parabix framework supports
61efficient text processing using these two equivalent representations
62analogous to signal processing that benefits from the use of the
63frequency domain and time domain.
64
65In the Parabix framework, basis bit streams are used as the starting
66point to determine other bit streams.  In particular, Parabix uses the
67basis bit streams to construct \emph{character-class bit streams} in
68which each $\tt 1$ bit indicates the presence of a significant
69character (or class of characters) in the parsing process.
70Character-class bit streams may then be used to compute \emph{lexical
71  bit streams} and \emph{error bit streams}, which Parabix uses to
72process and validate the source document. The remainder of this
73section will discuss each type of bit stream.
74
75
76{\bf Basis Bit Streams:}
77To construct the basis bit streams, the source data is first loaded in
78sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
79so that Parabix can efficiently produce the character-class bit streams.
80Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
81of approximately 1 cycle per byte.
82%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
83%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
84%as UNICODE, may use a multiple code units to express all of its possible code points.
85%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
86%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
87%128 code points within it.
88
89% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
90% Basis bit streams
91% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
92% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
93% Figure \ref{fig:bit streamsExample} presents an example of the basis bit stream representation of 8-bit
94% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
95% by periods as to emphasize the $1$ bits.
96
97{\bf Character-class Bit Streams:}
98Typically, as text parsers process input data, they locate specific
99characters to determine if and when to transition between data and
100metadata parsing.
101% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
102For example, in XML, any opening angle bracket character, `\verb`<`', may indicate the start of a markup tag.
103Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set
104of known significant characters and branching appropriately when one is found, typically using an if or switch statement.
105Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
106to do so correctly.
107
108% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
109
110Character-class bit streams enable up to 128
111``comparisons'' in parallel through a
112series of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
113  denote the boolean AND, OR and NOT operations.}  that merge multiple basis
114bit streams into a single character-class stream that marks the
115positions of key characters. For example, a character is
116an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
117b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Addition character
118classes can be determined with similar formulas.  For example, a
119character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
120\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
121important observation here is that a range of characters can sometimes
122take fewer operations and require fewer basis bit streams to compute
123than individual characters.  Finding optimal solutions for all
124character-classes is beyond the scope of this
125paper.
126
127% The advantage of character-class bit streams over traditional ``byte streams'' is that
128
129% Rather than testing each byte individually,
130% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
131
132
133{\bf Lexical and Error Bit Streams:}
134To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
135a mixture of both boolean logic and arithmetic operations. Lexical bit streams typically mark multiple current parsing positions.
136Unlike the single-cursor approach of traditional text parsers, the marking of multiple lexical items allows Parabix to process multiple items in parallel.
137Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
138issues found during the parsing process. A $\tt 1$ bit in an error stream indicates the presence of a potential error that may require further
139processing to determine cause and severity.
140
141To form lexical bit streams we introduce two new operations: {\tt Advance} and {\tt ScanThru}.
142The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
143and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
144{\tt ScanThru} accepts two input parameters, $c$ and $m$, where $c$ denotes an initial
145set of cursor positions, and $m$ denotes a set of ``marked'' lexical item positions.
146The ScanThru operation determines the cursor positions immediately
147following any run of marker positions by calculating $(c + m) \land \lnot m$
148For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
149instances of it in the source text.
150We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
151$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
152token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
153the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
154token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' also fails to match the expected pattern.
155With additional post bit-stream processing, the erroneous cursors in $L_{0}$ and $L_{1}$ can be removed; the details
156of which go beyond the scope of this paper.
157
158\begin{figure}[htbp]
159{\footnotesize
160\begin{center}
161\begin{tabular}{@{~}l@{~}c@{~}l@{~}r}
162\multicolumn{3}{l}{source text}                         & \verb`<a><val> <str>  <>txt><err]` \\
163$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$                     & \verb`.1..111...111.....111..111.` \\
164$C_{1}$ & $=$ & $\texttt{[>]}$                          & \verb`..1....1.....1...1...1.....` \\
165$C_{2}$ & $=$ & $\texttt{[<]}$                          & \verb`1..1.....1......1.....1....` \\
166$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$               & \verb`.1..1.....1......1.....1...` \\
167$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$               & \verb`.................1.........` \\
168$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$       & \verb`..1....1.....1...1........1` \\
169$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$               & \verb`..........................1` \\
170\end{tabular}
171\end{center}
172}
173
174\caption{Lexical Parsing in Parabix}
175\label{fig:ParabixParsingExample}
176\end{figure}
177
178% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
179%
180% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
181% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
182% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
183% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
184% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
185% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
186% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
187% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
188% details, refer to the technical report \cite{Cameron2010}.
189
190Using this parallel bit stream approach, the vast majority of conditional branches
191used to identify key positions and/or syntax errors at each
192parsing position are mostly eliminated, which, as Section
193\ref{section:XML-branches} shows, minimizes branch misprediction
194penalties.  Accurate parsing and parallel lexical analysis is done
195through processor-friendly equations that require neither speculation
196nor multithreading.
197
198
199\subsection{Parabix Compilers}
200\label{parabix tool chain}
201
202To support the Parabix execution framework, we have developed a tool
203chain to the automate various aspects of parallel bit stream
204programming. Our tool chain consists of two compilers: a
205character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
206block-at-a-time processing compiler (\textit{Pablo}).
207
208The character class compiler is used to automatically produce bit
209stream logic for all the individual characters (e.g., delimiters) and
210character classes (e.g., digits, letters) used in a particular
211application.  Input is specified using a character class syntax
212adapted from the standard regular expression notations.  Output is a
213minimized set of three-address bitwise operations that compute each of
214the character classes from the basis bit streams.
215Figure \ref{fig:CCC} shows the input and output produced
216by the character class compiler for the example of \verb`[0-9]`
217discussed in the previous section.  The output operations may be
218viewed as operations on a single block of input at a time, or may be
219viewed as operations on unbounded bit streams as supported by the Pablo
220compiler.
221
222\begin{figure}[tbh]
223{\footnotesize
224\begin{center}
225
226\begin{tabular}{@{~}r@{~}l@{~}}
227\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
228
229\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
230& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
231& \verb`temp3 = (temp2 &~ temp1)` \\
232& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
233& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
234& \verb`digit = (temp3 &~ temp5)`
235\end{tabular}
236
237\end{center}
238}
239\caption{Character Class Compiler Input/Output}
240\label{fig:CCC}
241\end{figure}
242
243\begin{figure}[tbh]
244{\footnotesize
245\begin{center}
246
247\begin{tabular}{l}
248\ttfamily{\bfseries INPUT:} \\
249\verb`def parse_tags(classes, errors):` \\
250\verb`  classes.C0 = Alpha` \\
251\verb`  classes.C1 = Rangle` \\
252\verb`  classes.C2 = Langle` \\
253\verb`  L0 = bitutil.Advance(C2)` \\
254\verb`  errors.E0 = L0 &~ C0` \\
255\verb`  L1 = bitutil.ScanThru(L0, C0)` \\
256\verb`  errors.E1 = L1 &~ C1` \\ \\
257
258\ttfamily{\bfseries OUTPUT:} \\
259\verb`struct Parse_tags {` \\
260\verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
261\verb`  void do_block(Classes & classes, Errors & errors){` \\
262\verb`    BitBlock L0, L1;` \\
263\verb`    classes.C0 = Alpha;` \\
264\verb`    classes.C1 = Rangle;` \\
265\verb`    classes.C2 = Langle;` \\
266\verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
267\verb`    errors.E0 = simd_andc(L0, C0);` \\
268\verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
269\verb`    errors.E1 = simd_andc(L1, C1);` \\
270\verb`    CarryQ_Adjust(carryQ, 2);` \\
271\verb}` \\
272\verb`  CarryDeclare(carryQ, 2);` \\
273\verb`};` \\
274\end{tabular}
275
276\end{center}
277}
278\caption{Parallel Block Compiler (Pablo) Input/Output}
279\label{fig:Pablo}
280\end{figure}
281
282\pagebreak
283The Pablo compiler abstracts away the details of
284programming parallel bit stream code in terms of finite
285SIMD register widths and application buffer sizes.
286Input to Pablo is a language for expressing bit stream operations
287on unbounded bit streams.  The operations include bitwise
288logic, the {\tt Advance} and {\tt ScanThru} operations described in the
289previous subsection as well as if and while control structures.
290Pablo translates these operations to block-at-a-time
291code in C/C++.
292
293%, where the block size is the register width for SIMD operations on the selected target architecture.
294The key functionality of Pablo is to arrange for block-to-block
295carry bit propagation to implement the long bit stream shift
296and addition operations required by
297{\tt Advance} and {\tt ScanThru}.
298
299For example, we can translate the simple parsing example
300of \ref{fig:ParabixParsingExample} above into Pablo code
301to produce the output as shown in Figure \ref{fig:Pablo}.
302In this example, Pablo has the primary responsibility of inserting
303carry variable declarations that allow the results of
304{\tt Advance} and {\tt ScanThru} operations to be carried over from
305block-to-block.  A separate carry variable is required for every
306{\tt Advance} or {\tt ScanThru} operation.  A function containing
307such operations is translated into a public C++ class (struct),
308which includes a Carry Queue to hold all the carry
309variables from iteration to iteration, together with the
310a method {\tt do\_block} to implement the processing
311for a single block (based on the SIMD register width).
312Macros {\tt CarryDeclare} and {\tt CarryInit} declare and
313initialize the Carry Queue structure depending on the
314specific architecture and Carry Queue representation.
315The unbounded bit stream {\tt Advance} and {\tt ScanThru}
316operations are translated into block-wise equivalents
317with explicit carry-in and carry-out processing. 
318At the end of each block, the {\tt CarryQ\_Adjust}
319operation implements any necessary adjustment of the
320Carry Queue to prepare for the next iteration.
321The Pablo language and compiler also support conditional
322and iterative bit stream logic on unbounded streams
323(if and while constructs) which involves additional
324carry-test insertion in control branches.
325A complete explanation of the Pablo translation
326is beyond the scope of this paper.
327
328\subsection{Parabix Runtime Libraries}
329
330The Parabix architecture also includes runtime libraries that support
331a machine-independent view of basic SIMD operations, as well as a set
332of core function libraries. For portability, we program all SIMD operations against
333an abstract SIMD machine representation, parameterized on SIMD
334field and register width. The abstract machine
335supports all power-of-2 field widths up to the full SIMD register
336width on a target machine.  Let $w = 2k$ be the field width in
337bits. Let $f$ be a basic binary operation defined on $w$-bit
338quantities producing an $w$-bit result. Let $W$ be the SIMD vector
339size in bits where $W = 2K$.  Then the C++ template notation
340\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
341SIMD operation yielding an output SIMD vector $v$, given two input
342SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
343computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
344\verb`simd<8>::add(a,b)` represents the simultaneous addition of
345sixteen 8-bit fields.
346
347We have ported Parabix to a wide variety of processor architectures
348demonstrating its applicability to commodity SIMD hardware. We
349currently take advantage of the 128-bit Altivec operations on the
350Power PC, 64-bit MMX and 128-bit SSE operations on previous generation
351Intel platforms, the latest 256-bit AVX extensions on the \SB{}
352processor, and finally the 128-bit \NEON{} operations on ARM.
353
Note: See TracBrowser for help on using the repository browser.