source: docs/HPCA2012/03-research.tex @ 1782

Last change on this file since 1782 was 1693, checked in by lindanl, 8 years ago

Minor changes.

File size: 19.1 KB
Line 
1\section{The Parabix Framework}
2\label{section:parabix}
3
4This section presents an overview of the SIMD-based parallel bit
5stream text processing framework, \emph{Parabix}.  The framework has
6three components: (1) a unifying architectural view of text processing in
7terms of parallel bit streams, (2) a tool chain for automating the
8generation of parallel bit stream code from higher-level
9specifications, and (3) a runtime environment, which provides a portable SIMD
10programming abstraction that is independent of the specific facilities
11available on particular target architectures.
12
13
14\subsection{Parallel Bit Streams}
15
16
17The fundamental difference between the Parabix framework and
18traditional text processing models is in how Parabix represents the
19source data.  Given a traditional byte-oriented text stream, Parabix
20first transposes the text data to a transform domain consisting of 8
21parallel bit streams, known as {\em basis bit streams}.  In essence,
22each basis bit stream $b_{k}$ represents the stream of $k$-th bit of
23each byte in the source text.  That is, the $k$-th bit of $i$-th byte
24in the source text is in the $i$-th (bit) position of the $k$-th basis
25bit stream, $b_{k}$.  For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
26  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
27  \ldots 7}$. The bits used to construct $\tt b_7$ have been
28highlighted in this example.
29
30\begin{figure}[h]
31\begin{center}
32\begin{tabular}{r c c c c }
33STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
34ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
35\hline
36\end{tabular}
37\end{center}
38\begin{center}
39\begin{tabular}{r |c |c |c |c |c |c |c |c |}
40 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
41 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
42 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
43 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
44 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
45\end{tabular}
46\end{center}
47
48\caption{Example 7-bit ASCII Basis Bit Streams}
49\label{fig:BitStreamsExample}
50\end{figure}
51
52The advantage of the parallel bit stream representation is that we can
53use the 128-bit SIMD registers commonly found on commodity processors
54(e.g. SSE on Intel) to process 128 byte positions at a time using
55bitwise logic, shifting and other operations.
56
57Just as forward and inverse Fourier transforms are used to transform
58between the time and frequency domains in signal processing, bit
59stream transposition and inverse transposition provides ``byte space''
60and ``bit space'' views of text.  The goal of the Parabix framework is
61to support efficient text processing using these two equivalent
62representations in the same way that efficient signal processing
63benefits from the use of the frequency domain in some cases and the
64time domain in others.
65
66In the Parabix framework, basis bit streams are used as the starting
67point to determine other bit streams.  In particular, Parabix uses the
68basis bit streams to construct \emph{character-class bit streams} in
69which each $\tt 1$ bit indicates the presence of a significant
70character (or class of characters) in the parsing process.
71Character-class bit streams may then be used to compute \emph{lexical
72  bit streams} and \emph{error bit streams}, which Parabix uses to
73process and validate the source document. The remainder of this
74section will discuss each type of bit stream.
75
76
77{\bf Basis Bit Streams:}
78To construct the basis bit streams, the source data is first loaded in
79sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
80so that Parabix can efficiently produce the character-class bit streams.
81Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
82of approximately 1 cycle per byte.
83%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
84%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
85%as UNICODE, may use a multiple code units to express all of its possible code points.
86%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
87%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
88%128 code points within it.
89
90% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
91% Basis bit streams
92% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
93% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
94% Figure \ref{fig:bit streamsExample} presents an example of the basis bit stream representation of 8-bit
95% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
96% by periods as to emphasize the $1$ bits.
97
98{\bf Character-class Bit Streams:}
99Typically, as text parsers process input data, they locate specific
100characters to determine if and when to transition between data and
101metadata parsing.
102% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
103For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
104Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set
105of known significant characters and branching appropriately when one is found, typically using an if or switch statement.
106Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
107to do so correctly.
108
109% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
110
111Character-class bit streams allow us to perform up to 128
112``comparisons'' in parallel with a single operation by using a series
113of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
114  denote the boolean AND, OR and NOT operations.}  to merge multiple
115basis bit streams into a single character-class stream that marks the
116positions of key characters with a $1$.  For example, a character is
117an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
118b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Classes of
119characters can be found with similar formulas.  For example, a
120character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
121\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
122important observation here is that a range of characters can sometimes
123take fewer operations and require fewer basis bit streams to compute
124than individual characters.  Finding optimal solutions to all
125character-classes is non-trivial and goes beyond the scope of this
126paper.
127
128% The advantage of character-class bit streams over traditional ``byte streams'' is that
129
130% Rather than testing each byte individually,
131% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
132
133
134{\bf Lexical and Error Bit Streams:}
135To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
136a mixture of both boolean logic and arithmetic operations. Lexical bit streams typically mark multiple current parsing positions.
137Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
138Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
139issues found during the parsing process. The presence of a $\tt 1$ in an error stream indicates that the lexical stream cannot be
140trusted to be completely accurate and it may be necessary to perform some sequential parsing on that section to determine the cause and severity
141of the error. %How errors are handled depends on the logical implications of the error and go beyond the scope of this paper.
142
143To form lexical bit streams, we have to introduce a few new operations: {\tt Advance} and {\tt ScanThru}.
144The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
145and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
146{\tt ScanThru} accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
147$\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$
148For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
149instances of it in the source text.
150We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
151$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
152token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
153the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
154token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
155With additional post bit-stream processing, the erroneous cursors in $L_{0}$ and $L_{1}$ can be removed; the details
156of which go beyond the scope of this paper.
157
158\begin{figure*}[htbp]
159
160\begin{center}
161\begin{tabular}{lclr}
162\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
163$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
164$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
165$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
166$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
167$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
168$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
169$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
170\end{tabular}
171\end{center}
172
173
174\caption{Lexical Parsing in Parabix}
175\label{fig:ParabixParsingExample}
176\end{figure*}
177
178% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
179%
180% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
181% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
182% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
183% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
184% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
185% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
186% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
187% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
188% details, refer to the technical report \cite{Cameron2010}.
189
190Using this parallel bit stream approach, conditional branch statements
191used to identify key positions and/or syntax errors at each
192parsing position are mostly eliminated, which, as Section
193\ref{section:XML-branches} shows, minimizes branch misprediction
194penalties.  Accurate parsing and parallel lexical analysis is done
195through processor-friendly equations that require neither speculation
196nor multithreading.
197
198
199\subsection{Parabix Compilers}
200\label{parabix tool chain}
201
202To support the Parabix execution framework, we have developed a tool
203chain to the automate various aspects of parallel bit stream
204programming. Our tool chain consists of two compilers: a
205character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
206block-at-a-time processing compiler (\textit{Pablo}).
207
208The character class compiler is used to automatically produce bit
209stream logic for all the individual characters (e.g., delimiters) and
210character classes (e.g., digits, letters) used in a particular
211application.  Input is specified using a character class syntax
212adapted from the standard regular expression notations.  Output is a
213minimized set of three-address bitwise operations to compute each of
214the character classes from the basis bit streams.
215
216
217For example, Figure \ref{fig:CCC} shows the input and output produced
218by the character class compiler for the example of \verb`[0-9]`
219discussed in the previous section.  The output operations may be
220viewed as operations on a single block of input at a time, or may be
221viewed as operations on unbounded bit streams as supported by the Pablo
222compiler.
223
224\begin{figure}[tbh]
225{\footnotesize
226\begin{center}
227
228\begin{tabular}{r l}
229\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
230
231\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
232& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
233& \verb`temp3 = (temp2 &~ temp1)` \\
234& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
235& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
236& \verb`digit = (temp3 &~ temp5)`
237\end{tabular}
238
239\end{center}
240}
241\caption{Character Class Compiler Input/Output}
242\label{fig:CCC}
243\end{figure}
244
245\begin{figure}[tbh]
246{\footnotesize
247\begin{center}
248
249\begin{tabular}{l}
250\ttfamily{\bfseries INPUT:} \\
251\verb`def parse_tags(classes, errors):` \\
252\verb`  classes.C0 = Alpha` \\
253\verb`  classes.C1 = Rangle` \\
254\verb`  classes.C2 = Langle` \\
255\verb`  L0 = bitutil.Advance(C2)` \\
256\verb`  errors.E0 = L0 &~ C0` \\
257\verb`  L1 = bitutil.ScanThru(L0, C0)` \\
258\verb`  errors.E1 = L1 &~ C1` \\ \\
259
260\ttfamily{\bfseries OUTPUT:} \\
261\verb`struct Parse_tags {` \\
262\verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
263\verb`  void do_block(Classes & classes, Errors & errors){` \\
264\verb`    BitBlock L0, L1;` \\
265\verb`    classes.C0 = Alpha;` \\
266\verb`    classes.C1 = Rangle;` \\
267\verb`    classes.C2 = Langle;` \\
268\verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
269\verb`    errors.E0 = simd_andc(L0, C0);` \\
270\verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
271\verb`    errors.E1 = simd_andc(L1, C1);` \\
272\verb`    CarryQ_Adjust(carryQ, 2);` \\
273\verb}` \\
274\verb`  CarryDeclare(carryQ, 2);` \\
275\verb`};` \\
276\end{tabular}
277
278\end{center}
279}
280\caption{Parallel Block Compiler (Pablo) Input/Output}
281\label{fig:Pablo}
282\end{figure}
283
284
285The Pablo compiler abstracts away the details of
286programming parallel bit stream code in terms of finite
287SIMD register widths and application buffer sizes.
288Input to Pablo is a language for expressing bit stream operations
289on unbounded bit streams.  The operations include bitwise
290logic, the {\tt Advance} and {\tt ScanThru} operations described in the
291previous subsection as well as if and while control structures.
292Pablo translates these operations to block-at-a-time
293code in C/C++.
294
295%, where the block size is the register width for SIMD operations on the selected target architecture.
296The key functionality of Pablo is to arrange for block-to-block
297carry bit propagation to implement the long bit stream shift
298and addition operations required by
299{\tt Advance} and {\tt ScanThru}.
300
301For example, we can translate the simple parsing example
302of \ref{fig:ParabixParsingExample} above into Pablo code
303to produce the output as shown in Figure \ref{fig:Pablo}.
304In this example, Pablo has the primary responsibility of inserting
305carry variable declarations that allow the results of
306{\tt Advance} and {\tt ScanThru} operations to be carried over from
307block to block.  A separate carry variable is required for every
308{\tt Advance} or {\tt ScanThru} operation.  A function containing
309such operations is translated into a public C++ class (struct),
310which includes a Carry Queue to hold all the carry
311variables from iteration to iteration, together with the
312a method {\tt do\_block} to implement the processing
313for a single block (based on the SIMD register width).
314Macros {\tt CarryDeclare} and {\tt CarryInit} declare and
315initialize the Carry Queue structure depending on the
316specific architecture and Carry Queue representation.
317The unbounded bit stream {\tt Advance} and {\tt ScanThru}
318operations are translated into block-by-block equivalents
319with explicit carry-in and carry-out processing. 
320At the end of each block, the {\tt CarryQ\_Adjust}
321operation implements any necessary adjustment of the
322Carry Queue to prepare for the next iteration.
323The Pablo language and compiler also support conditional
324and iterative bit stream logic on unbounded streams
325(if and while constructs) which involves additional
326carry-test insertion in control branches.
327Explaining the full details of the translation
328is beyond the scope of this paper.
329
330\subsection{Parabix Runtime Libraries}
331
332The Parabix architecture also includes runtime libraries that support
333a machine-independent view of basic SIMD operations, as well as a set
334of core function libraries.  For machine-independence, we program all
335operations using an abstract SIMD machine.  The abstract machine
336supports all power-of-2 field widths up to the full SIMD register
337width on a target machine.  Let $w = 2k$ be the field width in
338bits. Let $f$ be a basic binary operation defined on $w$-bit
339quantities producing an $w$-bit result. Let $W$ be the SIMD vector
340size in bits where $W = 2K$.  Then the C++ template notation
341\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
342SIMD operation yielding an output SIMD vector $v$, given two input
343SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
344computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
345\verb`simd<8>::add(a,b)` represents the simultaneous addition of
346sixteen 8-bit fields.
347
348We have ported Parabix to a wide variety of processor architectures
349demonstrating its applicability to commodity SIMD hardware. We
350currently take advantage of the 128-bit Altivec operations on the
351Power PC, 64-bit MMX and 128-bit SSE operations on previous generation
352Intel platforms, the latest 256-bit AVX extensions on the Sandybridge
353processor, and finally the 128-bit \NEON{} operations on ARM.
354
Note: See TracBrowser for help on using the repository browser.