source: docs/HPCA2012/final_ieee/03-research.tex @ 1774

Last change on this file since 1774 was 1774, checked in by lindanl, 8 years ago

minor changes

File size: 19.0 KB
Line 
1\section{The Parabix Framework}
2\label{section:parabix}
3
4This section presents an overview of the SIMD-based parallel bit
5stream text processing framework, \emph{Parabix}.  The framework has
6three components: (1) a unifying architectural view of text processing in
7terms of parallel bit streams, (2) a tool chain for automating the
8generation of parallel bit stream code from higher-level
9specifications, and (3) a runtime environment, which provides a portable SIMD
10programming abstraction that is independent of the specific facilities
11available on particular target architectures.
12
13
14\subsection{Parallel Bit Streams}
15
16
17The fundamental difference between the Parabix framework and
18traditional text processing models is in how Parabix represents the
19source data.  Given a traditional byte-oriented text stream, Parabix
20first transposes the text data into a transform representation consisting of 8
21parallel bit streams, known as {\em basis bit streams} wherein
22basis bit stream $b_{k}$ represents the stream of $k$-th bit of
23each byte in the source text.  That is, the $k$-th bit of $i$-th byte
24in the source text is in one-to-one correspondence with the $i$-th bit of the $k$-th basis
25bit stream, $b_{k}$.  For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
26  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
27  \ldots 7}$. The bits used to construct $\tt b_7$ are
28highlighted in this example.
29
30\begin{figure}[h]
31\begin{center}
32\begin{tabular}{r c c c c }
33STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
34ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
35\hline
36\end{tabular}
37\end{center}
38\begin{center}
39\begin{tabular}{r |c |c |c |c |c |c |c |c |}
40 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
41 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
42 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
43 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
44 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
45\end{tabular}
46\end{center}
47
48\caption{Example 7-bit ASCII Basis Bit Streams}
49\label{fig:BitStreamsExample}
50\end{figure}
51
52The advantage of the parallel bit stream representation is that we can
53use the 128-bit SIMD registers commonly found on commodity processors
54(e.g. SSE on Intel) to process 128 byte positions at a time using
55bitwise logical, shift and arithmetic operations.
56
57Just as forward and inverse Fourier transforms are used to transform
58between the time and frequency domains in signal processing, bit
59stream transposition and inverse transposition provides ``byte space''
60and ``bit space'' domains for text.  The goal of the Parabix framework is
61to support efficient text processing using these two equivalent
62representations in the analogous way that efficient signal processing
63benefits from the use of the frequency domain in some cases and the
64time domain in others.
65
66In the Parabix framework, basis bit streams are used as the starting
67point to determine other bit streams.  In particular, Parabix uses the
68basis bit streams to construct \emph{character-class bit streams} in
69which each $\tt 1$ bit indicates the presence of a significant
70character (or class of characters) in the parsing process.
71Character-class bit streams may then be used to compute \emph{lexical
72  bit streams} and \emph{error bit streams}, which Parabix uses to
73process and validate the source document. The remainder of this
74section will discuss each type of bit stream.
75
76
77{\bf Basis Bit Streams:}
78To construct the basis bit streams, the source data is first loaded in
79sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
80so that Parabix can efficiently produce the character-class bit streams.
81Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
82of approximately 1 cycle per byte.
83%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
84%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
85%as UNICODE, may use a multiple code units to express all of its possible code points.
86%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
87%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
88%128 code points within it.
89
90% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
91% Basis bit streams
92% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
93% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
94% Figure \ref{fig:bit streamsExample} presents an example of the basis bit stream representation of 8-bit
95% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
96% by periods as to emphasize the $1$ bits.
97
98{\bf Character-class Bit Streams:}
99Typically, as text parsers process input data, they locate specific
100characters to determine if and when to transition between data and
101metadata parsing.
102% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
103For example, in XML, any opening angle bracket character, `\verb`<`', may indicate the start of a markup tag.
104Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set
105of known significant characters and branching appropriately when one is found, typically using an if or switch statement.
106Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
107to do so correctly.
108
109% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
110
111Character-class bit streams enable up to 128
112``comparisons'' in parallel through a
113series of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
114  denote the boolean AND, OR and NOT operations.}  that merge multiple basis
115bit streams into a single character-class stream that marks the
116positions of key characters. For example, a character is
117an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
118b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Addition character
119classes can be determined with similar formulas.  For example, a
120character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
121\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
122important observation here is that a range of characters can sometimes
123take fewer operations and require fewer basis bit streams to compute
124than individual characters.  Finding optimal solutions to all
125character-classes is non-trivial and goes beyond the scope of this
126paper.
127
128% The advantage of character-class bit streams over traditional ``byte streams'' is that
129
130% Rather than testing each byte individually,
131% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
132
133
134{\bf Lexical and Error Bit Streams:}
135To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
136a mixture of both boolean logic and arithmetic operations. Lexical bit streams typically mark multiple current parsing positions.
137Unlike the single-cursor approach of traditional text parsers, the marking of multiple lexical items allows Parabix to process multiple items in parallel.
138Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
139issues found during the parsing process. A $\tt 1$ bit in an error stream indicates the precense of a potential error that may require further
140processing to determine cause and severity.
141
142To form lexical bit streams we introduce two new operations: {\tt Advance} and {\tt ScanThru}.
143The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
144and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
145{\tt ScanThru} accepts two input parameters, $c$ and $m$, where $c$ denotes an initial
146set of cursor positions, and $m$ denotes a set of ``marked'' lexical item positions.
147The ScanThru operation determines the cursor positions immediately
148following any run of marker positions by calculating $(c + m) \land \lnot m$
149For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
150instances of it in the source text.
151We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
152$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
153token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
154the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
155token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' also fails to match the expected pattern.
156With additional post bit-stream processing, the erroneous cursors in $L_{0}$ and $L_{1}$ can be removed; the details
157of which go beyond the scope of this paper.
158
159\begin{figure}[htbp]
160{\footnotesize
161\begin{center}
162\begin{tabular}{@{~}l@{~}c@{~}l@{~}r}
163\multicolumn{3}{l}{source text}                         & \verb`<a><val> <str>  <>txt><err]` \\
164$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$                     & \verb`.1..111...111.....111..111.` \\
165$C_{1}$ & $=$ & $\texttt{[>]}$                          & \verb`..1....1.....1...1...1.....` \\
166$C_{2}$ & $=$ & $\texttt{[<]}$                          & \verb`1..1.....1......1.....1....` \\
167$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$               & \verb`.1..1.....1......1.....1...` \\
168$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$               & \verb`.................1.........` \\
169$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$       & \verb`..1....1.....1...1........1` \\
170$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$               & \verb`..........................1` \\
171\end{tabular}
172\end{center}
173}
174
175\caption{Lexical Parsing in Parabix}
176\label{fig:ParabixParsingExample}
177\end{figure}
178
179% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
180%
181% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
182% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
183% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
184% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
185% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
186% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
187% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
188% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
189% details, refer to the technical report \cite{Cameron2010}.
190
191Using this parallel bit stream approach, the vast majority of conditional branches
192used to identify key positions and/or syntax errors at each
193parsing position are mostly eliminated, which, as Section
194\ref{section:XML-branches} shows, minimizes branch misprediction
195penalties.  Accurate parsing and parallel lexical analysis is done
196through processor-friendly equations that require neither speculation
197nor multithreading.
198
199
200\subsection{Parabix Compilers}
201\label{parabix tool chain}
202
203To support the Parabix execution framework, we have developed a tool
204chain to the automate various aspects of parallel bit stream
205programming. Our tool chain consists of two compilers: a
206character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
207block-at-a-time processing compiler (\textit{Pablo}).
208
209The character class compiler is used to automatically produce bit
210stream logic for all the individual characters (e.g., delimiters) and
211character classes (e.g., digits, letters) used in a particular
212application.  Input is specified using a character class syntax
213adapted from the standard regular expression notations.  Output is a
214minimized set of three-address bitwise operations that compute each of
215the character classes from the basis bit streams.
216Figure \ref{fig:CCC} shows the input and output produced
217by the character class compiler for the example of \verb`[0-9]`
218discussed in the previous section.  The output operations may be
219viewed as operations on a single block of input at a time, or may be
220viewed as operations on unbounded bit streams as supported by the Pablo
221compiler.
222
223\begin{figure}[tbh]
224{\footnotesize
225\begin{center}
226
227\begin{tabular}{@{~}r@{~}l@{~}}
228\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
229
230\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
231& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
232& \verb`temp3 = (temp2 &~ temp1)` \\
233& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
234& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
235& \verb`digit = (temp3 &~ temp5)`
236\end{tabular}
237
238\end{center}
239}
240\caption{Character Class Compiler Input/Output}
241\label{fig:CCC}
242\end{figure}
243
244\begin{figure}[tbh]
245{\footnotesize
246\begin{center}
247
248\begin{tabular}{l}
249\ttfamily{\bfseries INPUT:} \\
250\verb`def parse_tags(classes, errors):` \\
251\verb`  classes.C0 = Alpha` \\
252\verb`  classes.C1 = Rangle` \\
253\verb`  classes.C2 = Langle` \\
254\verb`  L0 = bitutil.Advance(C2)` \\
255\verb`  errors.E0 = L0 &~ C0` \\
256\verb`  L1 = bitutil.ScanThru(L0, C0)` \\
257\verb`  errors.E1 = L1 &~ C1` \\ \\
258
259\ttfamily{\bfseries OUTPUT:} \\
260\verb`struct Parse_tags {` \\
261\verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
262\verb`  void do_block(Classes & classes, Errors & errors){` \\
263\verb`    BitBlock L0, L1;` \\
264\verb`    classes.C0 = Alpha;` \\
265\verb`    classes.C1 = Rangle;` \\
266\verb`    classes.C2 = Langle;` \\
267\verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
268\verb`    errors.E0 = simd_andc(L0, C0);` \\
269\verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
270\verb`    errors.E1 = simd_andc(L1, C1);` \\
271\verb`    CarryQ_Adjust(carryQ, 2);` \\
272\verb}` \\
273\verb`  CarryDeclare(carryQ, 2);` \\
274\verb`};` \\
275\end{tabular}
276
277\end{center}
278}
279\caption{Parallel Block Compiler (Pablo) Input/Output}
280\label{fig:Pablo}
281\end{figure}
282
283
284The Pablo compiler abstracts away the details of
285programming parallel bit stream code in terms of finite
286SIMD register widths and application buffer sizes.
287Input to Pablo is a language for expressing bit stream operations
288on unbounded bit streams.  The operations include bitwise
289logic, the {\tt Advance} and {\tt ScanThru} operations described in the
290previous subsection as well as if and while control structures.
291Pablo translates these operations to block-at-a-time
292code in C/C++.
293
294%, where the block size is the register width for SIMD operations on the selected target architecture.
295The key functionality of Pablo is to arrange for block-to-block
296carry bit propagation to implement the long bit stream shift
297and addition operations required by
298{\tt Advance} and {\tt ScanThru}.
299
300For example, we can translate the simple parsing example
301of \ref{fig:ParabixParsingExample} above into Pablo code
302to produce the output as shown in Figure \ref{fig:Pablo}.
303In this example, Pablo has the primary responsibility of inserting
304carry variable declarations that allow the results of
305{\tt Advance} and {\tt ScanThru} operations to be carried over from
306block-to-block.  A separate carry variable is required for every
307{\tt Advance} or {\tt ScanThru} operation.  A function containing
308such operations is translated into a public C++ class (struct),
309which includes a Carry Queue to hold all the carry
310variables from iteration to iteration, together with the
311a method {\tt do\_block} to implement the processing
312for a single block (based on the SIMD register width).
313Macros {\tt CarryDeclare} and {\tt CarryInit} declare and
314initialize the Carry Queue structure depending on the
315specific architecture and Carry Queue representation.
316The unbounded bit stream {\tt Advance} and {\tt ScanThru}
317operations are translated into block-wise equivalents
318with explicit carry-in and carry-out processing. 
319At the end of each block, the {\tt CarryQ\_Adjust}
320operation implements any necessary adjustment of the
321Carry Queue to prepare for the next iteration.
322The Pablo language and compiler also support conditional
323and iterative bit stream logic on unbounded streams
324(if and while constructs) which involves additional
325carry-test insertion in control branches.
326A complete explanation of the Pablo translation
327is beyond the scope of this paper.
328
329\subsection{Parabix Runtime Libraries}
330
331The Parabix architecture also includes runtime libraries that support
332a machine-independent view of basic SIMD operations, as well as a set
333of core function libraries. For portability, we program all SIMD operations against
334an abstract SIMD machine representation, parameterized on SIMD
335field and register width. The abstract machine
336supports all power-of-2 field widths up to the full SIMD register
337width on a target machine.  Let $w = 2k$ be the field width in
338bits. Let $f$ be a basic binary operation defined on $w$-bit
339quantities producing an $w$-bit result. Let $W$ be the SIMD vector
340size in bits where $W = 2K$.  Then the C++ template notation
341\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
342SIMD operation yielding an output SIMD vector $v$, given two input
343SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
344computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
345\verb`simd<8>::add(a,b)` represents the simultaneous addition of
346sixteen 8-bit fields.
347
348We have ported Parabix to a wide variety of processor architectures
349demonstrating its applicability to commodity SIMD hardware. We
350currently take advantage of the 128-bit Altivec operations on the
351Power PC, 64-bit MMX and 128-bit SSE operations on previous generation
352Intel platforms, the latest 256-bit AVX extensions on the \SB{}
353processor, and finally the 128-bit \NEON{} operations on ARM.
354
Note: See TracBrowser for help on using the repository browser.