source: docs/HPCA2012/03-research.tex @ 1419

Last change on this file since 1419 was 1417, checked in by ksherdy, 8 years ago

fixed typo; minor edit

File size: 19.1 KB
RevLine 
[1374]1\section{The Parabix Framework}
[1302]2\label{section:parabix}
[1354]3
[1393]4This section presents an overview of the SIMD-based parallel bit
5stream text processing framework, \emph{Parabix}.  The framework has
[1417]6three components: (1) a unifying architectural view of text processing in
7terms of parallel bit streams; (2) a tool chain for automating the
[1393]8generation of parallel bit stream code from higher-level
[1417]9specifications, and (3) a run-time environment, which provides a portable SIMD
10programming abstraction that is independent of the specific facilities
[1393]11available on particular target architectures.
[1302]12
13
[1374]14\subsection{Parallel Bit Streams}
[1302]15
[1397]16
[1393]17The fundamental difference between the Parabix framework and
18traditional text processing models is in how Parabix represents the
19source data.  Given a traditional byte-oriented text stream, Parabix
20first transposes the text data to a transform domain consisting of 8
21parallel bit streams, known as {\em basis bit streams}.  In essence,
22each basis bit stream $b_{k}$ represents the stream of $k$-th bit of
23each byte in the source text.  That is, the $k$-th bit of $i$-th byte
24in the source text is in the $i$-th (bit) position of the $k$-th basis
[1398]25bit stream, $b_{k}$.  For example, in Figure\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
[1393]26  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
27  \ldots 7}$. The bits used to construct $\tt b_7$ have been
28highlighted in this example.
[1354]29
[1302]30\begin{figure}[h]
[1354]31
[1302]32\begin{center}
[1354]33\begin{tabular}{r c c c c }
34STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
35ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
36\hline
[1302]37\end{tabular}
38\end{center}
[1354]39\begin{center}
40\begin{tabular}{r |c |c |c |c |c |c |c |c |}
41 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
42 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
43 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
44 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
45 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
46\end{tabular}
47\end{center}
48
49\caption{Example 7-bit ASCII Basis Bit Streams}
[1398]50\label{fig:BitStreamsExample}
[1302]51\end{figure}
52
[1374]53The advantage of the parallel bit stream representation is that we can
[1393]54use the 128-bit SIMD registers commonly found on commodity processors
55(e.g. SSE on Intel) to process 128 byte positions at a time using
56bitwise logic, shifting and other operations.
[1302]57
[1374]58Just as forward and inverse Fourier transforms are used to transform
[1393]59between the time and frequency domains in signal processing, bit
60stream transposition and inverse transposition provides ``byte space''
61and ``bit space'' views of text.  The goal of the Parabix framework is
62to support efficient text processing using these two equivalent
63representations in the same way that efficient signal processing
64benefits from the use of the frequency domain in some cases and the
65time domain in others.
[1374]66
67In the Parabix framework, basis bit streams are used as the starting
[1393]68point to determine other bit streams.  In particular, Parabix uses the
69basis bit streams to construct \emph{character-class bit streams} in
[1407]70which each $\tt 1$ bit indicates the presence of a significant
[1393]71character (or class of characters) in the parsing process.
72Character-class bit streams may then be used to compute \emph{lexical
73  bit streams} and \emph{error bit streams}, which Parabix uses to
74process and validate the source document. The remainder of this
75section will discuss each type of bit stream.
[1374]76
[1397]77
[1374]78{\bf Basis Bit Streams:}
79To construct the basis bit streams, the source data is first loaded in
80sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
81so that Parabix can efficiently produce the character-class bit streams.
82Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
[1393]83of approximately 1 cycle per byte.
[1374]84%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
85%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
86%as UNICODE, may use a multiple code units to express all of its possible code points.
87%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
88%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
89%128 code points within it.
90
[1354]91% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
92% Basis bit streams
93% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
94% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
[1397]95% Figure \ref{fig:bit streamsExample} presents an example of the basis bit stream representation of 8-bit
[1354]96% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
97% by periods as to emphasize the $1$ bits.
[1302]98
[1354]99{\bf Character-class Bit Streams:}
[1393]100Typically, as text parsers process input data, they locate specific
101characters to determine if and when to transition between data and
102metadata parsing.
[1354]103% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
[1397]104For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
105Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set
106of known significant characters and branching appropriately when one is found, typically using an if or switch statement.
107Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
108to do so correctly.
109
[1354]110% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
[1302]111
[1393]112Character-class bit streams allow us to perform up to 128
113``comparisons'' in parallel with a single operation by using a series
114of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
115  denote the boolean AND, OR and NOT operations.}  to merge multiple
116basis bit streams into a single character-class stream that marks the
117positions of key characters with a $1$.  For example, a character is
118an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
119b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Classes of
120characters can be found with similar formulas.  For example, a
121character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
122\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
123important observation here is that a range of characters can sometimes
124take fewer operations and require fewer basis bit streams to compute
125than individual characters.  Finding optimal solutions to all
126character-classes is non-trivial and goes beyond the scope of this
127paper.
[1354]128
129% The advantage of character-class bit streams over traditional ``byte streams'' is that
130
131% Rather than testing each byte individually,
132% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
133
134
[1397]135{\bf Lexical and Error Bit Streams:}
136To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
137a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
138Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
139Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
[1407]140issues found during the parsing process. The presence of a $\tt 1$ in an error stream indicates that the lexical stream cannot be
[1397]141trusted to be completely accurate and it may be necessary to perform some sequential parsing on that section to determine the cause and severity
142of the error. %How errors are handled depends on the logical implications of the error and go beyond the scope of this paper.
[1354]143
[1397]144To form lexical bit streams, we have to introduce a few new operations: {\tt Advance} and {\tt ScanThru}.
145The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
146and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
147{\tt ScanThru} accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
148$\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$
149For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
150instances of it in the source text.
151We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
152$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
153token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
154the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
155token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
156With additional post bit-stream processing, the erroneous cursors in $L_{0}$ and $L_{1}$ can be removed; the details
157of which go beyond the scope of this paper.
[1354]158
[1302]159\begin{figure}[h]
[1354]160
[1302]161\begin{center}
[1354]162\begin{tabular}{lclr}
163\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
164$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
165$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
166$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
167$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
168$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
169$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
170$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
[1302]171\end{tabular}
172\end{center}
[1354]173
174
175\caption{Lexical Parsing in Parabix}
176\label{fig:ParabixParsingExample}
[1302]177\end{figure}
178
[1354]179% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
180%
181% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
182% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
183% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
184% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
185% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
186% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
187% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
188% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
189% details, refer to the technical report \cite{Cameron2010}.
[1302]190
[1393]191Using this parallel bit stream approach, conditional branch statements
192used to identify key positions and/or syntax errors at each each
193parsing position are mostly eliminated, which, as Section
194\ref{section:XML-branches} shows, minimizes branch misprediction
195penalties.  Accurate parsing and parallel lexical analysis is done
[1417]196through processor-friendly equations that require neither speculation
[1393]197nor multithreading.
[1302]198
199
[1393]200\subsection{Parabix Compilers}
[1368]201\label{parabix tool chain}
[1354]202
[1393]203To support the Parabix execution framework, we have developed a tool
204chain to the automate various aspects of parallel bit stream
205programming. Our tool chain consists of two compilers: a
206character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
207block-at-a-time processing compiler (\textit{Pablo}).
[1354]208
[1393]209The character class compiler is used to automatically produce bit
210stream logic for all the individual characters (e.g., delimiters) and
211character classes (e.g., digits, letters) used in a particular
212application.  Input is specified using a character class syntax
213adapted from the standard regular expression notations.  Output is a
[1397]214minimized set of three-address bitwise operations to compute each of
215the character classes from the basis bit streams.
[1354]216
[1397]217
[1393]218For example, Figure \ref{fig:CCC} shows the input and output produced
219by the character class compiler for the example of \verb`[0-9]`
220discussed in the previous section.  The output operations may be
221viewed as operations on a single block of input at a time, or may be
222viewed as operations on unbounded bit streams as supported by the Pablo
223compiler.
[1354]224
[1384]225\begin{figure}[tbh]
[1354]226
[1302]227\begin{center}
[1354]228
229\begin{tabular}{r l}
[1377]230\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
231
[1354]232\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
233& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
234& \verb`temp3 = (temp2 &~ temp1)` \\
235& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
236& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
237& \verb`digit = (temp3 &~ temp5)`
[1302]238\end{tabular}
[1354]239
[1302]240\end{center}
[1354]241
242\caption{Character Class Compiler Input/Output}
243\label{fig:CCC}
[1302]244\end{figure}
245
[1384]246\begin{figure}[tbh]
[1302]247
[1374]248\begin{center}
249
250\begin{tabular}{r l}
[1377]251\ttfamily{\bfseries INPUT:} 
252& \verb`def parse_tags(classes, errors):` \\
253& \verb`  classes.C0 = Alpha` \\
254& \verb`  classes.C1 = Rangle` \\
255& \verb`  classes.C2 = Langle` \\
256& \verb`  L0 = bitutil.Advance(C2)` \\
257& \verb`  errors.E0 = L0 &~ C0` \\
258& \verb`  L1 = bitutil.ScanThru(L0, C0)` \\
259& \verb`  errors.E1 = L1 &~ C1` \\ \\
[1374]260
[1377]261\ttfamily{\bfseries OUTPUT:} 
262& \verb`struct Parse_tags {` \\
263& \verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
[1374]264& \verb`  void do_block(Classes & classes, Errors & errors) {` \\
[1377]265& \verb`    BitBlock L0, L1;` \\
266& \verb`    classes.C0 = Alpha;` \\
267& \verb`    classes.C1 = Rangle;` \\
268& \verb`    classes.C2 = Langle;` \\
269& \verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
270& \verb`    errors.E0 = simd_andc(L0, C0);` \\
271& \verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
272& \verb`    errors.E1 = simd_andc(L1, C1);` \\
273& \verb`    CarryQ_Adjust(carryQ, 2);` \\
[1374]274& \verb}` \\
275& \verb`  CarryDeclare(carryQ, 2);` \\
[1377]276& \verb`};` \\
[1374]277\end{tabular}
278
279\end{center}
280
281\caption{Parallel Block Compiler (Pablo) Input/Output}
282\label{fig:Pablo}
283\end{figure}
284
[1397]285
286The Pablo compiler abstracts away the details of
287programming parallel bit stream code in terms of finite
288SIMD register widths and application buffer sizes.
289Input to Pablo is a language for expressing bit stream operations
290on unbounded bit streams.  The operations include bitwise
[1417]291logic, the {\tt Advance} and {\tt ScanThru} operations described in the
[1397]292previous subsection as well as if and while control structures.
293Pablo translates these operations to block-at-a-time
294code in C/C++.
295
[1384]296%, where the block size is the register width for SIMD operations on the selected target architecture.
297The key functionality of Pablo is to arrange for block-to-block
[1397]298carry bit propagation to implement the long bit stream shift
[1384]299and addition operations required by
300{\tt Advance} and {\tt ScanThru}.
[1374]301
[1397]302For example, we can translate the simple parsing example
303of \ref{fig:ParabixParsingExample} above into Pablo code
304to produce the output as shown in Figure \ref{fig:Pablo}.
305In this example, Pablo has the primary responsibility of inserting
306carry variable declarations that allow the results of
307{\tt Advance} and {\tt ScanThru} operations to be carried over from
308block to block.  A separate carry variable is required for every
309{\tt Advance} or {\tt ScanThru} operation.  A function containing
310such operations is translated into a public C++ class (struct),
311which includes a Carry Queue to hold all the carry
312variables from iteration to iteration, together with the
313a method {\tt do\_block} to implement the processing
314for a single block (based on the SIMD register width).
315Macros {\tt CarryDeclare} and {\tt CarryInit} declare and
316initialize the Carry Queue structure depending on the
317specific architecture and Carry Queue representation.
318The unbounded bit stream {\tt Advance} and {\tt ScanThru}
319operations are translated into block-by-block equivalents
320with explicit carry-in and carry-out processing. 
321At the end of each block, the {\tt CarryQ\_Adjust}
322operation implements any necessary adjustment of the
323Carry Queue to prepare for the next iteration.
324The Pablo language and compiler also support conditional
325and iterative bit stream logic on unbounded streams
326(if and while constructs) which involves additional
327carry-test insertion in control branches.
328Explaining the full details of the translation
329is beyond the scope of this paper.
[1384]330
[1354]331\subsection{Parabix Run-Time Libraries}
[1302]332
[1393]333The Parabix architecture also includes run-time libraries that support
334a machine-independent view of basic SIMD operations, as well as a set
335of core function libraries.  For machine-independence, we program all
336operations using an abstract SIMD machine.  The abstract machine
337supports all power-of-2 field widths up to the full SIMD register
338width on a target machine.  Let $w = 2k$ be the field width in
339bits. Let $f$ be a basic binary operation defined on $w$-bit
340quantities producing an $w$-bit result. Let $W$ be the SIMD vector
341size in bits where $W = 2K$.  Then the C++ template notation
342\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
343SIMD operation yielding an output SIMD vector $v$, given two input
344SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
345computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
346\verb`simd<8>::add(a,b)` represents the simultaneous addition of
347sixteen 8-bit fields.
[1381]348
[1407]349We have ported parabix to a wide variety of processor architectures
350demonstrating its applicability to commodity SIMD hardware. We
351currently take advantage of the 128-bit Altivec operations on the
352Power PC, 64-bit MMX and 128-bit SSE operations on previous generation
353Intel platforms, the latest 256-bit AVX extensions on the Sandybridge
354processor, and finally the 128-bit \NEON{} operations on ARM.
[1397]355
Note: See TracBrowser for help on using the repository browser.