source: docs/HPCA2012/03-research.tex @ 1382

Last change on this file since 1382 was 1381, checked in by lindanl, 8 years ago

SIMD vertical op syntax and minor fixes

File size: 17.9 KB
1\section{The Parabix Framework}
4This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}.
5The framework has three components: a unifying architectural view of text processing in terms
6of parallel bit streams, a tool chain for automating the generation of parallel bit stream
7code from higher-level specifications, and a run-time environment providing a portable
8SIMD programming abstraction, independent of the specific facilities available
9on particular target architectures.
12\subsection{Parallel Bit Streams}
14The fundamental difference between the Parabix framework and traditional text processing models is in how
15Parabix represents the source data.   Given a traditional byte-oriented text stream, Parabix
16first transposes the text data to a transform domain consisting of 8 parallel bit streams, known
17as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream
18of $k$-th bit of each byte in the source text.  That is, the $k$-th bit of $i$-th byte in the source
19text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$.
20For example, in Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
21streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
26\begin{tabular}{r c c c c }
27STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
28ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
33\begin{tabular}{r |c |c |c |c |c |c |c |c |}
34 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
35 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
36 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
37 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
38 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
43\caption{Example 7-bit ASCII Basis Bit Streams}
47The advantage of the parallel bit stream representation is that we can
48use the 128-bit SIMD registers commonly found on commodity processors (e.g. SSE on
49Intel) to process 128 byte positions at a time using bitwise logic, shifting and
50other operations.
52Just as forward and inverse Fourier transforms are used to transform
53between the time and frequency domains in signal processing, bit stream
54transposition and inverse transposition provides ``byte space'' and ``bit space''
55views of text.  The goal of the Parabix framework is to support efficient
56text processing using these two equivalent representations in the same
57way that efficient signal processing benefits from the use of the frequency
58domain in some cases and the time domain in others.
60In the Parabix framework, basis bit streams are used as the starting
61point to determine other bit streams.   In particular, Parabix uses the basis bit streams to
62construct \emph{character-class bit streams} in which each $\tt 1$ bit indicates the presense
63of a significant character (or class of characters) in the parsing process.
64Character-class bit streams may then be used to compute \emph{lexical bit streams} and \emph{error bit streams}, which
65Parabix uses to process and validate the source document. The remainder of this section will discuss each type of bit stream.
67{\bf Basis Bit Streams:}
68To construct the basis bit streams, the source data is first loaded in
69sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
70so that Parabix can efficiently produce the character-class bit streams.
71Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
72of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
73%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
74%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
75%as UNICODE, may use a multiple code units to express all of its possible code points.
76%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
77%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
78%128 code points within it.
80% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
81% Basis bit streams
82% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
83% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
84% Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit
85% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
86% by periods as to emphasize the $1$ bits.
88{\bf Character-class Bit Streams:}
89Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between
90data and metadata parsing.
91% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
92For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
93Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set
94of known code points and branching appropriately when one is found, typically using an if or switch statement.
95Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
96to do so correctly.
97% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
99Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation
100by using a series of boolean-logic operations
101\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.} 
102to merge multiple basis bit streams into
103a single character-class stream that marks the positions of key characters with a $1$.
104For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
105Classes of characters can be found with similar formulas.
106For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
107An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute
108than individual characters.
109Finding optimal solutions to all character-classes is non-trivial and goes beyond the scope of this paper.
111% The advantage of character-class bit streams over traditional ``byte streams'' is that
113% Rather than testing each byte individually,
114% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
117{\bf Lexical and Error Bit Streams:}
118To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
119a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
120Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
121Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
122issues found during the parsing process. The presense of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be
123trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity
124of the error.
126To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.
127The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
128and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
129$\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
130$\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$
131For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
132instances of it in the source text.
133We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
134$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
135token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
136the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
137token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
138With additional post bit-stream processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can be removed or ignored; the details
139of which go beyond the scope of this paper.
145\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
146$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
147$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
148$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
149$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
150$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
151$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
152$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
157\caption{Lexical Parsing in Parabix}
161% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
163% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
164% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
165% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
166% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
167% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
168% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
169% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
170% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
171% details, refer to the technical report \cite{Cameron2010}.
173Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
174each parsing position are mostly eliminated, which, as Section \ref{section:XML-branches} shows, minimizes branch misprediction penalties.
175Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading.
178\subsection{Parabix Tool Chain}
179\label{parabix tool chain}
181To support the Parabix framework, we are developing tool technology to automate various aspects
182of parallel bit stream programming. At present, our tool chain consists of two compilers: a character class
183compiler (ccc) and a unbounded bit stream to C/C++ block-at-a-time processing compiler (Pablo).
185The character class compiler is used to automatically produce bit stream logic for all the individual characters (e.g.,
186delimiters) and character classes (e.g., digits, letters) used in a particular application.
187Input is specified using a character class syntax adapted from the standard regular expression notations.
188Output is a minimized set of three-address bitwise operations, such as $a = b~\&~c$,
189to compute each of the character classes from the basis bit streams.
191For example, Figure \ref{fig:CCC} shows the input and output produced by the
192character class compiler for the example of \verb`[0-9]`
193discussed in the previous section.  The output operations
194may be viewed as operations on a single block of input
195at a time, or may be viewed as operations on unbounded bitstreams
196as supported by the Pablo compiler.
202\begin{tabular}{r l}
203\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\ \\
205\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
206& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
207& \verb`temp3 = (temp2 &~ temp1)` \\
208& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
209& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
210& \verb`digit = (temp3 &~ temp5)`
215\caption{Character Class Compiler Input/Output}
219The Pablo compiler abstracts away the details of
220programming parallel bit stream code in terms of finite
221SIMD register widths and application buffer sizes.
222Input to Pablo is a language for expressing bitstream operations on unbounded bitstreams. 
223The operations include bitwise
224logic, the {\tt Advance} and {\tt ScanThru} operations described in he
225previous subsection as well as if and while control structures.
226Pablo translates these operations to block-at-a-time
227code in C/C++.
228%, where the block size is the register width for SIMD operations on the selected target architecture.
229The key functionality of Pablo is to arrange for block-to-block
230carry bit propagation to implement the long bitstream shift
231and addition operations required by
232{\tt Advance} and {\tt ScanThru}.
234For example, we can translate the simple parsing example
235of \ref{fig:ParabixParsingExample} above into Pablo code
236to produce the output as shown in Figure \ref{fig:Pablo}.
237In this example, Pablo has the primary responsibility of inserting
238carry variable declarations that allow the results of
239Advance and ScanThru operations to be carried over from
240block to block.  Explaining the full details of the translation
241is beyond the scope of this paper, however.
247\begin{tabular}{r l}
248\ttfamily{\bfseries INPUT:} 
249& \verb`def parse_tags(classes, errors):` \\
250& \verb`  classes.C0 = Alpha` \\
251& \verb`  classes.C1 = Rangle` \\
252& \verb`  classes.C2 = Langle` \\
253& \verb`  L0 = bitutil.Advance(C2)` \\
254& \verb`  errors.E0 = L0 &~ C0` \\
255& \verb`  L1 = bitutil.ScanThru(L0, C0)` \\
256& \verb`  errors.E1 = L1 &~ C1` \\ \\
258\ttfamily{\bfseries OUTPUT:} 
259& \verb`struct Parse_tags {` \\
260& \verb`  Parse_tags() { CarryInit(carryQ, 2); }` \\
261& \verb`  void do_block(Classes & classes, Errors & errors) {` \\
262& \verb`    BitBlock L0, L1;` \\
263& \verb`    classes.C0 = Alpha;` \\
264& \verb`    classes.C1 = Rangle;` \\
265& \verb`    classes.C2 = Langle;` \\
266& \verb`    L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
267& \verb`    errors.E0 = simd_andc(L0, C0);` \\
268& \verb`    L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
269& \verb`    errors.E1 = simd_andc(L1, C1);` \\
270& \verb`    CarryQ_Adjust(carryQ, 2);` \\
271& \verb}` \\
272& \verb`  CarryDeclare(carryQ, 2);` \\
273& \verb`};` \\
278\caption{Parallel Block Compiler (Pablo) Input/Output}
283\subsection{Parabix Run-Time Libraries}
285The Parabix architecture also includes run-time libraries
286that support a machine-independent view of basic SIMD
287operations, as well as a set of core function libraries.
288For machine-independence, we program all operations using
289an abstract SIMD machine based on the Inductive Doubling
290Instruction Set Architecture (IDISA) \cite{CameronLin2009}.
291The abstract machine supports all power-of-2 field widths up to the full
292SIMD register width on a target machine.   
293Let $w = 2k$ be the field width in bits. Let $f$ be a basic binary operation defined on $w$-bit quantities
294producing an $w$-bit result. Let $W$ be the SIMD vector size in bits where $W = 2K$.
295Then the C++ template notation \verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical SIMD operation yielding an output SIMD vector $v$,
296given two input SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value computed is $f(a_i, b_i)$.
297For example, given 128-bit SIMD vectors, \verb`simd<8>::add(a,b)` represents the simultaneous addition of sixteen 8-bit fields.
299These operations were originally developed for 128-bit Altivec operations on Power PC
300as well as 64-bit MMX and 128-bit SSE operations on Intel
301but have recently extended to support
302the new 256-bit AVX operations on Intel as well as the 128-bit
303NEON operations on the ARM architecture.
Note: See TracBrowser for help on using the repository browser.