source: docs/HPCA2012/03-research.tex @ 1374

Last change on this file since 1374 was 1374, checked in by cameron, 8 years ago

Section 3 and other fixes

File size: 19.4 KB
Line 
1\section{The Parabix Framework}
2\label{section:parabix}
3
4%Describe key technology behind Parabix
5%Introduce SIMD;
6%Talk about SSE
7%Highlight which SSE instructions are important
8%TAlk about each pass in the parser; How SSE is used in every phase...
9%Benefits of SSE in each phase.
10
11
12% Extract section 2.2 and merge into 3.   Add a new subsection
13% in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
14% operations and then mention the pack operations that allow
15% us to implement transposition efficiently in parallel. 
16% Also note that the SIMD registers support bitwise logic across
17% their full width and that this is extensively used in our work.
18%
19% Also, it could be good to have a small excerpt of a byte-at-a-time
20% scanning loop for XML, e.g., extracted from Xerces in section 2.1. 
21% Just a few lines showing the while loop - Linda can tell you the file.
22%
23
24% This section focuses on the
25
26% With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position
27% within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit,
28% 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and
29% shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel
30% \cite{CameronLin2009}.
31
32% The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0.
33% It is our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte.
34% This expectation is based on the relatively-branchless code composition of Parabix2 and the more-efficient utilization of last-level cache resources.
35% The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption.
36% Hence, one of the foci in our study is the manner in which straight line SIMD code influences energy usage.
37
38This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}.
39The framework has three components: a unifying architectural view of text processing in terms
40of parallel bit streams, a tool chain for automating the generation of parallel bit stream
41code from higher-level specifications, and a run-time environment providing a portable
42SIMD programming abstraction, independent of the specific facilities available
43on particular target architectures.
44
45
46\subsection{Parallel Bit Streams}
47
48The fundamental difference between the Parabix framework and traditional text processing models is in how
49Parabix represents the source data.   Given a traditional byte-oriented text stream, Parabix
50first transposes the text data to a transform domain consisting of 8 parallel bit streams, known
51as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream
52of $k$th bit of each byte in the source text.  That is, the $k$th bit of $i$th byte in the source
53text is in the $i$th (bit) position of the $k$th basis bit stream, $b_{k}$.
54For example, in Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
55streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
56
57\begin{figure}[h]
58
59\begin{center}
60\begin{tabular}{r c c c c }
61STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
62ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
63\hline
64\end{tabular}
65\end{center}
66\begin{center}
67\begin{tabular}{r |c |c |c |c |c |c |c |c |}
68 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
69 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
70 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
71 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
72 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
73\end{tabular}
74\end{center}
75
76
77\caption{Example 7-bit ASCII Basis Bit Streams}
78\label{fig:BitstreamsExample}
79\end{figure}
80
81The advantage of the parallel bit stream representation is that we can
82use the 128-bit SIMD registers commonly found on commodity processors (e.g. SSE on
83Intel) to process 128 byte positions at a time using bitwise logic, shifting and
84other operations.
85
86Just as forward and inverse Fourier transforms are used to transform
87between the time and frequency domains in signal processing, bit stream
88transposition and inverse transposition provides ``byte space'' and ``bit space''
89views of text.  The goal of the Parabix framework is to support efficient
90text processing using these two equivalent representations in the same
91way that efficient signal processing benefits from the use of the frequency
92domain in some cases and the time domain in others.
93
94In the Parabix framework, basis bit streams are used as the starting
95point to determine other bit streams.   In particular, Parabix uses the basis bit streams to
96construct \emph{character-class bit streams} in which each $\tt 1$ bit indicates the presense
97of a significant character (or class of characters) in the parsing process.
98Character-class bit streams may then be used to compute \emph{lexical bit streams} and \emph{error bit streams}, which
99Parabix uses to process and validate the source document. The remainder of this section will discuss each type of bit stream.
100
101{\bf Basis Bit Streams:}
102To construct the basis bit streams, the source data is first loaded in
103sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
104so that Parabix can efficiently produce the character-class bit streams.
105Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
106of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
107%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
108%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
109%as UNICODE, may use a multiple code units to express all of its possible code points.
110%How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
111%The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
112%128 code points within it.
113
114% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
115% Basis bit streams
116% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
117% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
118% Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit
119% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
120% by periods as to emphasize the $1$ bits.
121
122{\bf Character-class Bit Streams:}
123Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between
124data and metadata parsing.
125% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
126For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
127Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set
128of known code points and branching appropriately when one is found, typically using an if or switch statement.
129Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
130to do so correctly.
131% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
132
133Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation
134by using a series of boolean-logic operations to merge multiple basis bit streams into
135a single character-class stream that marks the positions of key characters with a $1$.
136For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
137Classes of characters can be found with similar formulas.
138For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
139An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute
140than individual characters.
141Finding optimal solutions to all character-classes is non-trivial and goes beyond the scope of this paper.
142
143% The advantage of character-class bit streams over traditional ``byte streams'' is that
144
145% Rather than testing each byte individually,
146% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
147
148
149{\bf Lexical and Error Bit Streams:}
150To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
151a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
152Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
153Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
154issues found during the parsing process. The presense of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be
155trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity
156of the error.
157
158To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.
159The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
160and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
161$\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
162$\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$
163For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
164instances of it in the source text.
165%We know from this statement that only strings starting with \verb`<` that contain at least one letter and end with a \verb`>` are matches.
166We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
167$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
168token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
169the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
170token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
171With additional post bit-stream processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can be removed or ignored; the details
172of which go beyond the scope of this paper.
173
174\begin{figure}[h]
175
176\begin{center}
177\begin{tabular}{lclr}
178\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
179$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
180$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
181$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
182$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
183$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
184$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
185$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
186\end{tabular}
187\end{center}
188
189
190\caption{Lexical Parsing in Parabix}
191\label{fig:ParabixParsingExample}
192\end{figure}
193
194% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
195%
196% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
197% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
198% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
199% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
200% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
201% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
202% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
203% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
204% details, refer to the technical report \cite{Cameron2010}.
205
206Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
207each parsing position are mostly eliminated, which, as Section XXX shows, minimizes branch misprediction penalties.
208Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading.
209
210
211\subsection{Parabix Tool Chain}
212\label{parabix tool chain}
213
214To support the Parabix framework, we are developing tool technology to automate various aspects
215of parallel bit stream programming. At present, our tool chain consists of two compilers: a character class
216compiler (ccc) and a unbounded bit stream to C/C++ block-at-a-time processing compiler (Pablo).
217
218The character class compiler is used to automatically produce bit stream logic for all the individual characters (e.g.,
219delimiters) and character classes (e.g., digits, letters) used in a particular application.
220Input is specified using a character class syntax adapted from the standard regular expression notations.
221Output is a minimized set of three-address bitwise operations, such as $a = b~\&~c$,
222to compute each of the character classes from the basis bit streams.
223
224For example, Figure \ref{fig:CCC} shows the input and output produced by the
225character class compiler for the example of \verb`[0-9]`
226discussed in the previous section.  The output operations
227may be viewed as operations on a single block of input
228at a time, or may be viewed as operations on unbounded bitstreams
229as supported by the Pablo compiler.
230
231\begin{figure}[h]
232
233\begin{center}
234
235\begin{tabular}{r l}
236\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\
237\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
238& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
239& \verb`temp3 = (temp2 &~ temp1)` \\
240& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
241& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
242& \verb`digit = (temp3 &~ temp5)`
243\end{tabular}
244
245\end{center}
246
247\caption{Character Class Compiler Input/Output}
248\label{fig:CCC}
249\end{figure}
250
251Pablo is a compiler that abstracts away the details of
252programming parallel bit stream code in terms of finite
253SIMD register widths and application buffer sizes.  Input
254to Pablo is a language for expressing bitstream operations
255on unbounded bitstreams.  The operations include bitwise
256logic, the {\tt Advance} and {\tt ScanThru} operations described in he
257previous subsection as well as if and while control structures.
258Pablo translates these operations to block-at-a-time
259code in C/C++.
260%, where the block size is the register width for SIMD operations on the selected target architecture.
261The key functionality of Pablo is to arrange for block-to-block
262carry bit propagation to implement the long bitstream shift
263and addition operations required by
264{\tt Advance} and {\tt ScanThru}.
265
266For example, we can translate the simple parsing example
267of \ref{fig:ParabixParsingExample} above into Pablo code
268to produce the output as shown in Figure \ref{fig:Pablo}.
269In this example, Pablo has the primary responsibility of inserting
270carry variable declarations that allow the results of
271Advance and ScanThru operations to be carried over from
272block to block.  Explaining the full details of the translation
273is beyond the scope of this paper, however.
274
275\begin{figure}[h]
276
277\begin{center}
278
279\begin{tabular}{r l}
280\ttfamily{\bfseries INPUT:} & \verb`def parse_tags(classes, errors):` \\
281& \verb`        classes.C0 = Alpha` \\
282& \verb`        classes.C1 = Rangle` \\
283& \verb`        classes.C2 = Langle` \\
284& \verb`        L0 = bitutil.Advance(C2)` \\
285& \verb`        errors.E0 = L0 &~ C0` \\
286& \verb`        L1 = bitutil.ScanThru(L0, C0)` \\
287& \verb`        errors.E1 = L1 &~ C1` \\ \\
288
289\ttfamily{\bfseries OUTPUT:} & \verb`struct Parse_tags {` \\
290& \verb`  Parse_tags() { ` \\
291& \verb`  CarryInit(carryQ, 2); }` \\
292& \verb`  void do_block(Classes & classes, Errors & errors) {` \\
293& \verb`                BitBlock L0, L1;` \\
294& \verb`        classes.C0 = Alpha;` \\
295& \verb`        classes.C1 = Rangle;` \\
296& \verb`        classes.C2 = Langle;` \\
297& \verb`        L0 = BitBlock_advance_ci_co(C2, carryQ, 0);` \\
298& \verb`        errors.E0 = simd_andc(L0, C0);` \\
299& \verb`        L1 = BitBlock_scanthru_ci_co(L0, C0, carryQ, 1);` \\
300& \verb`        errors.E1 = simd_andc(L1, C1);` \\
301& \verb`        CarryQ_Adjust(carryQ, 2);` \\
302& \verb}` \\
303& \verb`  CarryDeclare(carryQ, 2);` \\
304& \verb};` \\
305\end{tabular}
306
307\end{center}
308
309\caption{Parallel Block Compiler (Pablo) Input/Output}
310\label{fig:Pablo}
311\end{figure}
312
313
314\subsection{Parabix Run-Time Libraries}
315
316The Parabix architecture also includes run-time libraries
317that support a machine-independent view of basic SIMD
318operations, as well as a set of core function libraries.
319For machine-independence, we program all operations using
320an abstract SIMD machine based on the Inductive Doubling
321Instruction Set Architecture (IDISA) \cite{CameronLin2009}.
322These operations were originally developed for 128-bit Altivec operations on Power PC
323as well as 64-bit MMX and 128-bit SSE operations on Intel
324but have recently extended to support
325the new 256-bit AVX operations on Intel as well as the 128-bit
326NEON operations on the ARM architecture.
327 
Note: See TracBrowser for help on using the repository browser.