source: docs/HPCA2012/03-research.tex @ 1354

Last change on this file since 1354 was 1354, checked in by ksherdy, 8 years ago

first revision of section 3

File size: 20.7 KB
Line 
1\section{Parabix}
2\label{section:parabix}
3
4%Describe key technology behind Parabix
5%Introduce SIMD;
6%Talk about SSE
7%Highlight which SSE instructions are important
8%TAlk about each pass in the parser; How SSE is used in every phase...
9%Benefits of SSE in each phase.
10
11
12% Extract section 2.2 and merge into 3.   Add a new subsection
13% in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
14% operations and then mention the pack operations that allow
15% us to implement transposition efficiently in parallel. 
16% Also note that the SIMD registers support bitwise logic across
17% their full width and that this is extensively used in our work.
18%
19% Also, it could be good to have a small excerpt of a byte-at-a-time
20% scanning loop for XML, e.g., extracted from Xerces in section 2.1. 
21% Just a few lines showing the while loop - Linda can tell you the file.
22%
23
24% This section focuses on the
25
26% With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position
27% within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit,
28% 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and
29% shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel
30% \cite{CameronLin2009}.
31
32% The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0.
33% It is our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte.
34% This expectation is based on the relatively-branchless code composition of Parabix2 and the more-efficient utilization of last-level cache resources.
35% The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption.
36% Hence, one of the foci in our study is the manner in which straight line SIMD code influences energy usage.
37
38This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}
39In Section {\bf 4}, we discuss one of its implementations, \emph{Parabix-XML}.
40A more comprehensive study of Parabix and Parabix-XML, can be found in the technical report
41``Parallel Parsing with Bitstream Addition: An XML Case Study'' \cite{Cameron2010}.
42
43The fundemental difference between the Parabix framework and traditional text processing models is in how
44Parabix represents the source data.
45Rather than viewing it as a stream of bytes or characters and performing per-byte comparisons to
46determine when to transition between parsing states, Parabix views data as a set of \emph{bit streams} and uses a mixture of
47boolean-logic and integer-based math on those streams to effectively parse many bytes in parallel.
48In this paper, we use the notation $\land$, $\lor$ and $\lnot$ to denote the AND, OR and NOT boolean operations.
49Bit streams are the foundation of the Parabix technology. In order to understand how it is possible to
50perform complex state-transition-based text processing operations in parallel, understanding what bit streams are,
51how they are created, and --- more importantly --- how they are used by Parabix, is critical.
52
53% Our first generation parallel bitstream XML parser---Parabix1---uses employs a less conventional approach of
54% SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one-correspondence
55% with the bytes of a character stream. A transposition step first transforms sequential byte stream data into
56% eight basis bitstreams for the bits of each byte.
57
58% Parabix1 is functionally equivalent to a traditional XML parser.
59% That is, Parabix1 moves a single cursor sequentially through the source document and
60% parses it to distinguish between (and properly validate) the markup and content.
61% Where Parabix1 differs from traditional parsers is that instead of processing a document a byte-at-a-time, it
62% transforms the document into a set of \emph{bit streams} (explained shortly) and then performs boolean-logic operations
63% on those bit streams to process many bytes in parallel.
64
65
66%Each $1$-bit marks the postion of each key character in the
67%corresponding source data stream. A single stream is generated for each of the key markup characters.
68
69\subsection{What are Bit Streams?}
70
71A bit stream is simply a sequence of $\tt 1$s and $\tt 0$s.
72The significance of each bit value is dependent on the type of bit stream.
73We view bit streams in one of two ways: $n$ 1-bit values for boolean-logic operations and 1 $n$-bit value for integer-based math.
74For simplicity, assume that $n$ is infinitely long w.r.t. the size of the source data. In reality, each bit stream is divided into
75a series of $w$-bit blocks, where $w$ is equal to the width of the SIMD registers within the system
76(e.g., 64-bits for MMX, 128-bits for SSE/NEON, and 256-bits for AVX).
77We discuss how these $\frac{n}{w}$ bit stream segments can be used as infinitely-long bit stream in Section \ref{parallel bit stream compilation}.
78
79The first type of bit streams used in Parabix are refered to as \emph{basis bit streams}, which contain the input source data.
80Parabix uses the basis bit streams to construct \emph{character-class bit streams} in which each $\tt 1$ bit indicates the presense
81of a significant character (or class of characters) in the parsing process.
82Character-class bit streams are used to compute \emph{lexical bit streams} and \emph{error bit streams}, which
83Parabix uses to process and validate the source document. The remainder of this section will discuss each type of bit stream.
84
85{\bf Basis Bit Streams:}
86To construct the basis bit streams, the source data is first loaded in
87sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
88so that Parabix can efficiently produce the character-class bit streams.
89Essentially, when the source data is in basis bit stream form, the $k$-th bit of $i$-th character in the source
90text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$.
91Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
92of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
93The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
94a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
95as UNICODE, may use a multiple code units to express all of its possible code points.
96How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
97The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
98128 code points within it.
99In Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
100streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
101
102\begin{figure}[h]
103
104\begin{center}
105\begin{tabular}{r c c c c }
106STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
107ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
108\hline
109\end{tabular}
110\end{center}
111\begin{center}
112\begin{tabular}{r |c |c |c |c |c |c |c |c |}
113 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
114 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
115 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
116 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
117 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
118\end{tabular}
119\end{center}
120
121
122\caption{Example 7-bit ASCII Basis Bit Streams}
123\label{fig:BitstreamsExample}
124\end{figure}
125
126
127% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
128% Basis bit streams
129% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
130% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
131% Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit
132% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
133% by periods as to emphasize the $1$ bits.
134
135{\bf Character-class Bit Streams:}
136Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between
137data and metadata parsing.
138% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
139For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
140Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set
141of known code points and branching appropriately when one is found, typically using an if or switch statement.
142Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
143to do so correctly.
144% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
145
146Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel in a single operation (using Intel SSE/ARM NEON)
147by using a series of boolean-logic operations to merge multiple basis bit streams into
148a single character-class stream that marks the positions of key characters with a $1$.
149For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
150Classes of characters can be found with similar formulas.
151For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
152An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute
153than individual characters.
154Finding optimal solutions to all character-classes is non-trivial and goes beyond the scope of this paper.
155
156% The advantage of character-class bit streams over traditional ``byte streams'' is that
157
158% Rather than testing each byte individually,
159% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
160
161
162{\bf Lexical and Error Bit Streams:}
163To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
164a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
165Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
166Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
167issues found during the parsing process. The presense of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be
168trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity
169of the error.
170
171To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.
172The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
173and returns $c + c$ --- effectively moves each cursor one position to the ``right''.
174$\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
175$\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$.
176For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
177instances of it in the source text.
178%We know from this statement that only strings starting with \verb`<` that contain at least one letter and end with a \verb`>` are matches.
179We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
180$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
181token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
182the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
183token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
184With additional post bit-stream processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can be removed or ignored; the details
185of which go beyond the scope of this paper.
186
187\begin{figure}[h]
188
189\begin{center}
190\begin{tabular}{lclr}
191\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
192$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
193$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
194$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
195$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
196$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
197$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
198$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
199\end{tabular}
200\end{center}
201
202
203\caption{Lexical Parsing in Parabix}
204\label{fig:ParabixParsingExample}
205\end{figure}
206
207% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
208%
209% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
210% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
211% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
212% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
213% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
214% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
215% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
216% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
217% details, refer to the technical report \cite{Cameron2010}.
218
219Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
220each parsing position are mostly eliminated, which, as Section XXX shows, minimizes branch misprediction penalties.
221Accurate parsing and parallel lexical analysis is done through processor-friendly equations that requires no speculation nor multithreading.
222
223\subsection{Parallel Bit Stream Compilation}
224\label {parallel bit stream compilation}
225
226While the description of parallel bit stream parsing in the previous section works conceptually on
227infinitely-long bit streams, in practice, the implementation of a Parabix parser must process input streams as bit stream blocks
228whose size is equal to the SIMD register width of the target processor.
229In our work, we leverage the unbounded integer type of the Python programming language to verify our equations
230against unbounded bit streams.
231Using a restricted subset of Python, we prototype and validate the functionality of our applications,
232such as CSV parsing, XML validation, and UTF-8 to UTF-16 transcoding, etc.
233To transform the Python code into the equivalent block-at-a-time C/C++ code,
234the key question becomes how to efficiently transfer information from one bit stream block to the next whenever the result of an operation
235crosses a block boundary.
236
237The answer lies in carry bit propagation. Since the parallel ScanThru operation relies solely on bit-wise addition and logical operations,
238block-to-block information transfer is captured in entirety by the carry bit associated with each underlying addition operation.
239Boolean-logical operations do not pass information over block boundaries since each bit value is independent of the state of its surrounding bits.
240Properly determining, initializing and inserting carry bits into a block-by-block implementation is tedious and error prone.
241Thus we have developed compiler technology to automatically transform parallel bit stream
242Python code to block-at-a-time C/C++ implementations. Details are beyond the scope of this paper, but are described in the on-line
243source code repository at {\tt http://parabix.costar.sfu.ca}.
244
245\subsection{Parabix Tool Chain}
246
247To support the Parabix framework, we are developing tool technology to automate various aspects
248of parallel bit stream programming. At present, our tool chain consists of two compilers: a character class
249compiler (ccc) and a unbounded bit stream to C/C++ block-at-a-time processing compiler (Pablo).
250
251The character class compiler is used to automatically produce bit stream logic for all the individual characters (e.g.,
252delimiters) and character classes (e.g., digits, letters) used in a particular application.
253Input is specified using a character class syntax adapted from the standard regular expression notations.
254Output is a minimized set of three-address bitwise operations, such as $a = b~\&~c$,
255to compute each of the character classes from the basis bit streams.
256
257For example, Figure \ref{fig:CCC} shows the input and output produced by the
258character class compiler for the example of \verb`[0-9]`
259discussed in the previous section.  The output operations
260may be viewed as operations on a single block of input
261at a time, or may be viewed as operations on unbounded bitstreams
262as supported by the Pablo compiler.
263
264\begin{figure}[h]
265
266\begin{center}
267
268\begin{tabular}{r l}
269\ttfamily{\bfseries INPUT:} & \verb`digit = [<]` \\
270\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
271& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
272& \verb`temp3 = (temp2 &~ temp1)` \\
273& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
274& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
275& \verb`digit = (temp3 &~ temp5)`
276\end{tabular}
277
278\end{center}
279
280\caption{Character Class Compiler Input/Output}
281\label{fig:CCC}
282\end{figure}
283
284Pablo is a compiler that abstracts away the details of
285programming parallel bit stream code in terms of finite
286SIMD register widths and application buffer sizes.  Input
287to Pablo is a language for expressing bitstream operations
288on unbounded bitstreams.  The operations include bitwise
289logic, the {\tt Advance} and {\tt ScanThru} operations described in he
290previous subsection as well as if and while control structures.
291Pablo translates these operations to block-at-a-time
292code in C/C++, where the block size is the register
293width for SIMD operations on the selected target architecture.
294
295
296\subsection{Parabix Run-Time Libraries}
297
298The Parabix architecture also includes run-time libraries
299that support a machine-independent view of basic SIMD
300operations, as well as a set of core function libraries.
301For machine-independence, we program all operations using
302an abstract SIMD machine based on the Inductive Doubling
303Instruction Set Architecture (IDISA) \cite{CameronLin2009}.
304Originally developed for 128-bit Altivec operations on Power PC
305as well as 64-bit MMX and 128-bit SSE operations on Intel,
306we have recently extended our library support to include
307the new 256-bit AVX operations on Intel as well as the 128-bit
308NEON operations on the ARM architecture. Further details
309are provided in later sections.
310 
Note: See TracBrowser for help on using the repository browser.