source: docs/HPCA2012/03-research.tex @ 1368

Last change on this file since 1368 was 1368, checked in by ksherdy, 8 years ago

large changes to section 2; minor edits to section 3

File size: 19.1 KB
4%Describe key technology behind Parabix
5%Introduce SIMD;
6%Talk about SSE
7%Highlight which SSE instructions are important
8%TAlk about each pass in the parser; How SSE is used in every phase...
9%Benefits of SSE in each phase.
12% Extract section 2.2 and merge into 3.   Add a new subsection
13% in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
14% operations and then mention the pack operations that allow
15% us to implement transposition efficiently in parallel. 
16% Also note that the SIMD registers support bitwise logic across
17% their full width and that this is extensively used in our work.
19% Also, it could be good to have a small excerpt of a byte-at-a-time
20% scanning loop for XML, e.g., extracted from Xerces in section 2.1. 
21% Just a few lines showing the while loop - Linda can tell you the file.
24% This section focuses on the
26% With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position
27% within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit,
28% 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and
29% shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel
30% \cite{CameronLin2009}.
32% The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0.
33% It is our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte.
34% This expectation is based on the relatively-branchless code composition of Parabix2 and the more-efficient utilization of last-level cache resources.
35% The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption.
36% Hence, one of the foci in our study is the manner in which straight line SIMD code influences energy usage.
38This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}
39In Section {\bf 4}, we discuss one of its implementations, \emph{Parabix-XML}.
40A more comprehensive study of Parabix and Parabix-XML, can be found in the technical report
41``Parallel Parsing with Bitstream Addition: An XML Case Study'' \cite{Cameron2010}.
43The fundemental difference between the Parabix framework and traditional text processing models is in how
44Parabix represents the source data.
45Rather than viewing it as a stream of bytes or characters and performing per-byte comparisons to
46determine when to transition between parsing states, Parabix views data as a set of \emph{bit streams} and uses a mixture of
47boolean-logic\footnote{In this paper, we use the notation $\land$, $\lor$ and $\lnot$ to denote the AND, OR and NOT boolean operations.} and integer-based math on those streams to effectively parse many bytes in parallel.
48Bit streams are the foundation of the Parabix technology. In order to understand how it is possible to
49perform complex state-transition-based text processing operations in parallel, understanding what bit streams are,
50how they are created, and --- more importantly --- how they are used by Parabix, is critical.
52% Our first generation parallel bitstream XML parser---Parabix1---uses employs a less conventional approach of
53% SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one-correspondence
54% with the bytes of a character stream. A transposition step first transforms sequential byte stream data into
55% eight basis bitstreams for the bits of each byte.
57% Parabix1 is functionally equivalent to a traditional XML parser.
58% That is, Parabix1 moves a single cursor sequentially through the source document and
59% parses it to distinguish between (and properly validate) the markup and content.
60% Where Parabix1 differs from traditional parsers is that instead of processing a document a byte-at-a-time, it
61% transforms the document into a set of \emph{bit streams} (explained shortly) and then performs boolean-logic operations
62% on those bit streams to process many bytes in parallel.
65%Each $1$-bit marks the postion of each key character in the
66%corresponding source data stream. A single stream is generated for each of the key markup characters.
68\subsection{What are Bit Streams?}
70A bit stream is simply a sequence of $\tt 1$s and $\tt 0$s.
71The significance of each bit value is dependent on the type of bit stream.
72We view bit streams in one of two ways: $n$ 1-bit values for boolean-logic operations and 1 $n$-bit value for integer-based math.
73For simplicity, assume that $n$ is infinitely long (i.e., unbounded) w.r.t. the size of the source data. In reality, each bit stream is divided into
74a series of $w$-bit blocks, where $w$ is equal to the SIMD register width of the system (e.g., 64, 128, or 256).
75We discuss how these $\lceil \frac{n}{w} \rceil$ bit blocks can be viewed as an unbounded bit stream in Section \ref{parabix tool chain}.
77The first type of bit streams used in Parabix are refered to as \emph{basis bit streams}, which contain the input source data.
78Parabix uses the basis bit streams to construct \emph{character-class bit streams} in which each $\tt 1$ bit indicates the presense
79of a significant character (or class of characters) in the parsing process.
80Character-class bit streams are used to compute \emph{lexical bit streams} and \emph{error bit streams}, which
81Parabix uses to process and validate the source document. The remainder of this section will discuss each type of bit stream.
83{\bf Basis Bit Streams:}
84To construct the basis bit streams, the source data is first loaded in
85sequential order and then transposed --- through a series of SIMD pack, shift, and bitwise operations ---
86so that Parabix can efficiently produce the character-class bit streams.
87Essentially, when the source data is in basis bit stream form, the $k$-th bit of $i$-th character in the source
88text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$.
89Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
90of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
91The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
92a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
93as UNICODE, may use a multiple code units to express all of its possible code points.
94How multi-code-unit characters can be parsed efficiently goes beyond the scope of this paper.
95The most dominant format in data-oriented XML documents is ASCII, which uses a 8-bit code unit to represent all of the
96128 code points within it.
97In Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
98streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
103\begin{tabular}{r c c c c }
104STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
105ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
110\begin{tabular}{r |c |c |c |c |c |c |c |c |}
111 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
112 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
113 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
114 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
115 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
120\caption{Example 7-bit ASCII Basis Bit Streams}
125% In Parabix1, basis bit streams are used to generate \ref{lexical bit streams} for key markup characters.
126% Basis bit streams
127% are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other
128% words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams.
129% Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit
130% ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented
131% by periods as to emphasize the $1$ bits.
133{\bf Character-class Bit Streams:}
134Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between
135data and metadata parsing.
136% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
137For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
138Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set
139of known code points and branching appropriately when one is found, typically using an if or switch statement.
140Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
141to do so correctly.
142% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
144Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation
145by using a series of boolean-logic operations to merge multiple basis bit streams into
146a single character-class stream that marks the positions of key characters with a $1$.
147For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
148Classes of characters can be found with similar formulas.
149For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
150An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute
151than individual characters.
152Finding optimal solutions to all character-classes is non-trivial and goes beyond the scope of this paper.
154% The advantage of character-class bit streams over traditional ``byte streams'' is that
156% Rather than testing each byte individually,
157% Parabix1 combines the basis bit streams using boolean-logic and computes a lexical bit stream.
160{\bf Lexical and Error Bit Streams:}
161To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
162a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
163Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
164Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
165issues found during the parsing process. The presense of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be
166trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity
167of the error.
169To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.
170The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
171and returns $c + c$ --- effectively moves each cursor one position to the ``right''.
172$\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
173$\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$.
174For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
175instances of it in the source text.
176%We know from this statement that only strings starting with \verb`<` that contain at least one letter and end with a \verb`>` are matches.
177We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
178$C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
179token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
180the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
181token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
182With additional post bit-stream processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can be removed or ignored; the details
183of which go beyond the scope of this paper.
189\multicolumn{3}{l}{source text} & \verb`<a><valid> <string>  <>ignored><error]` \\
190$C_{0}$ & $=$ & $\texttt{[a-zA-Z]}$ & \verb`.1..11111...111111.....1111111..11111.` \\
191$C_{1}$ & $=$ & $\texttt{[>]}$ & \verb`..1......1........1...1.......1.......` \\
192$C_{2}$ & $=$ & $\texttt{[<]}$ & \verb`1..1.......1.........1.........1......` \\
193$L_{0}$ & $=$ & $\texttt{Advance}(C_{2})$ & \verb`.1..1.......1.........1.........1.....` \\
194$E_{0}$ & $=$ & $L_{0} \land \lnot C_{0}$ & \verb`......................1...............` \\
195$L_{1}$ & $=$ & $\texttt{ScanThru}(L_{0}, C_{0})$ & \verb`..1......1........1...1..............1` \\
196$E_{1}$ & $=$ & $L_{1} \land \lnot C_{1}$ & \verb`.....................................1` \\
201\caption{Lexical Parsing in Parabix}
205% For example, using the source data from Figure \ref{fig:Parabix1StarttagExample},
207% Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates
208% the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2
209% begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric
210% character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream.
211% $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable
212% length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by
213% adding $M_1$ to $N$, and uses the $\land$ operation of $\lnot N$ to find only the true end tags of $M_1$.
214% Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional
215% details, refer to the technical report \cite{Cameron2010}.
217Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
218each parsing position are mostly eliminated, which, as Section XXX shows, minimizes branch misprediction penalties.
219Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading.
222\subsection{Parabix Tool Chain}
223\label{parabix tool chain}
225To support the Parabix framework, we are developing tool technology to automate various aspects
226of parallel bit stream programming. At present, our tool chain consists of two compilers: a character class
227compiler (ccc) and a unbounded bit stream to C/C++ block-at-a-time processing compiler (Pablo).
229The character class compiler is used to automatically produce bit stream logic for all the individual characters (e.g.,
230delimiters) and character classes (e.g., digits, letters) used in a particular application.
231Input is specified using a character class syntax adapted from the standard regular expression notations.
232Output is a minimized set of three-address bitwise operations, such as $a = b~\&~c$,
233to compute each of the character classes from the basis bit streams.
235For example, Figure \ref{fig:CCC} shows the input and output produced by the
236character class compiler for the example of \verb`[0-9]`
237discussed in the previous section.  The output operations
238may be viewed as operations on a single block of input
239at a time, or may be viewed as operations on unbounded bitstreams
240as supported by the Pablo compiler.
246\begin{tabular}{r l}
247\ttfamily{\bfseries INPUT:} & \verb`digit = [0-9]` \\
248\ttfamily{\bfseries OUTPUT:} & \verb`temp1 = (basis_bits.bit_0 | basis_bits.bit_1)` \\
249& \verb`temp2 = (basis_bits.bit_2 & basis_bits.bit_3)` \\
250& \verb`temp3 = (temp2 &~ temp1)` \\
251& \verb`temp4 = (basis_bits.bit_5 | basis_bits.bit_6)` \\
252& \verb`temp5 = (basis_bits.bit_4 & temp4)` \\
253& \verb`digit = (temp3 &~ temp5)`
258\caption{Character Class Compiler Input/Output}
262Pablo is a compiler that abstracts away the details of
263programming parallel bit stream code in terms of finite
264SIMD register widths and application buffer sizes.  Input
265to Pablo is a language for expressing bitstream operations
266on unbounded bitstreams.  The operations include bitwise
267logic, the {\tt Advance} and {\tt ScanThru} operations described in he
268previous subsection as well as if and while control structures.
269Pablo translates these operations to block-at-a-time
270code in C/C++.
271%, where the block size is the register width for SIMD operations on the selected target architecture.
272The key functionality of Pablo is to arrange for block-to-block
273carry bit propagation to implement the long bitstream shift
274and addition operations required by
275{\tt Advance} and {\tt ScanThru}.
279\subsection{Parabix Run-Time Libraries}
281The Parabix architecture also includes run-time libraries
282that support a machine-independent view of basic SIMD
283operations, as well as a set of core function libraries.
284For machine-independence, we program all operations using
285an abstract SIMD machine based on the Inductive Doubling
286Instruction Set Architecture (IDISA) \cite{CameronLin2009}.
287These operations were originally developed for 128-bit Altivec operations on Power PC
288as well as 64-bit MMX and 128-bit SSE operations on Intel
289but have recently extended to support
290the new 256-bit AVX operations on Intel as well as the 128-bit
291NEON operations on the ARM architecture. Further details
292are provided in later sections.
Note: See TracBrowser for help on using the repository browser.