source: docs/Working/icXML/background-parabix.tex @ 3633

Last change on this file since 3633 was 2872, checked in by nmedfort, 7 years ago


File size: 6.1 KB
1\subsection{The Parabix Framework}
4The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
5(and other forms of text processing.) The key idea is to exploit the availability of wide
6SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
7of input data by using one register bit per single input byte.
8To facilitate this, the input data is first transposed into a set of basis bit streams.
9In Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
10is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
11% The bits used to construct $\tt b_7$ have been highlighted in this example.
12Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.} 
13are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
14characters (or groups of characters) with a $1$.
15For example, one of the fundamental characters in XML is a left-angle bracket.
16A character is an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
17b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
18Similarly, a character is numeric
19{\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
20An important observation here is that ranges of characters may
21require fewer operations than individual characters and
22% the classification cost could be amortized over many character classes.
23multiple classes can share the classification cost.
27\begin{tabular}{r c c c c }
28String & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
29ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
34\begin{tabular}{r |c |c |c |c |c |c |c |c |}
35 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
36 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
37 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
38 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
39 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
42\caption{8-bit ASCII Basis Bit Streams}
46% Using a mixture of boolean-logic and arithmetic operations, character-class
47% bit streams can be transformed into lexical bit streams, where the presense of
48% a 1 bit identifies a key position in the input data. As an artifact of this
49% process, intra-element well-formedness validation is performed on each block
50% of text.
52Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
53The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
54parsing, with each bit of each stream in one-to-one correspondence to the source character code units
55of the input stream.
56For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
57The first bit stream shown is that for the opening
58angle brackets that represent tag openers in XML.
59The second and third streams show a partition of the
60tag openers into start tag marks and end tag marks
61depending on the character immediately following the
62opener (i.e., ``\verb:/:'') or not.  The remaining three
63lines show streams that can be computed in subsequent
64parsing (using the technique
65of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
66attribute names and attribute values of tags. 
68Two intuitions may help explain how the Parabix approach can lead
69to improved XML parsing performance. The first is that
70the use of the full register width offers a considerable
71information advantage over sequential byte-at-a-time
72parsing.  That is, sequential processing of bytes
73uses just 8 bits of each register, greatly limiting the
74processor resources that are effectively being used at any one time.
75The second is that byte-at-a-time loop scanning loops are actually
76often just computing a single bit of information per iteration:
77is the scan complete yet?
78Rather than computing these individual decision-bits, an approach that computes
79many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
80should provide substantial benefit.
82Previous studies have shown that the Parabix approach improves many aspects of XML processing,
83including transcoding \cite{Cameron2008}, character classification and validation,
84tag parsing and well-formedness checking. 
85The first Parabix parser used processor bit scan instructions to considerably accelerate
86sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
87Recent work has incorporated a method of parallel
88scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as
89well as combining SIMD methods with 4-stage pipeline parallelism to further improve
90throughput \cite{HPCA2012}.
91Although these research prototypes handled the full syntax of schema-less XML documents,
92they lacked the functionality required by full XML parsers.
94Commercial XML processors support transcoding of multiple character sets and can parse and
95validate against multiple document vocabularies.
96Additionally, they provide API facilities beyond those found in research prototypes,
97including the widely used SAX, SAX2 and DOM interfaces.
Note: See TracBrowser for help on using the repository browser.