source: docs/Working/icXML/background-parabix.tex @ 2525

Last change on this file since 2525 was 2522, checked in by nmedfort, 7 years ago


File size: 7.8 KB
1\subsection{The Parabix Framework}
4The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
5(and other forms of text processing.) The key idea is to exploit the availability of wide
6SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
7of input data by using one register bit per single input byte.
8To facilitate this, the input data is first transposed into a set of basis bit streams.
9In Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
10is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
11% The bits used to construct $\tt b_7$ have been highlighted in this example.
12Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.} 
13are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
14characters (or groups of characters) with a $1$.
15For example, one of the fundemental characters in XML is a left-angle bracket.
16A character is an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
17b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
18Similarly, a character is numeric
19{\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
20An important observation here is that ranges of characters may
21require fewer operations than individual characters and multiple
22classes can sometimes share the classification cost.
26\begin{tabular}{r c c c c }
27STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
28ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
33\begin{tabular}{r |c |c |c |c |c |c |c |c |}
34 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
35 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
36 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
37 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
38 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
41\caption{8-bit ASCII Basis Bit Streams}
45% Using a mixture of boolean-logic and arithmetic operations, character-class
46% bit streams can be transformed into lexical bit streams, where the presense of
47% a 1 bit identifies a key position in the input data. As an artifact of this
48% process, intra-element well-formedness validation is performed on each block
49% of text.
54Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
55Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
56Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
57End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
58Empty Tag Marks & \verb`___________________________________________________________________1_______`\\
59Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
60Att Names & \verb`_______________________11_______11________________________1111_____________`\\
61Att Values & \verb`___________________________111________111_______________________1__________`\\
62String ends & \verb`1_____1_______1____1__________1__________1_____1____1____________1__1______`\\
63String marks & \verb`0_____0_______0____0__________0__________0_____0____0____________0__0______`\\
64Transition marks & \verb`_____1___1_____1_________1_________1______1_____1_____________1____1_1_____`\\
65Transition chars & \verb`_____>___>_____/_________=_________=______>_____/_____________=____>_/_____`\\
66Deletion mask & \verb`_1111__11_______111_11111_1____1111_11___________111_111111111_1__1___11111`\\
67Undeleted & \verb`_____>___>text_/_________=_foo_____=__fie_>more_/_____________=_b__>_/_____`
70\caption{XML Source Data and Derived Parallel Bit Streams}
74Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
75The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
76parsing, with each bit of each stream in one-to-one correspondence to the source character code units
77of the input stream.
78For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
79The first bit stream shown is that for the opening
80angle brackets that represent tag openers in XML.
81The second and third streams show a partition of the
82tag openers into start tag marks and end tag marks
83depending on the character immediately following the
84opener (i.e., ``\verb:/:'') or not.  The remaining three
85lines show streams that can be computed in subsequent
86parsing (using the technique
87of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
88attribute names and attribute values of tags. 
90Two intuitions may help explain how the Parabix approach can lead
91to improved XML parsing performance. The first is that
92the use of the full register width offers a considerable
93information advantage over sequential byte-at-a-time
94parsing.  That is, sequential processing of bytes
95uses just 8 bits of each register, greatly limiting the
96processor resources that are effectively being used at any one time.
97The second is that byte-at-a-time loop scanning loops are actually
98often just computing a single bit of information per iteration:
99is the scan complete at this position yet?  Rather than
100computing these individual decision-bits, an approach that computes
101many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
102should provide substantial benefit.
104Previous studies have shown Parabix approach improves many aspects of XML processing,
105including transcoding \cite{Cameron2008}, character classification and validation,
106tag parsing and well-formedness checking. 
107The first Parabix parser used processor bit scan instructions to considerably accelerate
108sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
109Recent work has incorporated a method of parallel
110scanning using bitstream addition \cite{cameron-EuroPar2011}, as
111well as combining SIMD methods with 4-stage pipeline parallelism to further improve
112throughput \cite{HPCA2012}.
113Although these research prototypes handled the full syntax of
114{\bf grammarless} XML documents, they lacked the functionality required by full XML parsers.
115Namely, commercial XML processors, such as Xerces,
116as support for transcoding of multiple character sets,
117the ability to parse and validate against DTD grammars, both internal and external,
118facilities for handling different XML vocabularies through namespace
119processing, as well validation against XML Schema grammars. 
120Additionally, commercial parsers can be expected to provide a number of API
121facilities beyond those found in research prototypes, including
122full implementations of the widely used SAX, SAX2 and DOM interfaces.
Note: See TracBrowser for help on using the repository browser.