source: docs/Working/icXML/background-parabix.tex @ 2439

Last change on this file since 2439 was 2439, checked in by nmedfort, 7 years ago

work on overview

File size: 7.7 KB
1\subsection{The Parabix Framework}
5Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
6Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
7Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
8End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
9Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
10Att Names & \verb`_______________________11_______11________________________1111_____________`\\
11Att Values & \verb`__________________________11111______11111_____________________111_________`
14\caption{XML Source Data and Derived Parallel Bit Streams}
18The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
19(and other forms of text processing.) The key idea is to exploit the availability of wide
20(e.g., 128-bit) SIMD registers in commodity processors to represent data from long blocks
21of input data by using one register bit per single input byte.
22To facilitate this, the input data is first transposed into a set of basis bit streams
23and then boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean
24AND, OR and NOT operators.} are used to classify the input bits into a set of character-class
25(and eventually lexical) bit streams.
26For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
27  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
28  \ldots 7}$. The bits used to construct $\tt b_7$ have been
29highlighted in this example.
33\begin{tabular}{r c c c c }
34STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
35ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
40\begin{tabular}{r |c |c |c |c |c |c |c |c |}
41 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
42 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
43 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
44 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
45 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
49\caption{8-bit ASCII Basis Bit Streams}
53Character-class bit streams allow us to perform 128
54comparisons in parallel with a single operation by using a series
55of boolean-logic operations to merge multiple
56basis bit streams into a single character-class stream that marks the
57positions of key characters with a $1$
58For example, one of the fundemental markers in XML is the one that
59identifies all left-angle brackets ``\verb`<`''. A character is
60an ``\verb`<`'' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
61b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
62Classes of characters can be found with similar formulas.  For example, a
63character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
64\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
65important observation here is that a range of characters can sometimes
66take fewer operations and require fewer basis bit streams to compute
67than individual characters. Finding optimal solutions to all
68character-classes is non-trivial and goes beyond the scope of this
72% Using a mixture of boolean-logic and arithmetic operations, character-class
73% bit streams can be transformed into lexical bit streams, where the presense of
74% a 1 bit identifies a key position in the input data. As an artifact of this
75% process, intra-element well-formedness validation is performed on each block
76% of text.
78Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
79The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
80parsing, with each bit of each stream in one-to-one correspondence to the source character code units
81of the input stream.
82For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
83The first bit stream shown is that for the opening
84angle brackets that represent tag openers in XML.
85The second and third streams show a partition of the
86tag openers into start tag marks and end tag marks
87depending on the character immediately following the
88opener (i.e., ``\verb:/:'') or not.  The remaining three
89lines show streams that can be computed in subsequent
90parsing, namely streams marking the element names,
91attribute names and attribute values of tags. 
93{\it Do we need to explain how those can be computed from the input text or do we simply refer them to prior papers?}
95Two intuitions may help explain how the Parabix approach can lead
96to improved XML parsing performance. The first is that
97the use of the full register width offers a considerable
98information advantage over sequential byte-at-a-time
99parsing.  That is, sequential processing of bytes
100uses just 8 bits of each register, greatly limiting the
101processor resources that are effectively being used at any one time.
102The second is that byte-at-a-time loop scanning loops are actually
103often just computing a single bit of information per iteration:
104is the scan complete at this position yet?  Rather than
105computing these bits one at a time, an approach that computes
106many of them in parallel (e.g., 128 with SSE registers) should
107provide substantial {\bf benefit}.
108Previous studies have shown the performance {\bf benefits} of the
109Parabix approach in many aspects of XML processing, including transcoding\cite{Cameron2008},
110character classification and validation, tag parsing and well-formedness
111checking.  The first Parabix parser used processor bit scan instructions
112to considerably accelerate sequential scanning loops for individual
113characters \cite{CameronHerdyLin2008}.
114Recent work has incorporated a method of parallel
115scanning using bitstream addition \cite{cameron-EuroPar2011}, as
116well as combining SIMD methods with 4-stage pipeline parallelism to further improve
117throughput \cite{HPCA2012}.
119Although these research prototypes handle the full syntax of
120DTD-less XML documents, including well-formedness checking, they fall
121short of the functionality required in full XML parser for several reasons. Namely,
122commercial XML processors, such as Xerces, include a number of additional facilities such
123as support for transcoding of multiple character sets,
124the ability to parse and validate against DTDs, both internal and external,
125facilities for handling different XML vocabularies through namespace
126processing, as well validation against XML schema.  In addition,
127commercial parsers can be expected to provide a number of API
128facilities beyond those found in research prototypes, including
129full implementations of the widely used SAX, SAX2 and DOM interfaces.
Note: See TracBrowser for help on using the repository browser.