source: docs/Working/icXML/background-parabix.tex @ 2470

Last change on this file since 2470 was 2470, checked in by nmedfort, 7 years ago

More work; mostly edits

File size: 7.7 KB
1\subsection{The Parabix Framework}
7Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
8Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
9Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
10End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
11Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
12Att Names & \verb`_______________________11_______11________________________1111_____________`\\
13Att Values & \verb`__________________________11111______11111_____________________111_________`
16\caption{XML Source Data and Derived Parallel Bit Streams}
20The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
21(and other forms of text processing.) The key idea is to exploit the availability of wide
22(e.g., 128-bit) SIMD registers in commodity processors to represent data from long blocks
23of input data by using one register bit per single input byte.
24To facilitate this, the input data is first transposed into a set of basis bit streams
25and then boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean
26AND, OR and NOT operators.} are used to classify the input bits into a set of character-class
27(and eventually lexical) bit streams.
28For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
29  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
30  \ldots 7}$. The bits used to construct $\tt b_7$ have been
31highlighted in this example.
35\begin{tabular}{r c c c c }
36STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
37ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
42\begin{tabular}{r |c |c |c |c |c |c |c |c |}
43 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
44 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
45 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
46 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
47 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
51\caption{8-bit ASCII Basis Bit Streams}
55Character-class bit streams allow us to perform 128
56comparisons in parallel with a single operation by using a series
57of boolean-logic operations to merge multiple
58basis bit streams into a single character-class stream that marks the
59positions of key characters with a $1$
60For example, one of the fundemental markers in XML is the one that
61identifies all left-angle brackets ``\verb`<`''. A character is
62an ``\verb`<`'' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
63b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
64Classes of characters can be found with similar formulas.  For example, a
65character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
66\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
67important observation here is that a range of characters can sometimes
68take fewer operations and require fewer basis bit streams to compute
69than individual characters. Finding optimal solutions to all
70character-classes is non-trivial and goes beyond the scope of this
74% Using a mixture of boolean-logic and arithmetic operations, character-class
75% bit streams can be transformed into lexical bit streams, where the presense of
76% a 1 bit identifies a key position in the input data. As an artifact of this
77% process, intra-element well-formedness validation is performed on each block
78% of text.
80Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
81The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
82parsing, with each bit of each stream in one-to-one correspondence to the source character code units
83of the input stream.
84For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
85The first bit stream shown is that for the opening
86angle brackets that represent tag openers in XML.
87The second and third streams show a partition of the
88tag openers into start tag marks and end tag marks
89depending on the character immediately following the
90opener (i.e., ``\verb:/:'') or not.  The remaining three
91lines show streams that can be computed in subsequent
92parsing, namely streams marking the element names,
93attribute names and attribute values of tags. 
95{\it Do we need to explain how those can be computed from the input text or do we simply refer them to prior papers?}
97Two intuitions may help explain how the Parabix approach can lead
98to improved XML parsing performance. The first is that
99the use of the full register width offers a considerable
100information advantage over sequential byte-at-a-time
101parsing.  That is, sequential processing of bytes
102uses just 8 bits of each register, greatly limiting the
103processor resources that are effectively being used at any one time.
104The second is that byte-at-a-time loop scanning loops are actually
105often just computing a single bit of information per iteration:
106is the scan complete at this position yet?  Rather than
107computing these bits one at a time, an approach that computes
108many of them in parallel (e.g., 128 with SSE registers) should
109provide substantial {\bf benefit}.
110Previous studies have shown the performance {\bf benefits} of the
111Parabix approach in many aspects of XML processing, including transcoding\cite{Cameron2008},
112character classification and validation, tag parsing and well-formedness
113checking.  The first Parabix parser used processor bit scan instructions
114to considerably accelerate sequential scanning loops for individual
115characters \cite{CameronHerdyLin2008}.
116Recent work has incorporated a method of parallel
117scanning using bitstream addition \cite{cameron-EuroPar2011}, as
118well as combining SIMD methods with 4-stage pipeline parallelism to further improve
119throughput \cite{HPCA2012}.
121Although these research prototypes handle the full syntax of
122DTD-less XML documents, including well-formedness checking, they fall
123short of the functionality required in full XML parser for several reasons. Namely,
124commercial XML processors, such as Xerces, include a number of additional facilities such
125as support for transcoding of multiple character sets,
126the ability to parse and validate against DTDs, both internal and external,
127facilities for handling different XML vocabularies through namespace
128processing, as well validation against XML schema.  In addition,
129commercial parsers can be expected to provide a number of API
130facilities beyond those found in research prototypes, including
131full implementations of the widely used SAX, SAX2 and DOM interfaces.
Note: See TracBrowser for help on using the repository browser.