source: docs/Working/icXML/background-parabix.tex @ 2516

Last change on this file since 2516 was 2516, checked in by cameron, 7 years ago

Updates, more streams to discuss

File size: 7.9 KB
1\subsection{The Parabix Framework}
4The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
5(and other forms of text processing.) The key idea is to exploit the availability of wide
6(e.g., 128-bit) SIMD registers in commodity processors to represent data from long blocks
7of input data by using one register bit per single input byte.
8To facilitate this, the input data is first transposed into a set of basis bit streams.
9In Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
10is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
11% The bits used to construct $\tt b_7$ have been highlighted in this example.
12Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.} 
13are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
14characters (or groups of characters) with a $1$.
15For example, one of the fundemental characters in XML is a left-angle bracket.
16A character is an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
17b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
18Similarly, a character is numeric
19{\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
20% An important observation here is that a range of characters can sometimes
21% take fewer operations and require fewer basis bit streams to compute
22% than individual characters. Finding optimal solutions to all
23% character-classes is non-trivial and goes beyond the scope of this
24% paper.
28\begin{tabular}{r c c c c }
29STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
30ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
35\begin{tabular}{r |c |c |c |c |c |c |c |c |}
36 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
37 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
38 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
39 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
40 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
43\caption{8-bit ASCII Basis Bit Streams}
47% Using a mixture of boolean-logic and arithmetic operations, character-class
48% bit streams can be transformed into lexical bit streams, where the presense of
49% a 1 bit identifies a key position in the input data. As an artifact of this
50% process, intra-element well-formedness validation is performed on each block
51% of text.
56Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
57Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
58Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
59End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
60Empty Tag Marks & \verb`___________________________________________________________________1_______`\\
61Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
62Att Names & \verb`_______________________11_______11________________________1111_____________`\\
63Att Values & \verb`___________________________111________111_______________________1__________`\\
64String ends & \verb`1_____1_______1____1__________1__________1_____1____1____________1__1______`\\
65String marks & \verb`0_____0_______0____0__________0__________0_____0____0____________0__0______`\\
66Transition marks & \verb`_____1___1_____1_________1_________1______1_____1_____________1____1_1_____`\\
67Transition chars & \verb`_____>___>_____/_________=_________=______>_____/_____________=____>_/_____`\\
68Deletion mask & \verb`_1111__11_______111_11111_1____1111_11___________111_111111111_1__1___11111`\\
69Undeleted & \verb`_____>___>text_/_________=_foo_____=__fie_>more_/_____________=_b__>_/_____`
72\caption{XML Source Data and Derived Parallel Bit Streams}
76Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
77The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
78parsing, with each bit of each stream in one-to-one correspondence to the source character code units
79of the input stream.
80For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
81The first bit stream shown is that for the opening
82angle brackets that represent tag openers in XML.
83The second and third streams show a partition of the
84tag openers into start tag marks and end tag marks
85depending on the character immediately following the
86opener (i.e., ``\verb:/:'') or not.  The remaining three
87lines show streams that can be computed in subsequent
88parsing (using the technique
89of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
90attribute names and attribute values of tags. 
92Two intuitions may help explain how the Parabix approach can lead
93to improved XML parsing performance. The first is that
94the use of the full register width offers a considerable
95information advantage over sequential byte-at-a-time
96parsing.  That is, sequential processing of bytes
97uses just 8 bits of each register, greatly limiting the
98processor resources that are effectively being used at any one time.
99The second is that byte-at-a-time loop scanning loops are actually
100often just computing a single bit of information per iteration:
101is the scan complete at this position yet?  Rather than
102computing these individual decision-bits, an approach that computes
103many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
104should provide substantial benefit.
106Previous studies have shown Parabix approach improves many aspects of XML processing,
107including transcoding \cite{Cameron2008}, character classification and validation,
108tag parsing and well-formedness checking. 
109The first Parabix parser used processor bit scan instructions to considerably accelerate
110sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
111Recent work has incorporated a method of parallel
112scanning using bitstream addition \cite{cameron-EuroPar2011}, as
113well as combining SIMD methods with 4-stage pipeline parallelism to further improve
114throughput \cite{HPCA2012}.
115Although these research prototypes handle the full syntax of
116DTD-less XML documents, they lacked the functionality required by full XML parsers.
117Namely, commercial XML processors, such as Xerces,
118as support for transcoding of multiple character sets,
119the ability to parse and validate against DTDs, both internal and external,
120facilities for handling different XML vocabularies through namespace
121processing, as well validation against XML Schema grammars. 
122Additionally, commercial parsers can be expected to provide a number of API
123facilities beyond those found in research prototypes, including
124full implementations of the widely used SAX, SAX2 and DOM interfaces.
Note: See TracBrowser for help on using the repository browser.