source: docs/Working/icXML/background-parabix.tex @ 2505

Last change on this file since 2505 was 2505, checked in by nmedfort, 7 years ago

more edits

File size: 7.2 KB
Line 
1\subsection{The Parabix Framework}
2\label{background:parabix}
3
4The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
5(and other forms of text processing.) The key idea is to exploit the availability of wide
6(e.g., 128-bit) SIMD registers in commodity processors to represent data from long blocks
7of input data by using one register bit per single input byte.
8To facilitate this, the input data is first transposed into a set of basis bit streams.
9In Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb|<|A}''
10is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
11% The bits used to construct $\tt b_7$ have been highlighted in this example.
12Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.} 
13are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
14characters (or groups of characters) with a $1$.
15For example, one of the fundemental characters in XML is a left-angle bracket.
16A character is an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
17b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
18Similarly, a character is numeric
19{\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
20% An important observation here is that a range of characters can sometimes
21% take fewer operations and require fewer basis bit streams to compute
22% than individual characters. Finding optimal solutions to all
23% character-classes is non-trivial and goes beyond the scope of this
24% paper.
25
26\begin{figure}[tbh]
27\begin{center}
28\begin{tabular}{r c c c c }
29STRING & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
30ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
31\hline
32\end{tabular}
33\end{center}
34\begin{center}
35\begin{tabular}{r |c |c |c |c |c |c |c |c |}
36 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
37 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
38 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
39 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
40 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
41\end{tabular}
42\end{center}
43\caption{8-bit ASCII Basis Bit Streams}
44\label{fig:BitStreamsExample}
45\end{figure}
46
47% Using a mixture of boolean-logic and arithmetic operations, character-class
48% bit streams can be transformed into lexical bit streams, where the presense of
49% a 1 bit identifies a key position in the input data. As an artifact of this
50% process, intra-element well-formedness validation is performed on each block
51% of text.
52
53\begin{figure*}[tbh]
54\begin{center}
55\begin{tabular}{cr}\\
56Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
57Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
58Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
59End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
60Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
61Att Names & \verb`_______________________11_______11________________________1111_____________`\\
62Att Values & \verb`__________________________11111______11111_____________________111_________`
63\end{tabular}
64\end{center}
65\caption{XML Source Data and Derived Parallel Bit Streams}
66\label{fig:parabix1}
67\end{figure*}
68
69Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
70The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
71parsing, with each bit of each stream in one-to-one correspondence to the source character code units
72of the input stream.
73For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
74The first bit stream shown is that for the opening
75angle brackets that represent tag openers in XML.
76The second and third streams show a partition of the
77tag openers into start tag marks and end tag marks
78depending on the character immediately following the
79opener (i.e., ``\verb:/:'') or not.  The remaining three
80lines show streams that can be computed in subsequent
81parsing (using the technique
82of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
83attribute names and attribute values of tags. 
84
85Two intuitions may help explain how the Parabix approach can lead
86to improved XML parsing performance. The first is that
87the use of the full register width offers a considerable
88information advantage over sequential byte-at-a-time
89parsing.  That is, sequential processing of bytes
90uses just 8 bits of each register, greatly limiting the
91processor resources that are effectively being used at any one time.
92The second is that byte-at-a-time loop scanning loops are actually
93often just computing a single bit of information per iteration:
94is the scan complete at this position yet?  Rather than
95computing these individual decision-bits, an approach that computes
96many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
97should provide substantial benefit.
98
99Previous studies have shown Parabix approach improves many aspects of XML processing,
100including transcoding \cite{Cameron2008}, character classification and validation,
101tag parsing and well-formedness checking. 
102The first Parabix parser used processor bit scan instructions to considerably accelerate
103sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
104Recent work has incorporated a method of parallel
105scanning using bitstream addition \cite{cameron-EuroPar2011}, as
106well as combining SIMD methods with 4-stage pipeline parallelism to further improve
107throughput \cite{HPCA2012}.
108Although these research prototypes handle the full syntax of
109DTD-less XML documents, they lacked the functionality required by full XML parsers.
110Namely, commercial XML processors, such as Xerces,
111as support for transcoding of multiple character sets,
112the ability to parse and validate against DTDs, both internal and external,
113facilities for handling different XML vocabularies through namespace
114processing, as well validation against XML Schema grammars. 
115Additionally, commercial parsers can be expected to provide a number of API
116facilities beyond those found in research prototypes, including
117full implementations of the widely used SAX, SAX2 and DOM interfaces.
118
Note: See TracBrowser for help on using the repository browser.