2%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
3%
4% This is the LaTeX source for the instructions to authors using
5% the LaTeX document class 'llncs.cls' for contributions to
6% the Lecture Notes in Computer Sciences series.
7% http://www.springer.com/lncs       Springer Heidelberg 2006/05/04
8%
9% It may be used as a template for your own input - copy it
10% to a new file with a new name and use it as the basis
11% for your article.
12%
13% NB: the document class 'llncs' has its own and detailed documentation, see
14% ftp://ftp.springer.de/data/pubftp/pub/tex/latex/llncs/latex2e/llncsdoc.pdf
15%
16%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
17
18
19\documentclass[runningheads,a4paper]{llncs}
20\usepackage{multirow}
21\usepackage{amssymb}
22\setcounter{tocdepth}{3}
23\usepackage{graphicx}
24
25\usepackage{url}
26\urldef{\mailsa}\path|{alfred.hofmann, ursula.barth, ingrid.haas, frank.holzwarth,|
27\urldef{\mailsb}\path|anna.kramer, leonie.kunz, christine.reiss, nicole.sator,|
28\urldef{\mailsc}\path|erika.siebert-cole, peter.strasser, lncs}@springer.com|   
29\newcommand{\keywords}[1]{\par\addvspace\baselineskip
30\noindent\keywordname\enspace\ignorespaces#1}
31
32\begin{document}
33
34\mainmatter  % start of an individual contribution
35
36
37
38
39% first the title is needed
40\title{Parallel Scanning with Bitstream Addition: An XML Case Study}
41
42% a short form should be given in case it is too long for the running head
43\titlerunning{Parallel Scanning with Bitstream Addition}
44
45% the name(s) of the author(s) follow(s) next
46%
47% NB: Chinese authors should write their first names(s) in front of
48% their surnames. This ensures that the names appear correctly in
49% the running heads and the author index.
50%
51\author{Robert D. Cameron
52\thanks{}
53\and Ehsan Amiri \and Kenneth S. Herdy \and Dan Lin \and Thomas C. Shermer \and Fred P. Popowich}
54
55%
\authorrunning{Cameron {\em et al.}}
57
58% the affiliations are given next; don't give your e-mail address
59% unless you accept that it will be published
60\institute{Simon Fraser University, Surrey, BC, Canada\\
61\email{\{cameron, eamiri, ksherdy, lindanl, shermer, popowich\}@cs.sfu.ca}
62}
63
64\maketitle
65
66
67\begin{abstract}
68A parallel scanning method using the concept of bitstream addition is
69introduced and studied in application to the problem of XML
70parsing and well-formedness checking.   
71% The method parallelizes
72% finite-state transitions, using carry propagation to achieve up
73% to $W$ transitions with each $W$-bit binary addition operation.
74On processors supporting $W$-bit addition operations,
75the method can perform up to $W$ finite state transitions per instruction.
76The method is based on the concept of parallel bitstream technology,
77in which parallel streams of bits are formed such that each stream
78comprises bits in one-to-one correspondence with the character
79code units of a source data stream.    Parsing routines are initially
80prototyped in Python using its native support for unbounded
81integers to represent arbitrary-length  bitstreams.  A compiler
82then translates the Python code into low-level C-based implementations.
83These low-level implementations take advantage of
84the SIMD (single-instruction multiple-data) capabilities of commodity
85processors to yield a dramatic speed-up over
86traditional alternatives employing byte-at-a-time parsing.
87\keywords{SIMD text processing, parallel bitstream technology, XML, parsing}
88\end{abstract}
89
90
91\section{Introduction}
92
93
94
95Traditional byte-at-a-time parsing technology is increasingly
96mismatched to the capabilities of modern processors.   Current
97commodity processors generally possess 64-bit general purpose registers
98as well as 128-bit SIMD registers, with 256-bit registers now
99appearing.   General purpose processing on graphics processors
100can make available 512-bit or wider registers.   Parsing models
101based on the traditional loading and processing of 8 bits at a time
102would seem to be greatly underutilizing processor resources.
103
104Unfortunately, parsing is hard to parallelize.   Indeed, in their seminal
105report outlining the landscape of parallel computing research,
106researchers from Berkeley identified the finite state machine
107methods underlying parsing and lexical processing as the hardest
108of the "13 dwarves" to parallelize, concluding at one point that
109"nothing helps." \cite{Asanovic:EECS-2006-183}   SIMD methods, in particular, would seem to
110be ill-suited to parsing, because textual data streams are seldom organized in
111convenient 16-byte blocks, tending to consist instead of
112variable-length items in generally unpredictable patterns.
Nevertheless, there have been notable works, such as that of
Scarpazza, applying the multicore and SIMD capabilities of the
Cell/BE processor to regular expression matching \cite{Scarpazza:2009}.
116Intel has also signalled the importance of accelerated string
117processing to its customers through the introduction of new string processing
118instructions in the SSE 4.2 instruction set extension, demonstrating
119how those features may be used to advantage in activities such as
XML parsing \cite{XMLSSE42}.
121
122Our research has been exploring a promising alternative approach, however, based on
123the concept of {\em parallel bit streams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}.   
124In this approach, byte streams
125are first sliced into eight basis bit streams, one for each
126bit position within the byte.  Bit stream $i$ thus comprises
127the $i$th bit of each byte.   Using 128-bit SIMD registers, then,
bitwise logic operations on these basis bit streams allow
129byte classification operations to be carried out in parallel
130for 128 bytes at a time.  For example, consider a character class
131bit stream \verb:[<]: using a 1 bit to mark the position of
opening angle brackets in a byte stream.  This stream may be
computed as a logical combination of the basis bit streams using
134only seven bitwise logical operations per 128 bytes.
135
136Based on this approach, our prior work has shown how parallel
137bit streams may be used to accelerate XML parsing by further
138taking advantage of processor {\em bit scan} instructions, commonly
found within commodity processors \cite{CameronHerdyLin2008}.
140On current Intel or AMD processors, for example, these instructions
141allow one to determine the position of the first 1 bit in a group of 64
142in a single instruction.   Using these techniques, our Parabix 1
parser offered considerable acceleration of XML
144parsing in statistics gathering \cite{CameronHerdyLin2008} as well
145as GML to SVG conversion \cite{Herdy2008}.
146
147In this paper, we further increase the parallelism in our methods
148by introducing a new parallel scanning primitive using bitstream
149addition.   In essence, multiple 1 bits in a marker stream
150identify current scanning positions for multiple instances
151of a particular syntactic context within a byte stream.
152These multiple marker positions may each be independently
153advanced in parallel using addition and masking. 
154The net result is a new scanning primitive that
155allows multiple instances
156of syntactic elements to be parsed simultaneously.   For example,
157in dense XML markup, one might find several instances of particular
158types of markup tags within a given 64-byte block of text; parallel
159addition on 64-bit words allows all such instances to be processed at once.
160
161
162Other efforts to accelerate XML parsing include the use of custom
163XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and
164multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}.
165
166The remainder of this paper is organized as follows.
167Section 2 reviews the basics of parallel bitstream technology
168and introduces our new parallel scanning primitive.
Section 3 illustrates how this primitive may be used
in the lexical processing of XML references, including the
parallel identification of errors, and goes on
to consider the more complex task of XML tag parsing
beyond mere tokenization.
Building on these methods, Section 4 describes how to
construct a
complete solution to the problem of XML parsing and
well-formedness checking, in order
to gauge the applicability and power of the techniques.
Section \ref{sec:compile} then considers
the translation of high-level operations on unbounded bitstreams into
equivalent low-level code using SIMD intrinsics in the C programming
language.
Performance results are presented in Section 6, comparing
our generated implementations with
a number of existing alternatives.
186The paper concludes with
187comments on the current status of the work and directions for
188further research.
189
190\section{The Parallel Bitstream Method}\label{sec:parabit}
191
192\begin{figure}[h]
193\begin{center}
194\begin{tabular}{cr}\\
195source data $\vartriangleleft$ & \verb`----173942---654----1----49731----321--`\\
196$B_7$ & \verb`.......................................`\\
197$B_6$ & \verb`.......................................`\\
198$B_5$ & \verb`111111111111111111111111111111111111111`\\
199$B_4$ & \verb`....111111...111....1....11111....111..`\\
200$B_3$ & \verb`1111...1..111...1111.1111.1...1111...11`\\
201$B_2$ & \verb`1111.1..1.1111111111.11111.1..1111...11`\\
202$B_1$ & \verb`.....11..1...1.............11.....11...`\\
203$B_0$ & \verb`11111111..111.1.111111111.111111111.111`\\
204\verb:[0-9]: & \verb`....111111...111....1....11111....111..`\\
205\end{tabular}
206\end{center}
207\caption{Basis and Character-Class Bitstreams}
208\label{fig:inputstreams}
209\end{figure}
210
211\subsection{Fundamentals}
212
213A bitstream is simply a sequence of $0$s and $1$s, where there is one such bit in the bitstream for each character in a source data stream.
214For parsing, and other text processing tasks, we need to consider multiple properties of characters at different stages during the parsing process. A bitstream can be associated with each of these properties, and hence there will be multiple (parallel) bitstreams associated with a source data stream of characters \cite{Cameron2009,PPoPP08}.
215
216The starting point for bitstream methods are \emph{basis} bitstreams
217and their use in determining \emph{character-class} bitstreams.
218The $k$th basis bitstream $B_k$ consists of the $k$th bit (0-based, starting at the LSB)
219of each character in the source data stream;
220thus each $B_k$ is dependent on the encoding of the source characters (ASCII, UTF-8, UTF-16, etc.).
221Given these basis bit streams, it is then possible to combine them
222using bitwise logic in order to compute character-class
223bit streams, that is, streams that identify the positions at which characters belonging
224to a particular class occur.  For example, the character class bitstream
225$D=$\verb:[0-9]: marks with $1$s the positions at which decimal digits
226occur.    These bitstreams are illustrated in Figure \ref{fig:inputstreams},
227for an example source data stream consisting of digits and hyphens.
228This figure also illustrates some of our conventions for figures:  the left triangle $\vartriangleleft$ after
229``source data'' indicates that all streams are read from right to left
230(i.e., they are in little-endian notation).  We also use hyphens
in the input stream to represent any character that is not relevant to a character
232class under consideration, so that relevant characters stand out.
233Furthermore, the $0$ bits in the bitstreams are represented by periods,
234so that the $1$ bits stand out.
235
236
237Transposition of source data to basis bit streams and calculation
238of character-class streams in this way is an overhead on parallel bit
239stream applications, in general.   However, using the SIMD
240capabilities of current commodity processors, these operations are quite
241fast, with an amortized overhead of about 1 CPU cycle per byte for
242transposition and less than 1 CPU cycle per byte for all the character
243classes needed for XML parsing \cite{CameronHerdyLin2008}.
244Improved instruction sets using parallel extract operations or
245inductive doubling techniques may further reduce this overhead significantly \cite{CameronLin2009,HilewitzLee2006}.
246
247Beyond the bitwise logic needed for character class determination,
248we also need \emph{upshifting} to deal with sequential combination.
249The upshift $n(S)$ of a bitstream $S$ is obtained by shifting the bits in $S$ one position forward,
250then placing a $0$ bit in the starting position of the bitstream; $n$ is meant to be mnemonic of ``next''.
251In $n(S)$, the last bit of $S$ may be eliminated or retained for error-testing purposes.
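The following minimal Python sketch illustrates these definitions in the
unbounded-integer prototyping model used later in the paper, with bit $i$ of an
integer holding the stream value at position $i$.  The function names are
purely illustrative; the production implementations compute these streams
blockwise with SIMD operations rather than with byte-at-a-time loops.
\begin{verbatim}
def transpose(data):
    # data: a byte string; bit i of B[k] is bit k of data[i].
    B = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                B[k] |= 1 << i
    return B

def digit_class(B):
    # [0-9]: ASCII 0x30--0x39, i.e. high nibble 0011 and low nibble <= 9.
    high = ~B[7] & ~B[6] & B[5] & B[4]
    return high & ~(B[3] & (B[2] | B[1]))

def n(S):
    # Upshift: move every bit one position forward to the "next" position.
    return S << 1
\end{verbatim}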
252
253\subsection{A Parallel Scanning Primitive}
254
255In this section, we introduce the principal new feature of the paper,
256a parallel scanning method based on bitstream addition.   Key to this
257method is the concept of {\em marker} bitstreams. 
258Marker bitstreams are used to represent positions of interest in the
259scanning or parsing of a source data stream.
260The appearance of a 1 at a position in a marker bitstream could, for example, denote
261the starting position of an XML tag in the data stream.   In general, the set of
262bit positions in a marker bitstream may be considered to be the current
263parsing positions of multiple parses taking place in parallel throughout
264the source data stream.
265
266Figure \ref{fig:scan1} illustrates the basic concept
267underlying parallel parsing with bitstream addition.
268As with the previous figures, all streams are shown in little-endian
269representation, with streams reading from right-to-left.
270The first row shows a source data stream that includes three
271spans of digits, $13840$, $1139845$, and $127$, with other nondigit characters shown
272as hyphens.  The second row specifies the parsing problem
273using a marker bitstream $M_0$ to mark three initial
274marker positions at the start of each span of digits.
275The parallel parsing task is to move each
276of the three markers forward through the corresponding spans of
277digits to the immediately following positions.
278
279\begin{figure}[tbh]
280\begin{center}
281\begin{tabular}{l@{}lr}\\
282\multicolumn{2}{l}{source data $\vartriangleleft$}
283                          & \verb`--721----5489311-----04831------`\\
284$M_0$ &                   & \verb`....1..........1.........1......`\\
285$D$   & $= \verb`[0..9]`$ & \verb`..111....1111111.....11111......`\\
286$M_1$ & $ = M_0 + D$      & \verb`.1......1...........1...........`
287\end{tabular}
288\end{center}
289\caption{Bitstream addition}
290\label{fig:scan1}
291\end{figure}
292
293The third row of Figure \ref{fig:scan1}
294shows the derived character-class bitstream $D$ identifying
295positions of all digits in the source stream. 
296The fourth row then illustrates the key concept: marker movement
297is achieved by binary addition of the marker and character
298class bitstreams.  As a marker 1 bit is combined using binary addition to
299a span of 1s, each 1 in the span becomes 0, generating
300a carry to add to the next position to the left. 
301For each span, the process terminates at the left end
302of the span, generating a 1 bit in the immediately
303following position.   In this way, binary addition produces the marker bitstream
304$M_1$, with each of the three markers
305moved independently through their respective spans of digits to the position at the end.
306
307However, the simple addition technique shown in Figure \ref{fig:scan1}
308does not account for digits in the source stream that do
309not play a role in a particular scanning operation.
310Figure \ref{fig:scan2} shows an example and how this
311may be resolved.   The source data stream is again shown in row 1,
312and the marker bitstream defining the initial marker positions
for the parallel parsing tasks shown in row 2.
314Row 3 again contains the character class bitstream for digits $D$.
315Row 4 shows the result of bitstream addition, in which
316marker bits are advanced, but additional bits not
317involved in the scan operation are included in the result.
318However, these are easily removed in row 5, by masking
319off any bits from the digit bitstream; these can never
320be marker positions resulting from a scan.
321
322\begin{figure}[tbh]
323\begin{center}
324\begin{tabular}{l@{}lr}\\
325\multicolumn{2}{l}{source data $\vartriangleleft$}     
326                                                        & \verb`--134--31--59127---3--3474--`\\
327$M_0$ &                                                 & \verb`....1.........1..........1..`\\
328$D$   & $= \verb`[0..9]`$ & \verb`..111..11..11111...1..1111..`\\
329$M_1$ & $= M_0 + D$       & \verb`.1.....11.1....1...1.1......`\\
330$M_2$ & $= (M_0 + D) \wedge \neg D$ & \verb`.1........1..........1......`
331\end{tabular}
332\end{center}
333\caption{Parallel Scan Using Addition and Mask}
334\label{fig:scan2}
335\end{figure}
336
337The addition and masking technique allows matching of
338the regular expression \verb:[0-9]*: for any reasonable
339(conflict-free) set of initial markers specified in $M_0$.
340A conflict occurs when a span from one marker would run
341into another marker position.   However, such conflicts
342do not occur with the normal methods of marker bitstream
343formation, in which unique syntactic features of
344the input stream are used to specify the initial marker
345positions.
346
347In the remainder of this paper, the notation $s(M, C)$
348denotes the operation to scan
349from an initial set of marker positions $M$ through
350the spans of characters belonging to a character class $C$ found at each position.
\[s(M, C) = (M + C)  \wedge \neg C\]
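In the unbounded-integer prototyping model, $s$ has a direct one-line
rendering, sketched below with an illustrative function name.  Applied to the
$M_0$ and $D$ streams of Figure \ref{fig:scan2}, it yields the stream
labelled $M_2$.
\begin{verbatim}
def s(M, C):
    # Scan each marker in M through the span of class-C positions
    # beginning at that marker; each result bit lands just past its span.
    return (M + C) & ~C
\end{verbatim}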
352
353
354\section{Parsing and Error Streams}
355\label{sec:errorstream}
356
357Now consider how the parallel scanning primitive can
358be applied to the following problem: parse all occurrences
359of XML decimal character references according to the
360grammar of Figure \ref{fig:decrefgrmr} and identify any
361errors.  This grammar is simplified to omit other
362forms of XML reference including hexadecimal character
363references as well as general entity references.
364For the time being, we assume that our XML documents contain
365only the decimal references and no other use of the ``\verb:&:''
366character.
367
368\begin{figure}[tbh]
369\begin{center}
370\begin{tabular}{rcl}
371DecRef & ::=   &        '\verb:&#:' Digit+ '\verb:;:'  \\
372Digit  & ::=   &         \verb:[0-9]:
373\end{tabular}
374\end{center}
375\caption{Grammar of Decimal Character References}
376\label{fig:decrefgrmr}
377\end{figure}
378
379Figure \ref{fig:decref} shows the parallel parsing of
380decimal references together with error checking.
381The source data includes four instances of potential
382decimal references beginning with the \verb:&: character.
383Of these, only the first one is legal according to
the decimal reference syntax; the other three instances
385are in error.   These references may be parsed in
386parallel as follows.  The
387starting marker bitstream $M_0$ is formed from the \verb:[&]:
388character-class bitstream as shown in the second row.  The next row shows the
389result of the marker advance operation $n(M_0)$ to
390produce the new marker bitstream $M_1$.  At this point,
391a hash mark is required, so the first error bitstream $E_0$ is
392formed using a bitwise ``and'' operation combined with negation,
393to indicate violations of this condition.
394Marker bitstream $M_2$ is then defined as those positions
395immediately following any $M_1$ positions not in error.
396In the following row, the condition that at least
397one digit is required is checked to produce error bitstream $E_1$.
398A parallel scan operation is then applied through the
399digit sequences as shown in the next row to produce
400marker bitstream $M_3$.  The final error bitstream $E_2$ is
401produced to identify any references without a
402closing semicolon.
403In the penultimate row, the final marker bitstream $M_4$ marks the
404positions of all fully-checked decimal references, while the
405last row defines a unified error bitstream $E$ 
406indicating the positions of all detected errors.
407
408\begin{figure}[tbh]
409\begin{center}
410\begin{tabular}{l@{}lr}\\
411\multicolumn{2}{l}{source data $\vartriangleright$}     
412                                         & \verb`-&#978;-&9;--&#;--&#13!-`\\
413$M_0$ &                                  & \verb`.1......1....1....1.....`\\
414$M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
415$E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
416$M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
417$E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
418$M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
419$E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
420$M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
421$E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
422\end{tabular}
423\end{center}
424\caption{Parsing Decimal References}
425\label{fig:decref}
426\end{figure}
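The steps of Figure \ref{fig:decref} transcribe directly into the
unbounded-integer prototyping style.  The sketch below assumes the helpers
$n$ and $s$ of Section \ref{sec:parabit}; the parameter names are
illustrative and stand for the character-class bitstreams of
\verb:&:, \verb:#:, \verb:[0-9]: and \verb:;:.
\begin{verbatim}
def parse_decimal_refs(Amp, Hash, D, Semi):
    M0 = Amp
    M1 = n(M0)
    E0 = M1 & ~Hash            # '#' must follow '&'
    M2 = n(M1 & ~E0)
    E1 = M2 & ~D               # at least one digit is required
    M3 = s(M2 & ~E1, D)        # scan through the digit sequence
    E2 = M3 & ~Semi            # ';' must close the reference
    M4 = M3 & ~E2              # fully checked references
    return M4, E0 | E1 | E2    # markers and unified error stream
\end{verbatim}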
427
428
429\subsection{Marker Stream Initialization}
430
431How are marker bitstreams initialized?   In general,
432this is an important problem, and dependent on the task at hand.   
433In the XML parsing context,
434we rely on an important property of well-formed
435XML: after an initial filtering pass to identify
436XML comments, processing instructions and CDATA
437sections, every remaining \verb:<: in the
438file must be the initial character of a start,
439end or empty element tag, and every remaining \verb:&:
440must be the initial character of a general entity
441or character reference. These assumptions permit easy creation of marker bitstreams for XML tags and XML entities.
442
443\subsection{Parallel Parsing of Sequential Structures}
444
445The critical and most interesting application of our parallel
446parsing techniques to XML is in parsing of the start, end and empty element
447tags that comprise the core of XML markup.   
448In particular, start tags have an iterative internal structure
449as shown in the grammar of Figure \ref{fig:stag-grmr}.  After an
450opening angle bracket and tag name, a tag may have multiple
451attribute-value pairs with values enclosed in single or double
452quotes.  Using the bitstream addition technique, our method
453is to start with the opening angle bracket of all tags as
454the initial marker bitstream for parsing the tags in parallel,
455advance through the element name and then use an iterative
456process to move through attribute-value pairs.
457
458Figure \ref{fig:stag-ex}
459illustrates the parallel parsing of four XML start tags.
460The figure omits determination
461of error bitstreams, processing of single-quoted attribute values and handling
462of empty element tags, for simplicity.  In this figure, the first
463four rows show the source data and three character class bitstreams:
464$N$ for characters permitted in XML names,
465$W$ for whitespace characters,
466and $Q$ for characters permitted within a double-quoted attribute value string. 
467
468\begin{figure}[tbh]
469\begin{center}
470\begin{tabular}{rcl}
471STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
472Attribute & ::=   &        Name WS? '=' WS? AttValue \\
473AttValue  &           ::=   &        `\verb:":' \verb:[^<"]*: `\verb:":' $|$ ``\verb:':'' \verb:[^<']*: ``\verb:':'' \\
474%DQuoted & ::= & \verb:[^<"]*:  \\
475%SQuoted & ::= & \verb:[^<']*:
476\end{tabular}
477\end{center}
478\caption{Grammar of XML Start Tags}
479\label{fig:stag-grmr}
480\end{figure}
481
482
483\begin{figure*}[tbh]
484\begin{center}\footnotesize
485
486\begin{tabular}{l@{}lr}\\
487\multicolumn{2}{l}{source data $\vartriangleright$}  & \verb`-<d>--<ex1 a="17" b="33">---<ex1 a= "137" b ="xxx">----<c12 alpha="">---`\\
488$N$   & $= $\mbox{name chars} & \verb`1.1.11.111.1..11..1..11..111.111.1...111..1...111..1111.111.11111....111`\\
489$W$   & $= $\mbox{white space} & \verb`..........1......1..............1..1.....1.1...............1............`\\
490$Q $   & $ = \neg$\verb:[">]: & \verb`1.1111.111111.11.111.11.1111.1111111.111.1111.111.11111.1111111111..1111`\\
491\\
492$M_0 $   & $ = $\verb:[<]: & \verb`.1....1.....................1..........................1................`\\
493$M_1 $   & $ = n(M_0)$ & \verb`..1....1.....................1..........................1...............`\\
494$M_{0,7} $   & $= s(M_1, N)$ & \verb`...1......1.....................1..........................1............`\\
495$M_{0,8} $   & $= s(M_{0,7}, W) \wedge \neg$\verb:[>]: & \verb`...........1.....................1..........................1...........`\\
496\\
497$M_{1,1} $   & $= s(M_{0,8}, N)$ & \verb`............1.....................1..............................1......`\\
498$M_{1,2} $   & $= s(M_{1,1}, W) \wedge$\verb:[=]: & \verb`............1.....................1..............................1......`\\
499$M_{1,3} $   & $= n(M_{1,2})$ & \verb`.............1.....................1..............................1.....`\\
$M_{1,4} $   & $= s(M_{1,3}, W) \wedge$\verb:["]: & \verb`.............1......................1.............................1.....`\\
501$M_{1,5} $   & $= n(M_{1,4})$ & \verb`..............1......................1.............................1....`\\
502$M_{1,6} $   & $= s(M_{1,5}, Q) \wedge$\verb:["]: & \verb`................1.......................1..........................1....`\\
503$M_{1,7} $   & $= n(M_{1,6})$ & \verb`.................1.......................1..........................1...`\\
504$M_{1,8} $   & $= s(M_{1,7}, W) \wedge \neg$\verb:[>]: & \verb`..................1.......................1.............................`\\
505\\
506$M_{2,1} $   & $= s(M_{1,8}, N)$ & \verb`...................1.......................1............................`\\
507$M_{2,2} $   & $= s(M_{2,1}, W) \wedge$\verb:[=]: & \verb`...................1........................1...........................`\\
508$M_{2,3} $   & $= n(M_{2,2})$ & \verb`....................1........................1..........................`\\
509$M_{2,4} $   & $= s(M_{2,3}, W) \wedge$\verb:["]: & \verb`....................1........................1..........................`\\
510$M_{2,5} $   & $= n(M_{2,4})$ & \verb`.....................1........................1.........................`\\
511$M_{2,6} $   & $= s(M_{2,5}, Q) \wedge$\verb:["]: & \verb`.......................1.........................1......................`\\
512$M_{2,7} $   & $= n(M_{2,6})$ & \verb`........................1.........................1.....................`\\
513$M_{2,8} $   & $= s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb`........................................................................`
514
515\end{tabular}
516
517\end{center}
518
519
520\caption{Start Tag Parsing}
521\label{fig:stag-ex}
522\end{figure*}
523
524
525The parsing process is then illustrated in the remaining rows of the
526figure.    Each successive row shows the set of parsing markers as they
527advance in parallel using bitwise logic and addition.
528Overall, the sets of marker transitions can be divided into three groups.
529
530The first group
531$M_0$ through $M_{0,8}$ shows the initiation of parsing for each of the
532four tags through the
533opening angle brackets and  the element names, up to the first
attribute name, if present.  Note that there are no attribute names
535in the first tag shown, so the corresponding marker becomes zeroed
536out at the closing angle bracket.
537Since $M_{0,8}$ is not all $0$s, the parsing continues.
538
539The second group of marker transitions
540$M_{1,1}$ through $M_{1,8}$ deal with the parallel parsing of the first attribute-value
541pair of the three remaining tags.
542After these operations, there are no more attributes
543in the final tag, so its corresponding marker becomes zeroed out.
However, $M_{1, 8}$ is not all $0$s, as each of the middle two tags still has an unparsed attribute-value pair.
545Thus, the parsing continues.
546
547The third group of marker transitions $M_{2,1}$ through $M_{2,8}$ deal with the parsing of
548the second attribute-value pair of each of the two remaining tags.  The
549final transition to $M_{2,8}$ shows the zeroing out of all remaining markers
550once two iterations of attribute-value processing have taken place.
551Since $M_{2,8}$ is all $0$s, start tag parsing stops.
552
553The implementation of start tag processing uses a while loop that
554terminates when the set of active markers becomes zero,
i.e.\ when $M_{k, 8} = 0$ for some $k$.
556Considered
557as an iteration over unbounded bitstreams, all start tags in the document
558are processed in parallel, using a number of iterations equal to the maximum
559number of attribute-value pairs in any one tag in the document.   
560However, the cost of iteration is considerably reduced; the iteration for
561each block only requires as many steps as there are attribute-value pairs
562overlapping the block.
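A sketch of this loop in the same prototyping style is shown below, with
error-stream formation and single-quoted values omitted as in
Figure \ref{fig:stag-ex}.  The parameter names are illustrative:
\verb:LAngle:, \verb:RAngle:, \verb:Eq: and \verb:DQuote: stand for the
character-class bitstreams of \verb:<:, \verb:>:, \verb:=: and \verb:":,
while \verb:N:, \verb:W: and \verb:Q: are the name, whitespace and
quoted-content classes of the figure.
\begin{verbatim}
def parse_start_tags(LAngle, RAngle, Eq, DQuote, N, W, Q):
    attr_name_starts = 0
    M1 = n(LAngle)                  # positions after each '<'
    M7 = s(M1, N)                   # scan the element names
    M8 = s(M7, W) & ~RAngle         # first attribute name, if any
    while M8 != 0:
        attr_name_starts |= M8
        M1 = s(M8, N)               # scan the attribute name
        M2 = s(M1, W) & Eq          # '=' is required
        M3 = n(M2)
        M4 = s(M3, W) & DQuote      # opening quote
        M5 = n(M4)
        M6 = s(M5, Q) & DQuote      # closing quote
        M7 = n(M6)
        M8 = s(M7, W) & ~RAngle     # next attribute name, if any
    return attr_name_starts
\end{verbatim}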
563
564
565
566
567%\subsection{Name Scans}
568%To illustrate the scanning of the name found in an XML start tag,
569%let us consider a sequence that might be found in an HTML file,
570%\verb:<div id="myid">:,
571%which is shown as the source data stream in Figure \ref{fig:stag-scan}.
572
573%\begin{figure}[tbh]
574%\begin{center}
575%\begin{tabular}{cr}\\
576%source data & \verb:<div id="myid">:\\
577%$M_0$ & \verb`1..............`\\
578%$C_0$ & \verb`.111.11...1111.`\\
579%$M_1 = n(M_0)$ & \verb`.1.............`\\
580%$M_2 = s(M_1, D_0) \wedge \neg C_0$ & \verb`....1.........`\\
581%lastline & \verb`...............`
582%\end{tabular}
583%\end{center}
584%\caption{Scanning Names}
585%\label{fig:stag-scan}
586%\end{figure}
587
588%If we set our initial marker bitstream according to the procedure outlined in our discussion of marker bitstream initialization, we %obtain the bitstream $M_0$.
589%According to the grammar in Figure \ref{fig:stag-grmr}, we can then look for a \verb:Name: in an \verb:STag: after we have found a %5\verb:<:.
590%So, $M_1$ is the marker bitstream for the starting position of our name.
591%Although we do not know the length of the name, the $C_0$ bit vector can easily be set to $1$ for the characters that can be contained %in a name.
592%We can then use the scan function in a manner similar to how it was used in Figure \ref{fig:scan2} to scan through the entire name to %identify its end position.
593
594
595\subsection{Mask Formation with Bitstream Subtraction}
596%\subsection{Comment, CDATA and Processing Instructions}
597\label{sec:maskstream}
598
599For various purposes in parsing, it may be necessary to
600introduce {\em mask bitstreams}, streams that identify spans
601of positions that are to be selected or excluded from processing
602in some fashion.
603
604In the case of XML processing, one important use of mask
605bitstreams is to filter out those \verb:&: and \verb:<:
606characters that occur within comments,
607CDATA sections and processing instructions and hence
608do not indicate starting marker positions
609for references or tags, respectively. Each of these
610has a relatively simple structure comprising primarily
611specific opening and closing delimiters: \verb;<!--;
612and \verb:-->: for comments, \verb:<![CDATA[:
613and \verb:]]>: for CDATA sections and \verb:<?:
614and \verb:?>: for processing instructions.   Processing
615instructions also have a small amount of internal structure
616consisting of a name that identifies the target of the
617processing instruction followed by optional parameter text.
618
619The content of each of these items is relatively unconstrained
620and may contain what appears to be XML markup of other kinds.
621This makes it impossible to reliably parse all instances of these
622types of markup using parallel techniques. 
623However, we can
624still use bitstream addition for the sequential parsing of
625these items from the beginning of the file.  In this case,
626instead of initializing a marker bitstream using specific marker
627symbols found throughout the file, the marker bitstream is
628initialized with a single 1 bit at the file start position.
629
630Nevertheless, parsing of comments, CDATA sections and
631processing instructions generally proceeds quite quickly in a
632sequential fashion.   Parsing steps generally involve long
633scans to an opening delimiter of one of these constructs,
634followed by further long scans through the content of the
635comment, CDATA section or processing instruction.
636
637\begin{figure*}[tbh]
638\begin{center}
639\begin{tabular}{cr}\\
640source data $\vartriangleright$ & \verb`<!-- <<<< --> <?php 1<2 ?> <t> <![CDATA[ <demo/> ]]>.`\\
641$M_i$ & \verb`_1_____________1________________1____________________`\\
642$M_f$ & \verb`____________1____________1_________________________1_`\\
$m = n(M_f) - M_i$ & \verb`_111111111111__11111111111______11111111111111111111_`\\
644$M_1 = n(\verb:[<]:) \wedge \neg m$ & \verb`____________________________1________________________`
645\end{tabular}
646\end{center}
647\caption{Mask Formation for Comments, CDATA and PI}
648\label{fig:CtCDPImask}
649\end{figure*}
650
651A single pass scan is made to identify these structures
652within the document.   Once complete, a mask bitstream
653of 1 bits is formed to identify the union of the interiors
654of these structures.   Figure \ref{fig:CtCDPImask}
655illustrates.   Here, we assume this pass is complete
656to produce the initial and final positions of these
657constructs as $M_i$ and $M_f$.
In the figure, $M_i$ is actually the set of upshifted start positions
659for comments, CDATA sections and processing instructions.
Note that the upshifting is required to properly match the
661delimiter $n(\verb`[<]`) \wedge \verb`[!?]`$.
662$M_f$ is the set of final positions determined by parsing.
663This then allows the mask bitstream $m$ to be computed by
664bitstream subtraction.   (To perform the subtraction,
imagine reversing $M_i$ and $M_f$ to produce the
666internal little-endian representation.)   Note that this
667stream does not allow us to mask out the opening \verb:<:
of these constructs themselves.   However, we can apply the mask
to $n(\verb`[<]`)$ to produce $M_1$ as shown.  This
is adequate as an initial marker bitstream for tag
parsing, since the formation of $M_1 = n(\verb`[<]`)$ is the
first step shown in Figure \ref{fig:stag-ex}.
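In the unbounded-integer model the subtraction is ordinary integer
subtraction.  A minimal sketch, assuming $M_i$ and $M_f$ have been produced
by the sequential pass described above and reusing the illustrative helper
$n$, is:
\begin{verbatim}
def interior_mask(Mi, Mf):
    # Span from each upshifted start position through the matching
    # final delimiter position, by bitstream subtraction.
    return n(Mf) - Mi

def tag_marker_init(LAngle, Mi, Mf):
    # Initial marker bitstream for tag parsing, excluding '<' characters
    # inside comments, CDATA sections and processing instructions.
    return n(LAngle) & ~interior_mask(Mi, Mf)
\end{verbatim}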
673
674\subsection{Python Prototyping}
675
676Taking advantage of Python's built-in support for
677unbounded integers to represent arbitrary-size bitstreams,
678we have implemented a complete parsing prototype for XML
679using only bitstream addition, subtraction and bitwise logic.
680The \verb:ParallelScanThru: operation is a straightforward
681implementation of the scanning primitive $s$ as shown
682previously.   The operation $n$ is simply implemented using
683an upshift operation, while subtraction and bitwise logic
684are directly supported.
685
686We have also used a modified version of this prototype as the
687input language of an experimental bitstream compiler that
688we have developed.   This compiler implements the compilation to
block-by-block processing presented in Section \ref{sec:compile}.
690
691
692\section{XML Well-Formedness}
693
694In this section, we consider the full application of the parsing techniques
695of the previous section to the problem of XML well-formedness checking \cite{TR:XML}.
696This application is useful as a well-defined and commercially significant
697example to assess the overall applicability of parallel bit stream techniques. 
698To what extent can the well-formedness requirements of XML be
699completely discharged using parallel bitstream techniques?
700Are those techniques worthwhile in every instance, or
701do better alternatives exist for certain requirements?
702For those requirements that cannot be fully implemented
703using parallel bitstream technology alone, what
704preprocessing support can be offered by parallel bit stream
705technology to the discharge of these requirements in other ways?
706We address each of these questions in this section,
707and look not only at the question of well-formedness, but also at
708the identification of error positions in documents that
709are not well-formed.
710
711
712\subsection{Error and Error-Check Bitstreams}
713
714Most of the requirements of XML well-formedness checking
715can be implemented using two particular types of computed
716bitstream: {\em error bitstreams}, introduced in the previous section, and {\em error-check bitstreams}.
Recall that an error bitstream is a stream marking the location of definite errors in accordance with
718a particular requirement.  For example, the
719$E_0$, $E_1$, and $E_2$ bitstreams as computed during parsing of
720decimal character references in Figure \ref{fig:decref}
721are error bitstreams.  One bits mark definite errors and zero bits mark the
722absence of error according to the requirement.   
723Thus the complete absence of errors according to the
724requirements listed may be determined by forming the
725bitwise logical ``or'' of these bitstreams and confirming
that the resulting value is zero. An error-check bitstream is one
727that marks potential errors to be further checked in
728some fashion during post-bitstream processing.   
729An example is the bitstream marking the start positions
730of CDATA sections.   This is a useful information stream
731computed during bitstream processing to identify opening
732\verb:<![: sequences, but also marks positions to
733subsequently check for the complete opening
734delimiter  \verb:<![CDATA[: at each position.
735
736In typical documents, most of these error-check streams will be quite sparse
737or even zero.   Many of the error conditions could
738actually be fully implemented using bitstream techniques,
739but at the cost of a number of additional logical and shift
740operations.   In general, however, the conditions are
741easier and more efficient to check one-at-a-time using
742multibyte comparisons on the original source data stream.
743With very sparse streams, it is very unlikely that
744multiple instances occur within any given block, thus
745eliminating the benefit of parallel evaluation of the logic.
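Such post-bitstream processing only needs the positions of the (rare) 1 bits.
In the unbounded-integer model this enumeration can be sketched as follows;
a block-based C implementation would typically use bit scan instructions
instead.
\begin{verbatim}
def positions(stream):
    # Yield the positions of 1 bits, lowest first, for one-at-a-time
    # checking against the original source data.
    while stream:
        low = stream & -stream       # isolate the lowest 1 bit
        yield low.bit_length() - 1
        stream ^= low                # clear it and continue
\end{verbatim}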
746
747The requirement for name checking merits comment.   XML
748names may use a wide range of Unicode character values.
749It is too expensive to check every instance of an XML name
750against the full range of possible values.   However, it is
751possible and quite inexpensive to use parallel bitstream
752techniques to verify that any ASCII characters within a name
753are indeed legal name start characters or name characters.
754Furthermore, the characters that may legally follow a
755name in XML are confined to the ASCII range.  This makes
756it useful to define a name scan character class to include all the legal ASCII characters
757for names as well as all non-ASCII characters. 
A namecheck character-class bitstream will then be defined to identify non-ASCII
characters found within name scans.   In most documents
760this bitstream will be all $0$s; even in documents with substantial
761internationalized content, the tag and attribute names used
762to define the document schema tend to be confined to the
763ASCII repertoire.   In the case that this bitstream is nonempty,
764the positions of all 1 bits in this bitstream denote characters
765that need to be individually validated.
766
767Attribute names within a single XML start tag or empty
768element tag must be unique.  This requirement could be
769implemented using one of several different approaches. Standard
770approaches include: sequential search, symbol lookup, and Bloom filters
771\cite{DaiNiZhu2010}.
772
773In general, the use of error-check bitstreams is a straightforward,
774convenient and reasonably efficient mechanism for
775checking the well-formedness requirements.
776
777\subsection{Tag Matching}
778
779Except for empty element tags, XML tags come in pairs with
780names that must be matched.   To discharge this requirement,
781we form a bitstream consisting of the disjunction of three
782bitstreams formed during parsing: the bitstream marking the
783positions of start or empty tags (which have a common
784initial structure), the bitstream marking tags that end using
785the empty tag syntax (``\verb:/>:''), and the bitstream
786marking the occurrences of end tags.   In post-bitstream
787processing, we iterate through this computed bitstream
788and match tags using an iterative stack-based approach.
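One way such post-processing might look in the prototyping model is sketched
below; \verb:kind_at: and \verb:name_at: are hypothetical helpers that
classify a marked position and read the tag name from the original source
text, and \verb:positions: is the enumeration helper sketched earlier.
\begin{verbatim}
def tags_match(tag_stream, kind_at, name_at):
    # Iterative stack-based matching over the combined tag-marker stream.
    stack = []
    for p in positions(tag_stream):
        kind = kind_at(p)            # 'start', 'empty_close' or 'end'
        if kind == 'start':
            stack.append(name_at(p))
        elif kind == 'empty_close':
            if stack:
                stack.pop()          # empty element: no end tag expected
        elif not stack or stack.pop() != name_at(p):
            return False             # unmatched or misnamed end tag
    return not stack                 # every start tag must be closed
\end{verbatim}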
789
790\subsection{Document Structure}
791
792An XML document consists of a single root element with recursively
793defined structure together with material in the document
prolog and epilog.  Verifying this top-level structure and
795the structure of the prolog and epilog material is not
796well suited to parallel bitstream techniques, in particular, nor
797to any form of parallelism, in general.  In essence, the
prolog and epilog materials occur once per document instance.
Thus the requirements to check this top-level structure
800for well-formedness are relatively minor, with an overhead
that is quite low for sufficiently large files.
802
803\subsection{Summary}
804
805Overall, parallel bitstream techniques are quite well-suited to
806verification problems such as XML well-formedness checking. 
807Many of the character validation and syntax checking requirements
808can be conveniently and efficiently implemented using error streams.
809Other requirements are also supported by the computation of
error-check streams for simple post-bitstream processing, or of
composite streams over which iterative stack-based procedures
812can be defined for checking recursive syntax.
813
814\section{Compilation to Block-Based Processing} 
815\label{sec:compile}
816While a Python implementation of the techniques described in the previous section works on unbounded bitstreams, a corresponding
817C implementation needs to process an input stream in blocks of size equal to the
818SIMD register width of the processor it runs on.
819So, to convert Python code into C, the key question becomes how
820to transfer information from one block to the next one.
821The answer lies in the use of {\em carry bits}, the collection of carries resulting from bitwise additions.
822 
In fact, in the methods we have outlined, all of the
information flow between blocks for parallel bit stream
825calculations can be modeled using carry bits.   The parallel
826scanning primitive uses only addition and bitwise logic.
827Since the logic operations do not require information flow
across block boundaries, the information flow is entirely
accounted for by the carries.   Carry bits can also be used to
830capture the information flow associated with upshift
831operations, which move information forward one position
832in the file.   In essence, an upshift by one position for
833a bitstream is equivalent to the addition of the stream
834to itself; the bit shifted out in an upshift is in this
case equivalent to the carry generated by the addition.
836The only other information flow requirement in the
837calculation of parallel bit streams occurs with the
838bitstream subtractions that are used to calculate span streams.
839In this case, the information flow is based on borrows
840generated, which can be handled in the same way as carries.
841
842Properly determining, initializing and inserting carry bits
843into a block-by-block implementation of parallel bit stream
844code is a task too tedious for manual implementation.
845We have thus developed compiler technology to automatically
846insert declarations, initializations and carry save/restore
847operations into appropriate locations when translating
848Python operations on unbounded bit streams into the
849equivalent low-level C code implemented on a block-by-block
basis.  Our current compiler toolkit is capable of inserting
851carry logic using a variety of strategies, including both
852simulated carry bit processing with SIMD registers, as
853well as carry-flag processing using the processor general
854purpose registers and ALU.   Details are beyond the
855scope of this paper, but are described in the on-line
source code repository at \url{parabix.costar.sfu.ca}.
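Independently of the SIMD details, the principle can be illustrated by a
block-at-a-time version of the scanning primitive in which each 64-bit
addition returns both its result and a carry that feeds the addition for the
next block.  The following is a sketch in the prototyping model, not the
generated code.
\begin{verbatim}
BLOCK_WIDTH = 64
BLOCK_MASK = (1 << BLOCK_WIDTH) - 1

def scan_block(M, C, carry_in):
    # One 64-bit block of s(M, C); carry_in/carry_out link the blocks.
    total = M + C + carry_in
    carry_out = total >> BLOCK_WIDTH
    return (total & ~C) & BLOCK_MASK, carry_out

def scan_stream(M_blocks, C_blocks):
    # Apply scan_block across a sequence of 64-bit blocks, low block first.
    carry, result = 0, []
    for M, C in zip(M_blocks, C_blocks):
        r, carry = scan_block(M, C, carry)
        result.append(r)
    return result
\end{verbatim}
The same transformation applies to upshifts and subtractions, whose
cross-block information flow is likewise a single carry or borrow bit.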
857
858\section{Performance Results}
859
860In this section, we compare the performance of our \verb:xmlwf:
861implementation using the Parabix2 technology described above with several
862other implementations.
863These include the original \verb:xmlwf:
864distributed as an example application of the \verb:expat: XML
865parser,  implementations based on the widely used Xerces
866open source parser using both SAX and DOM interfaces,
867and an implementation using our prior Parabix 1 technology with
868bit scan operations. 
869
870Table \ref{XMLDocChars} 
871shows the document characteristics of the XML instances selected for this performance study,
872including both document-oriented and data-oriented XML files.
873The jawiki.xml and dewiki.xml XML files are document-oriented XML instances of Wikimedia books, written in Japanese and German, respectively. The remaining files are data-oriented.  The roads.gml file is an instance of Geography Markup Language (GML),
874a modeling language for geographic information systems as well as an open interchange format for geographic transactions on the Internet \cite{GML04}.  The po.xml file is an example of purchase order data, while the soap.xml file contains a large SOAP message.
875Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.
876This metric is reported for each document.
877
878\begin{table*}[tbh]
879\begin{center}
880\begin{tabular}{|c||r|r|r|r|r|}
881\hline
882File Name               & dewiki.xml            & jawiki.xml            & roads.gml     & po.xml        & soap.xml \\ \hline   
883File Type               & document      & document      & data  & data  & data   \\ \hline     
884File Size (kB)          & 66240                 & 7343                  & 11584         & 76450         & 2717 \\ \hline
885Markup Item Count       & 406792                & 74882                 & 280724        & 4634110       & 18004 \\ \hline               
886Attribute Count         & 18808                 & 3529                  & 160416        & 463397        & 30001\\ \hline
887Avg. Attribute Size     & 8                     & 8                     & 6             & 5             & 9\\ \hline
888Markup Density          & 0.07                  & 0.13                  & 0.57          & 0.76          & 0.87  \\ \hline
889
890\end{tabular}
891\end{center}
892 \caption{XML Document Characteristics} 
893 \label{XMLDocChars} 
894\end{table*}
895
896Table \ref{parsers-cpb} shows performance measurements for the
897various \verb:xmlwf: implementations applied to the
898test suite.   Measurements are made on a single core of an
899Intel Core 2 system running a stock 64-bit Ubuntu 10.10 operating system. 
900Measurements are reported in CPU cycles per input byte of
901the XML data files in each case.
902The first row shows the performance of the Xerces C parser
903using the tree-building DOM interface. 
904Note that the performance
905varies considerably depending on markup density.  Note also that
906the DOM tree construction overhead is substantial and unnecessary
907for XML well-formedness checking.  Using the event-based SAX interface
908to Xerces gives much better results as shown in the
909second row.   
910The third row shows the best performance of our byte-at-a-time
parsers, using the original \verb:xmlwf: based on expat.
912
913The remaining rows of Table \ref{parsers-cpb} show performance
of parallel bit stream implementations.  The first of these rows shows
the performance of our Parabix 1 implementation using
916bit scan instructions.   While showing a substantial speed-up
917over the byte-at-a-time parsers in every case, note also that
918the performance advantage increases with increasing markup
919density, as expected.   The last two rows show different versions of
920the \verb:xmlwf: implemented based on the Parabix 2 technology
921as discussed in this paper.   They differ in the carry handling
922strategy, with the ``simd\_add'' row referring to carry
923computations performed with simulated calculation of
924propagated and generated carries using SIMD operations, while the
925``adc'' row refers to an implementation directly employing
926the processor carry flags and add-with-carry instructions on
92764-bit general registers.  In both cases, the overall
928performance is quite impressive, with the increased
929parallelism of parallel bit scans clearly paying off in
930improved performance for dense markup.
931
932
933\begin{table}[thb]
934\begin{center}
935\begin{tabular}{|c|c||c|c|c|c|c|}
936\hline
937Parser Class & Parser & dewiki.xml  & jawiki.xml    & roads.gml  & po.xml & soap.xml  \\ \hline 
938\multirow{3}{*}{Byte-at-a-time} & Xerces (DOM)    &    39.8    &   46.7    &  81.6    &   122.5   &    143.7  \\ \cline{2-7} 
939& Xerces (SAX)   &     24.0   &    30.4     &  40.3    &   54.3    &   64.3     \\ \cline{2-7}
940& expat      &  14.2     &  17.9   &    35.4    &   44.7     &  53.6      \\ \hline 
941\multirow{3}{*}{Parallel Bit Stream} & Parabix1   &    8.313   &    9.335   &     13.345   &    16.136   &      19.047 \\ \cline{2-7}
942& gcc (simd\_add)    &  6.174   &       6.405   &       7.948   &       8.565  &        9.172 \\ \cline{2-7} 
943& llvm (adc64)       &  5.757   &       6.142   &       6.763   &       7.424   &       7.952 \\ \hline
944 \end{tabular}
945\end{center}
946 \caption{Parser Performance (Cycles Per Byte)} 
947\label{parsers-cpb} 
948\end{table}
949 
950%gcc (simd\_add)    &   6.174   &       6.405   &       7.948   &       8.565   &       9.172 \\ \hline
951%llvm (simd\_add)   &   6.104   &       6.335   &       8.332   &       8.849   &       9.811 \\ \hline
952%gcc (adc64)        &   9.23   &        9.921   &       10.394   &      10.705   &      11.751 \\ \hline
953%llvm (adc64)       &   5.757   &       6.142   &       6.763   &       7.424   &       7.952 \\ \hline
954%gcc (SAHFLAHF)    &    7.951   &       8.539   &       9.984   &       10.219   &      11.388 \\ \hline
955%llvm(SAHFLAHF)    &    5.61   &        6.02   &        6.901   &       7.597   &       8.183 \\ \hline
956 
957
958
959
960\section{Conclusion}
961
962In application to the problem of XML parsing and well-formedness
963checking, the method of parallel parsing with bitstream addition
964is effective and efficient.   Using only bitstream addition
965and bitwise logic, it is possible to handle all of the
966character validation, lexical recognition and parsing problems
967except for the recursive aspects of start and end tag matching.
968Error checking is elegantly supported through the use of error
969streams that eliminate separate if-statements to check for
970errors with each byte.   The techniques are generally very
efficient, particularly when markup density is high.   However, for some
972conditions that occur rarely and/or require complex combinations
973of upshifting and logic, it may be better to define simpler
974error-check streams that require limited postprocessing using
975byte matching techniques.
976
977The techniques have been implemented and assessed for present-day commodity processors employing current SIMD technology.
As processors advance with improved instruction sets and increases
in the width of SIMD registers, the relative advantages of the
techniques over traditional byte-at-a-time sequential
parsing methods are likely to increase substantially.
Instruction set modifications that provide more convenient
carry propagation for long bitstream arithmetic would be of
particular benefit to this method.
985
986A significant challenge to the application of these techniques
987is the difficulty of programming.   The method of prototyping
988on unbounded bitstreams has proven to be of significant value
989in our work.   Using the prototyping language as input to
990a bitstream compiler has also proven effective in generating
991high-performance code.   Nevertheless, direct programming
992with bitstreams is still a specialized skill; our future
993research includes developing yet higher level tools to
994generate efficient bitstream implementations from grammars,
995regular expressions and other text processing formalisms.
996
997
998\bibliographystyle{plain}
999\bibliography{xmlperf}
1000
1001
1002\end{document}
1003