source: docs/EuroPar2011/europar-cameron.tex @ 1186

Last change on this file since 1186 was 1186, checked in by cameron, 8 years ago

Notes on parallel bitstream-based length sorting.

File size: 44.1 KB
%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
%       Springer Heidelberg 2006/05/04
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
% NB: the document class 'llncs' has its own and detailed documentation, see
\urldef{\mailsa}\path|{alfred.hofmann, ursula.barth, ingrid.haas, frank.holzwarth,|
\urldef{\mailsb}\path|anna.kramer, leonie.kunz, christine.reiss, nicole.sator,|
\urldef{\mailsc}\path|erika.siebert-cole, peter.strasser, lncs}|
\mainmatter  % start of an individual contribution
% first the title is needed
\title{Parallel Scanning with Bitstream Addition: An XML Case Study}
% a short form should be given in case it is too long for the running head
\titlerunning{Parallel Scanning with Bitstream Addition}
% the name(s) of the author(s) follow(s) next
% NB: Chinese authors should write their first names(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
\author{Robert D. Cameron
\and Ehsan Amiri \and Kenneth S. Herdy \and Dan Lin \and Thomas C. Shermer \and Fred P. Popowich}
\authorrunning{Cameron, Amiri, Herdy, Lin, Shermer and Popowich}
% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
\institute{Simon Fraser University, Surrey, BC, Canada\\
\email{\{cameron, eamiri, ksherdy, lindanl, shermer, popowich\}}
A parallel scanning method using the concept of bitstream addition is
introduced and studied in application to the problem of XML
parsing and well-formedness checking.
% The method parallelizes
% finite-state transitions, using carry propagation to achieve up
% to $W$ transitions with each $W$-bit binary addition operation.
On processors supporting $W$-bit addition operations,
the method can perform up to $W$ finite state transitions per instruction.
The method is based on the concept of parallel bitstream technology,
in which parallel streams of bits are formed such that each stream
comprises bits in one-to-one correspondence with the character
code units of a source data stream.    Parsing routines are initially
prototyped in Python using its native support for unbounded
integers to represent arbitrary-length bitstreams.  A compiler
then translates the Python code into low-level C-based implementations.
These low-level implementations take advantage of
the SIMD (single-instruction multiple-data) capabilities of commodity
processors to yield a dramatic speed-up over
traditional alternatives employing byte-at-a-time parsing.
%\keywords{SIMD text processing, parallel bitstream technology, XML, parsing}
\keywords{SIMD text processing, parallel bitstreams, XML, parsing}
Although the finite state machine methods used
in the scanning and parsing of text streams are considered to be the
hardest of the ``13 dwarves'' to parallelize \cite{Asanovic:EECS-2006-183},
parallel bitstream technology shows considerable promise for these
%types of applications\cite{PPoPP08,CameronHerdyLin2008,Green2009}.
types of applications \cite{PPoPP08,CameronHerdyLin2008}.
In this approach, character streams are processed $N$ positions at
a time using the $N$-bit SIMD registers commonly found on commodity
processors (e.g., 128-bit XMM registers on Intel/AMD chips).
This is achieved by first slicing the byte streams into eight separate
basis bitstreams, one for each bit position within the byte.
These basis bitstreams are then combined with bitwise logic and
shifting operations to compute further parallel bitstreams of
interest, such as the \verb:[<]: bitstream marking the positions
of all opening angle brackets in an XML document.

Using these techniques as well as the {\em bit scan}
instructions also available on commodity processors, the
Parabix 1 XML parser was shown to considerably accelerate XML
% parsing in comparison with conventional byte-at-a-time parser
parsing in comparison with conventional byte-at-a-time parsers
in applications such as statistics gathering \cite{CameronHerdyLin2008} and
GML-to-SVG conversion \cite{Herdy2008}.
Other efforts to accelerate XML parsing include the use of custom
XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, careful coding and schema-based processing \cite{XMLScreamer}, and
multithread/multicore speedups based on data parallelism \cite{Shah2009,ZhangPanChiu09}.
In this paper, we further increase the parallelism in our methods
by introducing a new parallel scanning primitive using bitstream
addition.   In essence, this primitive replaces the sequential bit
scan operations underlying Parabix 1 with a new approach that
independently advances multiple marker bits in parallel using
simple addition and logic operations.   This paper documents the
technique and evaluates it in application to the problem of XML
parsing and well-formedness checking.

Section 2 reviews the basics of parallel bitstream technology
and introduces our new parallel scanning primitive.  Section 3
goes on to show how this primitive may be used in XML scanning
and parsing, while Section 4 discusses the construction of a
complete XML well-formedness checker based on these techniques.
Section 5 then briefly describes the compiler technology used to
generate the low level code for our approach.  A performance
study in Section 6 shows that the new Parabix 2 parser is
dramatically faster than traditional byte-at-a-time parsers
as well as the original Parabix 1 parser, particularly for
dense XML markup.  Section 7 concludes the paper.
\section{The Parallel Bitstream Method}\label{sec:parabit}

source data $\vartriangleleft$ & \verb`----173942---654----1----49731----321--`\\
$B_7$ & \verb`.......................................`\\
$B_6$ & \verb`.......................................`\\
$B_5$ & \verb`111111111111111111111111111111111111111`\\
$B_4$ & \verb`....111111...111....1....11111....111..`\\
$B_3$ & \verb`1111...1..111...1111.1111.1...1111...11`\\
$B_2$ & \verb`1111.1..1.1111111111.11111.1..1111...11`\\
$B_1$ & \verb`.....11..1...1.............11.....11...`\\
$B_0$ & \verb`11111111..111.1.111111111.111111111.111`\\
\verb:[0-9]: & \verb`....111111...111....1....11111....111..`\\
\caption{Basis and Character-Class Bitstreams}
A bitstream is simply a sequence of $0$s and $1$s, where there is one such bit in the bitstream for each character in a source data stream.
For parsing, and other text processing tasks, we need to consider multiple properties of characters at different stages during the parsing process. A bitstream can be associated with each of these properties, and hence there will be multiple (parallel) bitstreams associated with a source data stream of characters.

The starting point for bitstream methods is the formation of \emph{basis} bitstreams
and their use in determining \emph{character-class} bitstreams.
The $k$th basis bitstream $B_k$ consists of the $k$th bit (0-based, starting at the least significant bit)
of each character in the source data stream;
thus each $B_k$ is dependent on the encoding of the source characters (ASCII, UTF-8, UTF-16, etc.).
Given these basis bitstreams, it is then possible to combine them
using bitwise logic in order to compute character-class
bitstreams, that is, streams that identify the positions at which characters belonging
to a particular class occur.  For example, the character-class bitstream
$D=$\verb:[0-9]: marks with $1$s the positions at which decimal digits
occur.    These bitstreams are illustrated in Figure \ref{fig:inputstreams},
for an example source data stream consisting of digits and hyphens.
This figure also illustrates some of our conventions for figures:  the left triangle $\vartriangleleft$ after
``source data'' indicates that all streams are read from right to left
(i.e., they are in little-endian notation).  We also use hyphens
in the input stream to represent any character that is not relevant to a character
class under consideration, so that relevant characters stand out.
Furthermore, the $0$ bits in the bitstreams are represented by periods,
so that the $1$ bits stand out.
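Transposition and character-class formation can be sketched directly on unbounded integers, in the spirit of our Python prototypes; the helper names and the particular logic chosen for \verb:[0-9]: below are illustrative, not the optimized SIMD formulation.

```python
def basis_bitstreams(data: bytes):
    """Return [B_0, ..., B_7], where B_k collects bit k of every code unit.

    Bit i of each stream corresponds to character i of the source
    (little-endian, as in the figures).
    """
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (1 << k):
                basis[k] |= 1 << i
    return basis

def digit_class(basis):
    """Compute the [0-9] stream from the basis streams by bitwise logic.

    ASCII digits are 0x30-0x39: bits 7 and 6 clear, bits 5 and 4 set,
    and a low nibble below ten, i.e. not (bit 3 and (bit 2 or bit 1)).
    """
    b = basis
    x3 = ~b[7] & ~b[6] & b[5] & b[4]        # bytes of the form 0x3?
    return x3 & ~(b[3] & (b[2] | b[1]))     # exclude 0x3A-0x3F

data = b"----173942---654----"
D = digit_class(basis_bitstreams(data))
assert D == sum(1 << i for i, ch in enumerate(data) if chr(ch).isdigit())
```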
Transposition of source data to basis bitstreams and calculation
of character-class streams in this way is, in general, an overhead on parallel
bitstream applications.   However, using the SIMD
capabilities of current commodity processors, these operations are
fast, with an amortized overhead of about 1 CPU cycle per byte for
transposition and less than 1 CPU cycle per byte for all the character
classes needed for XML parsing \cite{CameronHerdyLin2008}.
%Improved instruction sets using parallel extract operations or
%inductive doubling techniques may further reduce this overhead significantly \cite{CameronLin2009,HilewitzLee2006}.

Beyond the bitwise logic needed for character-class determination,
we also need \emph{upshifting} to deal with sequential combination.
The upshift $n(S)$ of a bitstream $S$ is obtained by shifting the bits in $S$ one position forward,
then placing a $0$ bit in the starting position of the bitstream; $n$ is meant to be mnemonic of ``next''.
In $n(S)$, the last bit of $S$ may be eliminated or retained for error-testing purposes.
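On unbounded integers, as in our Python prototypes, the upshift is simply a left shift by one; this sketch uses the little-endian convention above (bit $i$ corresponds to character $i$).

```python
def n(S: int) -> int:
    """Upshift n(S): move every bit of S one position forward.

    Under the little-endian convention, 'forward' is toward higher
    bit positions, and position 0 automatically receives a 0 bit.
    On unbounded integers no bit is ever shifted out; a bounded
    (block-based) implementation must decide whether to drop or
    retain the outgoing bit, as noted above.
    """
    return S << 1

M0 = 0b00101              # markers at positions 0 and 2
assert n(M0) == 0b01010   # markers advanced to positions 1 and 3
```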
\subsection{A Parallel Scanning Primitive}

In this section, we introduce the principal new feature of the paper,
a parallel scanning method based on bitstream addition.   Key to this
method is the concept of {\em marker} bitstreams.
Marker bitstreams are used to represent positions of interest in the
scanning or parsing of a source data stream.
The appearance of a 1 at a position in a marker bitstream could, for example, denote
the starting position of an XML tag in the data stream.   In general, the set of
bit positions in a marker bitstream may be considered to be the current
parsing positions of multiple parses taking place in parallel throughout
the source data stream.

Figure \ref{fig:scan1} illustrates the basic concept
underlying parallel parsing with bitstream addition.
All streams are shown in little-endian
representation, reading from right to left.
The first row shows a source data stream that includes several
spans of digits, together with other nondigit characters shown
as hyphens.  The second row specifies the parsing problem
using a marker bitstream $M_0$ to mark four initial marker
positions.  In three instances, these markers are at
the beginning (i.e., little end) of a span, while one is in
the middle of a span.
The parallel parsing task is to move each
of the four markers forward (to the left) through the corresponding spans of
digits to the immediately following positions.
source data $\vartriangleleft$ & \verb`----173942---654----1----49731----321--`\\
$M_0$ & \verb`.........1.....1....1......1...........`\\
$D = $\verb:[0-9]: & \verb`....111111...111....1....11111....111..`\\
$M_0 + D$ & \verb`...1........1......1....1...11....111..`\\
$M_1 = (M_0 + D) \wedge \neg D$ & \verb`...1........1......1....1..............`
\caption{Parallel Scan Using Bitstream Addition and Mask}
The third row of Figure \ref{fig:scan1}
shows the derived character-class bitstream $D$ identifying
the positions of all digits in the source stream.
The fourth row then illustrates the key concept: marker movement
is achieved by binary addition of the marker and character-class
bitstreams.  As a marker 1 bit is combined using binary addition with
a span of 1s, each 1 in the span becomes 0, generating
a carry to add to the next position to the left.
For each such span, the process terminates at the left end
of the span, generating a 1 bit in the immediately
following position.   These generated 1 bits represent
the moved marker bits.   However, the result of the
addition also produces some additional bits that are
not involved in the scan operation.
These are easily removed as shown in the fifth row,
by applying bitwise logic to mask
off any bits from the digit bitstream; these can never
be marker positions resulting from a scan.
The addition and masking technique allows matching of
the regular expression \verb:[0-9]*: for any reasonable
(conflict-free) set of initial markers specified in $M_0$.
A conflict occurs when a span from one marker would run
into another marker position.   However, such conflicts
do not occur with the normal methods of marker bitstream
formation, in which unique syntactic features of
the input stream are used to specify the initial marker
positions.
In the remainder of this paper, the notation $s(M, C)$
denotes the operation to scan
from an initial set of marker positions $M$ through
the spans of characters belonging to a character class $C$ found at each position:
\[s(M, C) = (M + C)  \wedge \neg C\]
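In Python-prototype form, $s(M, C)$ is a one-liner over unbounded integers. The example below recreates the situation of Figure \ref{fig:scan1}, but indexes bits by string position (bit $i$ is the $i$th character) rather than by the figure's printed right-to-left layout; the marker positions are chosen to match the four spans.

```python
def s(M: int, C: int) -> int:
    """Parallel scan: advance each marker in M through its span in C."""
    return (M + C) & ~C

# Source data as in Figure [scan1]; bit i corresponds to character i.
src = "----173942---654----1----49731----321--"
D = sum(1 << i for i, ch in enumerate(src) if ch.isdigit())

# Three markers at span starts (positions 4, 13, 20), one mid-span (27).
M0 = (1 << 4) | (1 << 13) | (1 << 20) | (1 << 27)
M1 = s(M0, D)
# Each marker lands on the nondigit just past its span.
assert M1 == (1 << 10) | (1 << 16) | (1 << 21) | (1 << 30)
```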
\section{XML Scanning and Parsing}

We now consider how the parallel scanning primitive can
be applied to the following problems in scanning and
parsing of XML structures:  (1) parallel scanning of XML decimal character references,
and (2) parallel parsing of XML start tags.
The grammar of these structures is shown in Figure \ref{fig:xmlgrmr}.

DecRef    & ::=   &      '\verb:&#:' Digit${}^{+}$ '\verb:;:'  \\
Digit     & ::=   &      \verb:[0-9]:\\
STag      & ::=   &      '\verb:<:' Name (W  Attribute)* W${}^{?}$ '\verb:>:'  \\
Attribute & ::=   &      Name W${}^{?}$ '=' W${}^{?}$ AttValue \\
AttValue  & ::=   &      ( `\verb:":' \verb:[^<"]*: `\verb:":' ) $|$ ( ``\verb:':'' \verb:[^<']*: ``\verb:':'' ) \\
W         & ::=   &      (\verb:\x20: $|$ \verb:\x9: $|$ \verb:\xD: $|$ \verb:\xA:)${}^{+}$ \\
%DQuoted & ::= & \verb:[^<"]*:  \\
%SQuoted & ::= & \verb:[^<']*:
\caption{XML Grammar: Decimal Character References and Start Tags}
\multicolumn{2}{l}{source data $\vartriangleright$}
                                         & \verb`-&#978;-&9;--&#;--&#13!-`\\
$M_0$ &                                  & \verb`.1......1....1....1.....`\\
$M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
$E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
$M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
$E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
$M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
$E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
$M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
$E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
\caption{Parsing Decimal References}
Figure \ref{fig:decref} shows the parallel parsing of
decimal references together with error checking.
For clarity, the streams are now shown in left-to-right
order as indicated by the $\vartriangleright$ symbol.
The source data includes four instances of potential
decimal references beginning with the \verb:&: character.
Of these, only the first one is legal according to
the decimal reference syntax; the other three instances
are in error.   These references may be parsed in
parallel as follows.  The
starting marker bitstream $M_0$ is formed from the \verb:[&]:
character-class bitstream as shown in the second row.  The next row shows the
result of the marker advance operation $n(M_0)$ to
produce the new marker bitstream $M_1$.  At this point,
the grammar requires a hash mark, so the first error bitstream $E_0$ is
formed using a bitwise ``and'' operation combined with negation,
to indicate violations of this condition.
Marker bitstream $M_2$ is then defined as those positions
immediately following any $M_1$ positions not in error.
In the following row, the condition that at least
one digit is required is checked to produce error bitstream $E_1$.
A parallel scan operation is then applied through the
digit sequences as shown in the next row to produce
marker bitstream $M_3$.  The final error bitstream $E_2$ is
produced to identify any references without a
closing semicolon.
In the penultimate row, the final marker bitstream $M_4$ marks the
positions of all fully checked decimal references, while the
last row defines a unified error bitstream $E$
indicating the positions of all detected errors.
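The whole derivation of Figure \ref{fig:decref} can be executed as straight-line Python over unbounded integers, with bit $i$ corresponding to the $i$th character just as in the figure's left-to-right display; the character-class helper \texttt{cc} is our own shorthand.

```python
def n(S): return S << 1                      # upshift
def s(M, C): return (M + C) & ~C             # parallel scan
def cc(src, chars):                          # character-class stream
    return sum(1 << i for i, ch in enumerate(src) if ch in chars)

src = "-&#978;-&9;--&#;--&#13!-"             # source data of Figure [decref]
D = cc(src, "0123456789")

M0 = cc(src, "&")            # every '&' opens a potential reference
M1 = n(M0)                   # advance past '&'
E0 = M1 & ~cc(src, "#")      # error: '#' required here
M2 = n(M1 & ~E0)             # advance past '#'
E1 = M2 & ~D                 # error: at least one digit required
M3 = s(M2 & ~E1, D)          # scan through the digit run
E2 = M3 & ~cc(src, ";")      # error: closing ';' required
M4 = M3 & ~E2                # fully checked decimal references
E  = E0 | E1 | E2            # all detected errors

assert M4 == 1 << 6                            # only "&#978;" is legal
assert E == (1 << 9) | (1 << 15) | (1 << 22)   # the three bad references
```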
Initialization of marker streams may be achieved in various ways,
depending on the task at hand.
In the XML parsing context,
we rely on an important property of well-formed
XML: after an initial filtering pass to identify
XML comments, processing instructions and CDATA
sections, every remaining \verb:<: in the
file must be the initial character of a start,
end or empty element tag, and every remaining \verb:&:
must be the initial character of a general entity
or character reference. These assumptions permit easy creation of
marker bitstreams for XML tags and XML references.
The parsing of XML start tags is a richer problem, involving
the sequential structure of attribute-value pairs as shown in Figure \ref{fig:xmlgrmr}.
Using the bitstream addition technique, our method
is to start with the opening angle bracket of all tags as
the initial marker bitstream for parsing the tags in parallel,
advance through the element name, and then use an iterative
process to move through attribute-value pairs.
Figure \ref{fig:stag-ex}
illustrates the parallel parsing of three XML start tags.
For simplicity, the figure omits determination
of error bitstreams, processing of single-quoted attribute values, and handling
of empty element tags.  In this figure, the first
four rows show the source data and three character-class bitstreams:
$N$ for characters permitted in XML names,
$W$ for whitespace characters,
and $Q$ for characters permitted within a double-quoted attribute value string.
source data $\vartriangleright$ & \verb`--<e a= "137">---<el2 a="17" a2="3379">---<x>--`\\
$N = $ name chars & \verb`11.1.1...111..111.111.1..11..11..1111..111.1.11`\\
$W = $ white space & \verb`....1..1.............1......1..................`\\
$Q = \neg$\verb:["<]: & \verb`11.11111.111.1111.111111.11.1111.1111.1111.1111`\\
$M_0$ & \verb`..1..............1........................1....`\\
$M_1 = n(M_0)$ & \verb`...1..............1........................1...`\\
$M_{0,7} = s(M_1, N)$ & \verb`....1................1......................1..`\\
$M_{0,8} = s(M_{0,7}, W) \wedge \neg$\verb:[>]: & \verb`.....1................1........................`\\
$M_{1,1} = s(M_{0,8}, N)$ & \verb`......1................1.......................`\\
$M_{1,2} = s(M_{1,1}, W) \wedge$\verb:[=]: & \verb`......1................1.......................`\\
$M_{1,3} = n(M_{1,2})$ & \verb`.......1................1......................`\\
$M_{1,4} = s(M_{1,3}, W) \wedge$\verb:["]: & \verb`........1...............1......................`\\
$M_{1,5} = n(M_{1,4})$ & \verb`.........1...............1.....................`\\
$M_{1,6} = s(M_{1,5}, Q) \wedge$\verb:["]: & \verb`............1..............1...................`\\
$M_{1,7} = n(M_{1,6})$ & \verb`.............1..............1..................`\\
$M_{1,8} = s(M_{1,7}, W) \wedge \neg$\verb:[>]: & \verb`.............................1.................`\\
$M_{2,1} = s(M_{1,8}, N)$ & \verb`...............................1...............`\\
$M_{2,2} = s(M_{2,1}, W) \wedge$\verb:[=]: & \verb`...............................1...............`\\
$M_{2,3} = n(M_{2,2})$ & \verb`................................1..............`\\
$M_{2,4} = s(M_{2,3}, W) \wedge$\verb:["]: & \verb`................................1..............`\\
$M_{2,5} = n(M_{2,4})$ & \verb`.................................1.............`\\
$M_{2,6} = s(M_{2,5}, Q) \wedge$\verb:["]: & \verb`.....................................1.........`\\
$M_{2,7} = n(M_{2,6})$ & \verb`......................................1........`\\
$M_{2,8} = s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb`...............................................`
\caption{Start Tag Parsing}

The parsing process is illustrated in the remaining rows of the
figure.    Each successive row shows the set of parsing markers as they
advance in parallel using bitwise logic and addition.
Overall, the sets of marker transitions can be divided into three groups.

The first group,
$M_0$ through $M_{0,8}$, shows the initiation of parsing for each of the
tags through the
opening angle brackets and the element names, up to the first
attribute name, if present.  Note that there are no attribute names
in the final tag shown, so the corresponding marker becomes zeroed
out at the closing angle bracket.
Since $M_{0,8}$ is not all $0$s, the parsing continues.
The second group of marker transitions,
$M_{1,1}$ through $M_{1,8}$, deals with the parallel parsing of the first attribute-value
pair of the remaining tags.
After these operations, there are no more attributes
in the first tag, so its corresponding marker becomes zeroed out.
However, $M_{1,8}$ is not all $0$s, as the second tag still has an unparsed attribute-value pair.
Thus, the parsing continues.

The third group of marker transitions, $M_{2,1}$ through $M_{2,8}$, deals with the parsing of
the second attribute-value pair of this tag.  The
final transition to $M_{2,8}$ shows the zeroing out of all remaining markers
once two iterations of attribute-value processing have taken place.
Since $M_{2,8}$ is all $0$s, start tag parsing stops.
The implementation of start tag processing uses a while loop that
terminates when the set of active markers becomes zero,
i.e.\  when some $M_{k, 8} = 0$.
Considered as an iteration over unbounded bitstreams, all start tags in the document
are processed in parallel, using a number of iterations equal to the maximum
number of attribute-value pairs in any one tag in the document.
However, in block-by-block processing, the cost of iteration is considerably reduced; the iteration for
each block only requires as many steps as there are attribute-value pairs
overlapping the block.
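A sketch of this loop in the Python-prototype style follows, using the same transitions as Figure \ref{fig:stag-ex}. As in the figure, error streams, single-quoted values and empty-element tags are omitted; the character-class predicates and helper names are our own simplifications.

```python
def n(S): return S << 1                       # upshift
def s(M, C): return (M + C) & ~C              # parallel scan
def cc(src, pred):                            # character-class stream
    return sum(1 << i for i, ch in enumerate(src) if pred(ch))

src = '--<e a= "137">---<el2 a="17" a2="3379">---<x>--'
N  = cc(src, lambda c: c.isalnum() or c in "-_:.")   # name characters
W  = cc(src, lambda c: c in " \t\r\n")               # whitespace
Q  = cc(src, lambda c: c not in '"<')                # quoted-value chars
Eq = cc(src, lambda c: c == "=")
Dq = cc(src, lambda c: c == '"')
Gt = cc(src, lambda c: c == ">")

M0 = cc(src, lambda c: c == "<")        # all tag-opening brackets
M = s(s(n(M0), N), W) & ~Gt             # past element names; zeroed at '>'
pairs = 0
while M:                                # one iteration per attr-value pair
    M = s(M, N)                         # scan the attribute name
    M = s(M, W) & Eq                    # optional space, then '='
    M = s(n(M), W) & Dq                 # optional space, then opening '"'
    M = s(n(M), Q) & Dq                 # skip the value; closing '"'
    M = s(n(M), W) & ~Gt                # done at '>'; else next attribute
    pairs += 1
assert pairs == 2    # deepest tag has two attribute-value pairs
```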
%\subsection{Name Scans}
%To illustrate the scanning of the name found in an XML start tag,
%let us consider a sequence that might be found in an HTML file,
%\verb:<div id="myid">:,
%which is shown as the source data stream in Figure \ref{fig:stag-scan}.
%source data & \verb:<div id="myid">:\\
%$M_0$ & \verb`1..............`\\
%$C_0$ & \verb`.111.11...1111.`\\
%$M_1 = n(M_0)$ & \verb`.1.............`\\
%$M_2 = s(M_1, D_0) \wedge \neg C_0$ & \verb`....1.........`\\
%lastline & \verb`...............`
%\caption{Scanning Names}
%If we set our initial marker bitstream according to the procedure outlined in our discussion of marker bitstream initialization, we %obtain the bitstream $M_0$.
%According to the grammar in Figure \ref{fig:stag-grmr}, we can then look for a \verb:Name: in an \verb:STag: after we have found a %\verb:<:.
%So, $M_1$ is the marker bitstream for the starting position of our name.
%Although we do not know the length of the name, the $C_0$ bit vector can easily be set to $1$ for the characters that can be contained %in a name.
%We can then use the scan function in a manner similar to how it was used in Figure \ref{fig:scan2} to scan through the entire name to %identify its end position.
Following the pattern shown here, the remaining syntactic
features of XML markup can similarly be parsed with
bitstream-based methods.   One complication is that the
parsing of comments,
CDATA sections and processing instructions must be
performed first to determine those regions of text
within which ordinary XML markup is not parsed (i.e.,
within each of these types of construct).   This is handled
by first parsing these structures and
then forming a {\em mask bitstream}, that is, a stream that
identifies spans of text to be excluded from parsing
(comment and CDATA interiors, parameter text to processing instructions).
\section{XML Well-Formedness}

In this section, we consider the full application of the parsing techniques
of the previous section to the problem of XML well-formedness checking \cite{TR:XML}.
% This application is useful as a well-defined and commercially significant
% example to assess the overall applicability of parallel bitstream techniques.
% To what extent can the well-formedness requirements of XML be
% completely discharged using parallel bitstream techniques?
% Are those techniques worthwhile in every instance, or
% do better alternatives exist for certain requirements?
% For those requirements that cannot be fully implemented
% using parallel bitstream technology alone, what
% preprocessing support can be offered by parallel bitstream
% technology to the discharge of these requirements in other ways?
% We address each of these questions in this section,
% and look not only at the question of well-formedness, but also at
We look not only at the question of well-formedness, but also at
the identification of error positions in documents that
are not well-formed.
%\subsection{Error and Error-Check Bitstreams}
Most of the requirements of XML well-formedness checking
can be implemented using two particular types of computed
bitstream: {\em error bitstreams}, introduced in the previous section, and {\em error-check bitstreams}.
Recall that an error bitstream is a stream marking the location of definite errors in accordance with
a particular requirement.  For example, the
$E_0$, $E_1$, and $E_2$ bitstreams as computed during parsing of
decimal character references in Figure \ref{fig:decref}
are error bitstreams.  One bits mark definite errors and zero bits mark the
absence of an error.
% absence of error according to the requirement.
Thus the complete absence of errors according to the
requirements listed may be determined by forming the
bitwise logical ``or'' of these bitstreams and confirming
that the resulting value is zero. An error-check bitstream is one
that marks potential errors to be further checked in
some fashion during post-bitstream processing.
An example is the bitstream marking the start positions
of CDATA sections.   This is a useful information stream
computed during bitstream processing to identify opening
\verb:<![: sequences, but it also marks positions that must
subsequently be checked for the complete opening
delimiter \verb:<![CDATA[:.
In typical documents, most of these error-check streams will be quite sparse
% or even zero.   Many of the error conditions could
or even zero.   Many error conditions could
actually be fully implemented using bitstream techniques,
but at the cost of a number of additional logical and shift
operations.   In general, the conditions are
easier and more efficient to check one at a time using
multibyte comparisons on the original source data stream.
With very sparse streams, it is very unlikely that
multiple instances occur within any given block, thus
eliminating the benefit of parallel evaluation of the logic.
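As an illustration of this post-bitstream checking, the sketch below (our own Python, not the production code) walks a sparse error-check stream for CDATA openers, isolating one bit at a time in the manner of a bit scan and applying a multibyte comparison at each flagged position.

```python
def bit_positions(stream: int):
    """Yield the positions of the 1 bits, lowest first (cf. bit scan)."""
    while stream:
        low = stream & -stream          # isolate the lowest 1 bit
        yield low.bit_length() - 1
        stream ^= low                   # clear it

src = b"text <![CDATA[raw]]> more <![bad"

# Error-check stream: every "<![" opener found during bitstream processing.
check = sum(1 << i for i in range(len(src) - 2) if src[i:i+3] == b"<![")

# Post-bitstream pass: each flagged position must begin a full "<![CDATA[".
errors = [p for p in bit_positions(check) if src[p:p+9] != b"<![CDATA["]
assert errors == [26]   # the second "<![" is not a CDATA opener
```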
The requirement for name checking merits comment.   XML
names may use a wide range of Unicode character values.
It is too expensive to check every instance of an XML name
against the full range of possible values.   However, it is
possible and inexpensive to use parallel bitstream
techniques to verify that any ASCII characters within a name
are indeed legal name start characters or name characters.
Furthermore, the characters that may legally follow a
name in XML are confined to the ASCII range.  This makes
it useful to define a name scan character class to include all the legal ASCII characters
for names as well as all non-ASCII characters.
A namecheck character-class bitstream is then defined to identify non-ASCII
characters found within name scans.   In most documents
this bitstream will be all $0$s; even in documents with substantial
internationalized content, the tag and attribute names used
to define the document schema tend to be confined to the
ASCII repertoire.   In the case that this bitstream is nonempty,
the positions of all 1 bits in it denote characters
that need to be individually validated.
Attribute names within a single XML start tag or empty
element tag must be unique.  This requirement could be
implemented using one of several different approaches.  Standard
approaches include sequential search, symbol lookup, and Bloom filters.
% In general, the use of error-check bitstreams is a straightforward,
% convenient and reasonably efficient mechanism for
% checking the well-formedness requirements.
631Except for empty element tags, XML tags come in pairs with
632names that must be matched.   To discharge this requirement,
633we form a bitstream consisting of the disjunction of three
634bitstreams formed during parsing: the bitstream marking the
635positions of start or empty tags (which have a common
636initial structure), the bitstream marking tags that end using
637the empty tag syntax (``\verb:/>:''), and the bitstream
638marking the occurrences of end tags.   In post-bitstream
639processing, we iterate through this computed bitstream
640and match tags using an iterative stack-based approach.
%\subsection{Document Structure}
An XML document consists of a single root element within
which all others are contained; this constraint is also
checked during post-bitstream processing.   In addition,
we define the necessary ``miscellaneous'' bitstreams
for checking the prolog and epilog material before
and after the root element.
Overall, parallel bitstream techniques are well-suited to
verification problems such as XML well-formedness checking.
Many of the character validation and syntax checking requirements
can be conveniently and efficiently implemented using error streams.
Other requirements are supported by computing
error-check streams for simple post-bitstream processing, or
composite streams over which iterative stack-based procedures
can be defined for checking recursive syntax.  To assess
the completeness of our analysis, we have confirmed that
our implementations correctly handle all the well-formedness
checks of the W3C XML Conformance Test Suite.
\section{Compilation to Block-Based Processing}
While our Python implementation of the techniques described in the previous section works on unbounded bitstreams, a corresponding
C implementation needs to process an input stream in blocks of size equal to the
SIMD register width of the processor it runs on.
So, to convert Python code into C, the key question becomes how
to transfer information from one block to the next.
The answer lies in the use of {\em carry bits}.
The parallel scanning primitive uses only addition and bitwise logic.
The logic operations do not require information flow
across block boundaries, so the information flow is entirely
accounted for by the carry bits for addition.   Carry bits also
capture the information flow associated with upshift
operations, which move information forward one position
in the file.   In essence, an upshift by one position for
a bitstream is equivalent to the addition of the stream
to itself; the bit shifted out in an upshift is in this
case equivalent to the carry generated by the addition.
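The block-boundary bookkeeping can be sketched in a few lines, modeling 64-bit register blocks as Python integers (the names are illustrative, not the compiler's output):

```python
# Block-by-block carry handling: the carry out of each block becomes the
# carry in to the next, both for addition and for single-bit upshift.

MASK64 = (1 << 64) - 1

def add_with_carry(a, b, carry_in):
    """Return (64-bit block sum, carry out) for one block."""
    total = a + b + carry_in
    return total & MASK64, total >> 64

def upshift_with_carry(a, carry_in):
    """Shift a block up one position; the evicted high bit is the carry."""
    return ((a << 1) | carry_in) & MASK64, a >> 63
```

As the text observes, an upshift is equivalent to adding the stream to itself, so \verb:upshift_with_carry(a, c): and \verb:add_with_carry(a, a, c): produce identical results for any carry bit \verb:c:.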
Properly determining, initializing and inserting carry bits
into a block-by-block implementation of parallel bitstream
code is a task too tedious for manual implementation.
We have thus developed compiler technology to automatically
insert declarations, initializations and carry save/restore
operations into appropriate locations when translating
Python operations on unbounded bitstreams into the
equivalent low-level C code implemented on a block-by-block
basis.  Our current compiler toolkit is capable of inserting
carry logic using a variety of strategies, including both
simulated carry-bit processing with SIMD registers and
carry-flag processing using the processor's general-purpose
registers and ALU.   Details are beyond the
scope of this paper, but are described in the on-line
source code repository at
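To illustrate the shape of the generated code, the following sketch threads a single carry variable through a per-block loop for an addition-based scan of the form $(M + C) \wedge \neg C$; this is an assumption-laden Python model, whereas the actual compiler output is block-at-a-time C code with one saved carry per addition.

```python
# Illustrative shape of compiler-generated block processing: the
# unbounded-bitstream scan (markers + charclass) & ~charclass becomes a
# loop over 64-bit blocks threading one carry variable.

MASK64 = (1 << 64) - 1

def scan_blocks(marker_blocks, class_blocks):
    carry = 0                        # declaration and initialization
    out = []
    for m, c in zip(marker_blocks, class_blocks):
        total = m + c + carry
        carry = total >> 64          # carry save for the next block
        out.append((total & MASK64) & ~c)
    return out
```

A cursor that scans off the end of one block reappears, via the saved carry, at the start of the next: with a marker at bit 0 and a class covering the whole first block, the scan lands on bit 0 of the second block.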
\section{Performance Results}
In this section, we compare the performance of our \verb:xmlwf:
implementation using the Parabix 2 technology described above with several
other implementations.
These include the original \verb:xmlwf:
distributed as an example application of the \verb:expat: XML
parser, implementations based on the widely used Xerces
open source parser using both SAX and DOM interfaces,
and an implementation using our prior Parabix 1 technology with
bit scan operations.
Table \ref{XMLDocChars}
shows the document characteristics of the XML instances selected for this performance study,
including both document-oriented and data-oriented XML files.
The jawiki.xml and dewiki.xml files are document-oriented XML instances of Wikimedia books, written in Japanese and German, respectively. The remaining files are data-oriented.  The roads.gml file is an instance of Geography Markup Language (GML),
a modeling language for geographic information systems as well as an open interchange format for geographic transactions on the Internet.  The po.xml file is an example of purchase order data, while the soap.xml file contains a large SOAP message.
Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.
This metric is reported for each document.
File Name               & dewiki.xml    & jawiki.xml    & roads.gml     & po.xml        & soap.xml \\ \hline
File Type               & document      & document      & data          & data          & data \\ \hline
File Size (kB)          & 66240         & 7343          & 11584         & 76450         & 2717 \\ \hline
Markup Item Count       & 406792        & 74882         & 280724        & 4634110       & 18004 \\ \hline
Attribute Count         & 18808         & 3529          & 160416        & 463397        & 30001 \\ \hline
Avg. Attribute Size     & 8             & 8             & 6             & 5             & 9 \\ \hline
Markup Density          & 0.07          & 0.13          & 0.57          & 0.76          & 0.87 \\ \hline
 \caption{XML Document Characteristics}
 \label{XMLDocChars}
Table \ref{parsers-cpb} shows performance measurements for the
various \verb:xmlwf: implementations applied to the
test suite.   Measurements are made on a single core of an
Intel Core 2 system running a stock 64-bit Ubuntu 10.10 operating system,
with all applications compiled with llvm-gcc 4.4.5 at optimization level 3.
Measurements are reported in CPU cycles per input byte of
the XML data files in each case.
The first row shows the performance of the Xerces-C parser
using the tree-building DOM interface.
Note that the performance
varies considerably depending on markup density.  Note also that
the DOM tree construction overhead is substantial and unnecessary
for XML well-formedness checking.  Using the event-based SAX interface
to Xerces gives much better results, as shown in the
second row.
The third row shows the best performance of our byte-at-a-time
parsers, using the original \verb:xmlwf: based on expat.
The remaining rows of Table \ref{parsers-cpb} show the performance
of parallel bitstream implementations, including post-bitstream
processing.  The first of these rows shows
the performance of our Parabix 1 implementation using
bit scan instructions.   While it shows a substantial speed-up
over the byte-at-a-time parsers in every case, note also that
the performance advantage increases with increasing markup
density, as expected.   The last two rows show Parabix 2
implementations using different carry-handling
strategies: the ``simd'' row refers to carry
computations performed with simulated calculation of
propagated and generated carries using SIMD operations, while the
``adc64'' row refers to an implementation directly employing
the processor carry flags and add-with-carry instructions on
64-bit general registers.  In both cases, the overall
performance is impressive, with the increased
parallelism of parallel bit scans clearly paying off in
improved performance for dense markup.
Parser Class & Parser & dewiki.xml & jawiki.xml & roads.gml & po.xml & soap.xml \\ \hline
Byte & Xerces (DOM)     & 37.921 & 40.559 & 72.78  & 105.497 & 125.929 \\ \cline{2-7}
at-a & Xerces (SAX)     & 19.829 & 24.883 & 33.435 & 46.891  & 57.119 \\ \cline{2-7}
Time & expat            & 12.639 & 16.535 & 32.717 & 42.982  & 51.468 \\ \hline
Parallel & Parabix1     & 8.313  & 9.335  & 13.345 & 16.136  & 19.047 \\ \cline{2-7}
Bit & Parabix2 (simd)   & 6.103  & 6.445  & 8.034  & 8.685   & 9.53 \\ \cline{2-7}
Stream & Parabix2 (adc64) & 5.123 & 5.996 & 6.852  & 7.648   & 8.275 \\ \hline
 \end{tabular}

 \caption{Parser Performance (Cycles Per Byte)}
%gcc (simd\_add)    &   6.174   &       6.405   &       7.948   &       8.565   &       9.172 \\ \hline
%llvm (simd\_add)   &   6.104   &       6.335   &       8.332   &       8.849   &       9.811 \\ \hline
%gcc (adc64)        &   9.23    &       9.921   &       10.394  &       10.705  &       11.751 \\ \hline
%llvm (adc64)       &   5.757   &       6.142   &       6.763   &       7.424   &       7.952 \\ \hline
%gcc (SAHFLAHF)     &   7.951   &       8.539   &       9.984   &       10.219  &       11.388 \\ \hline
%llvm (SAHFLAHF)    &   5.61    &       6.02    &       6.901   &       7.597   &       8.183 \\ \hline
In application to the problem of XML parsing and well-formedness
checking, the method of parallel parsing with bitstream addition
is effective and efficient.   Using only bitstream addition
and bitwise logic, it is possible to handle all of the
character validation, lexical recognition and parsing problems
except for the recursive aspects of start and end tag matching.
Error checking is elegantly supported through the use of error
streams that eliminate separate if-statements to check for
errors with each byte.   The techniques are generally very
efficient, particularly when markup density is high.   However, for some
conditions that occur rarely and/or require complex combinations
of upshifting and logic, it may be better to define simpler
error-check streams that require limited postprocessing using
byte matching techniques.
The techniques have been implemented and assessed for present-day commodity processors employing current SIMD technology.
As processor advances bring improved instruction sets and increases
in the width of SIMD registers, the relative advantages of these
techniques over traditional byte-at-a-time sequential
parsing methods are likely to increase substantially.
Instruction set modifications that provide more convenient
carry propagation for long bitstream arithmetic would be
of particular benefit to this method.
A significant challenge to the application of these techniques
is the difficulty of programming.   The method of prototyping
on unbounded bitstreams has proven to be of significant value
in our work.   Using the prototyping language as input to
a bitstream compiler has also proven effective in generating
high-performance code.   Nevertheless, direct programming
with bitstreams is still a specialized skill; our future
research includes developing yet higher-level tools to
generate efficient bitstream implementations from grammars,
regular expressions and other text processing formalisms.
Krste Asanovic, Ras Bodik, Bryan~Christopher Catanzaro, Joseph~James Gebis,
  Parry Husbands, Kurt Keutzer, David~A. Patterson, William~Lester Plishker,
  John Shalf, Samuel~Webb Williams, and Katherine~A. Yelick.
\newblock The landscape of parallel computing research: A view from {Berkeley}.
\newblock Technical Report UCB/EECS-2006-183, EECS Department, University of
  California, Berkeley, Dec 2006.
Tim Bray, Jean Paoli, C.~M. Sperberg-McQueen, Eve Maler, and François Yergeau.
\newblock Extensible markup language ({XML}) 1.0 (fifth edition).
\newblock W3C Recommendation, 2008.
Robert~D. Cameron.
\newblock {A Case Study in SIMD Text Processing with Parallel Bit Streams}.
\newblock In {\em {ACM Symposium on Principles and Practice of Parallel
  Programming (PPoPP)}}, {Salt Lake City, Utah}, 2008.
Robert~D. Cameron, Kenneth~S. Herdy, and Dan Lin.
\newblock High performance {XML} parsing using parallel bit stream technology.
\newblock In {\em {CASCON} '08: Proceedings of the 2008 conference of the
  center for advanced studies on collaborative research}, pages 222--235, New
  York, NY, USA, 2008. ACM.
Zefu Dai, Nick Ni, and Jianwen Zhu.
\newblock A 1 cycle-per-byte {XML} parsing accelerator.
\newblock In {\em FPGA '10: Proceedings of the 18th Annual ACM/SIGDA
  International Symposium on Field Programmable Gate Arrays}, pages 199--208,
  New York, NY, USA, 2010. ACM.
Kenneth~S. Herdy, David~S. Burggraf, and Robert~D. Cameron.
\newblock High performance {GML} to {SVG} transformation for the visual
  presentation of geographic data in web-based mapping systems.
\newblock In {\em Proceedings of {SVG} Open 2008}, August 2008.
M.~G. Kostoulas, M.~Matsa, N.~Mendelsohn, E.~Perkins, A.~Heifets, and
  M.~Mercaldi.
\newblock {{XML} Screamer: An Integrated Approach to High Performance {XML}
  Parsing, Validation and Deserialization}.
\newblock In {\em Proceedings of the 15th International Conference on World
  Wide Web (WWW '06)}, pages 93--102, 2006.
Michael Leventhal and Eric Lemoine.
\newblock The {XML} chip at 6 years.
\newblock In {\em International Symposium on Processing {XML} Efficiently:
  Overcoming Limits on Space, Time, or Bandwidth}, August 2009.
Bhavik Shah, Praveen Rao, Bongki Moon, and Mohan Rajagopalan.
\newblock A data parallel algorithm for {XML DOM} parsing.
\newblock In Zohra Bellahsène, Ela Hunt, Michael Rys, and Rainer Unland,
  editors, {\em Database and {XML} Technologies}, volume 5679 of {\em Lecture
  Notes in Computer Science}, pages 75--90. Springer Berlin / Heidelberg, 2009.
Ying Zhang, Yinfei Pan, and Kenneth Chiu.
\newblock Speculative p-{DFA}s for parallel {XML} parsing.
\newblock In {\em High Performance Computing (HiPC), 2009 International
  Conference on}, pages 388--397, December 2009.