%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
%       Springer Heidelberg 2006/05/04
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
% NB: the document class 'llncs' has its own and detailed documentation, see
\urldef{\mailsa}\path|{alfred.hofmann, ursula.barth, ingrid.haas, frank.holzwarth,|
\urldef{\mailsb}\path|anna.kramer, leonie.kunz, christine.reiss, nicole.sator,|
\urldef{\mailsc}\path|erika.siebert-cole, peter.strasser, lncs}|
\mainmatter  % start of an individual contribution

% first the title is needed
\title{Parallel Scanning with Bitstream Addition: An XML Case Study}

% a short form should be given in case it is too long for the running head
\titlerunning{Parallel Scanning with Bitstream Addition}

% the name(s) of the author(s) follow(s) next
% NB: Chinese authors should write their first names(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
\author{Robert D. Cameron
\and Ehsan Amiri \and Kenneth S. Herdy \and Dan Lin \and Thomas C. Shermer \and Fred P. Popowich}
\authorrunning{Cameron {\em et al.}}

% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
\institute{Simon Fraser University, Surrey, BC, Canada\\
\email{\{cameron, eamiri, ksherdy, lindanl, shermer, popowich\}}
A parallel scanning method using the concept of bitstream addition is
introduced and studied in application to the problem of XML
parsing and well-formedness checking.
On processors supporting $W$-bit addition operations,
the method can perform up to $W$ finite state transitions per instruction.
The method is based on the concept of parallel bitstream technology,
in which parallel streams of bits are formed such that each stream
comprises bits in one-to-one correspondence with the character
code units of a source data stream.  Parsing routines are initially
prototyped in Python using its native support for unbounded
integers to represent arbitrary-length bitstreams.  A compiler
then translates the Python code into low-level C-based implementations.
These low-level implementations take advantage of
the SIMD (single-instruction multiple-data) capabilities of commodity
processors to yield a dramatic speed-up over
traditional alternatives employing byte-at-a-time parsing.
\keywords{SIMD text processing, parallel bitstreams, XML, parsing}
Although the finite state machine methods used
in the scanning and parsing of text streams are considered to be the
hardest of the ``13 dwarves'' to parallelize \cite{Asanovic:EECS-2006-183},
parallel bitstream technology shows considerable promise for these
types of applications \cite{PPoPP08,CameronHerdyLin2008,Green2009}.
In this approach, character streams are processed $N$ positions at
a time using the $N$-bit SIMD registers commonly found on commodity
processors (e.g., 128-bit XMM registers on Intel/AMD chips).
This is achieved by first slicing the byte streams into eight separate
basis bitstreams, one for each bit position within the byte.
These basis bitstreams are then combined with bitwise logic and
shifting operations to compute further parallel bit streams of
interest, such as the \verb:[<]: bit stream marking the position
of all opening angle brackets in an XML document.
Using these techniques as well as the {\em bit scan}
instructions also available on commodity processors, the
Parabix 1 XML parser was shown to considerably accelerate XML
parsing in comparison with conventional byte-at-a-time parsers
in applications such as statistics gathering \cite{CameronHerdyLin2008} and
GML to SVG conversion \cite{Herdy2008}.
Other efforts to accelerate XML parsing include the use of custom
XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and
multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}.
In this paper, we further increase the parallelism in our methods
by introducing a new parallel scanning primitive using bitstream
addition.  In essence, this primitive replaces the sequential bit
scan operations underlying Parabix 1 with a new approach that
independently advances multiple marker bits in parallel using
simple addition and logic operations.  This paper documents the
technique and evaluates it in application to the problem of XML
parsing and well-formedness checking.
Section 2 reviews the basics of parallel bitstream technology
and introduces our new parallel scanning primitive.  Section 3
goes on to show how this primitive may be used in XML scanning
and parsing, while Section 4 discusses the construction of a
complete XML well-formedness checker based on these techniques.
Section 5 then briefly describes the compiler technology used to
generate the low level code for our approach.  A performance
study in Section 6 shows that the new Parabix 2 parser is
dramatically faster than traditional byte-at-a-time parsers
as well as the original Parabix 1 parser, particularly for
dense XML markup.  Section 7 concludes the paper.
\section{The Parallel Bitstream Method}\label{sec:parabit}

source data $\vartriangleleft$ & \verb`----173942---654----1----49731----321--`\\
$B_7$ & \verb`.......................................`\\
$B_6$ & \verb`.......................................`\\
$B_5$ & \verb`111111111111111111111111111111111111111`\\
$B_4$ & \verb`....111111...111....1....11111....111..`\\
$B_3$ & \verb`1111...1..111...1111.1111.1...1111...11`\\
$B_2$ & \verb`1111.1..1.1111111111.11111.1..1111...11`\\
$B_1$ & \verb`.....11..1...1.............11.....11...`\\
$B_0$ & \verb`11111111..111.1.111111111.111111111.111`\\
\verb:[0-9]: & \verb`....111111...111....1....11111....111..`\\
\caption{Basis and Character-Class Bitstreams}
A bitstream is simply a sequence of $0$s and $1$s, where there is one such bit in the bitstream for each character in a source data stream.
For parsing, and other text processing tasks, we need to consider multiple properties of characters at different stages during the parsing process. A bitstream can be associated with each of these properties, and hence there will be multiple (parallel) bitstreams associated with a source data stream of characters.

The starting point for bitstream methods is the formation of \emph{basis} bitstreams
and their use in determining \emph{character-class} bitstreams.
The $k$th basis bitstream $B_k$ consists of the $k$th bit (0-based, starting at the least significant bit)
of each character in the source data stream;
thus each $B_k$ is dependent on the encoding of the source characters (ASCII, UTF-8, UTF-16, etc.).
Given these basis bitstreams, it is then possible to combine them
using bitwise logic in order to compute character-class
bitstreams, that is, streams that identify the positions at which characters belonging
to a particular class occur.  For example, the character-class bitstream
$D=$\verb:[0-9]: marks with $1$s the positions at which decimal digits
occur.  These bitstreams are illustrated in Figure \ref{fig:inputstreams},
for an example source data stream consisting of digits and hyphens.
This figure also illustrates some of our conventions for figures: the left triangle $\vartriangleleft$ after
``source data'' indicates that all streams are read from right to left
(i.e., they are in little-endian notation).  We also use hyphens
in the input stream to represent any character that is not relevant to a character
class under consideration, so that relevant characters stand out.
Furthermore, the $0$ bits in the bitstreams are represented by periods,
so that the $1$ bits stand out.
Transposition of source data to basis bitstreams and calculation
of character-class streams in this way is, in general, an overhead on
parallel bitstream applications.  However, using the SIMD
capabilities of current commodity processors, these operations are quite
fast, with an amortized overhead of about 1 CPU cycle per byte for
transposition and less than 1 CPU cycle per byte for all the character
classes needed for XML parsing \cite{CameronHerdyLin2008}.
%Improved instruction sets using parallel extract operations or
%inductive doubling techniques may further reduce this overhead significantly \cite{CameronLin2009,HilewitzLee2006}.

Beyond the bitwise logic needed for character-class determination,
we also need \emph{upshifting} to deal with sequential combination.
The upshift $n(S)$ of a bitstream $S$ is obtained by shifting the bits in $S$ one position forward,
then placing a $0$ bit in the starting position of the bitstream; $n$ is meant to be mnemonic of ``next''.
In $n(S)$, the last bit of $S$ may be eliminated or retained for error-testing purposes.
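These operations are easy to model in the Python prototyping style described in the abstract, with bit $i$ of an unbounded integer standing for code unit $i$ of the source. The following is a minimal sketch (the helper names are ours, and the character class is computed by direct byte lookup here rather than from bitwise logic over the basis streams):

```python
def basis_bits(data: bytes) -> list:
    """Transpose a byte stream into basis bitstreams B0..B7:
    bit i of B[k] is bit k of byte i."""
    B = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                B[k] |= 1 << i
    return B

def char_class(data: bytes, members: bytes) -> int:
    """Character-class bitstream: a 1 at each position whose byte
    is one of the given members."""
    return sum(1 << i for i, b in enumerate(data) if b in members)

def n(S: int) -> int:
    """Upshift ('next'): move every bit one position forward,
    placing a 0 in the starting position."""
    return S << 1

data = b"----173942---654--"
B = basis_bits(data)
D = char_class(data, b"0123456789")
# ASCII digits 0x30-0x39 all have bits 4 and 5 set, while the hyphen
# (0x2D) lacks bit 4, so for digit-and-hyphen data the class [0-9]
# reduces to the bitwise logic B4 & B5 (cf. Figure \ref{fig:inputstreams}).
assert D == B[4] & B[5]
```

In the full system these classes are formed from the basis bitstreams with boolean logic only; the lookup above is merely the simplest way to state what the class stream contains.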
\subsection{A Parallel Scanning Primitive}

In this section, we introduce the principal new feature of the paper,
a parallel scanning method based on bitstream addition.  Key to this
method is the concept of {\em marker} bitstreams.
Marker bitstreams are used to represent positions of interest in the
scanning or parsing of a source data stream.
The appearance of a 1 at a position in a marker bitstream could, for example, denote
the starting position of an XML tag in the data stream.  In general, the set of
bit positions in a marker bitstream may be considered to be the current
parsing positions of multiple parses taking place in parallel throughout
the source data stream.
Figure \ref{fig:scan1} illustrates the basic concept
underlying parallel parsing with bitstream addition.
All streams are shown in little-endian
representation, with streams reading from right-to-left.
The first row shows a source data stream that includes several
spans of digits, together with other nondigit characters shown
as hyphens.  The second row specifies the parsing problem
using a marker bitstream $M_0$ to mark four initial marker
positions.  In three instances, these markers are at
the beginning (i.e., little end) of a span, while one is in
the middle of a span.
The parallel parsing task is to move each
of the four markers forward (to the left) through the corresponding spans of
digits to the immediately following positions.

source data $\vartriangleleft$ & \verb`----173942---654----1----49731----321--`\\
$M_0$ & \verb`.........1.....1....1......1...........`\\
$D = $\verb:[0-9]: & \verb`....111111...111....1....11111....111..`\\
$M_0 + D$ & \verb`...1........1......1....1...11....111..`\\
$M_1 = (M_0 + D) \wedge \neg D$ & \verb`...1........1......1....1..............`
\caption{Parallel Scan Using Bitstream Addition and Mask}
The third row of Figure \ref{fig:scan1}
shows the derived character-class bitstream $D$ identifying
positions of all digits in the source stream.
The fourth row then illustrates the key concept: marker movement
is achieved by binary addition of the marker and character-class
bitstreams.  As a marker 1 bit is combined using binary addition with
a span of 1s, each 1 in the span becomes 0, generating
a carry to add to the next position to the left.
For each such span, the process terminates at the left end
of the span, generating a 1 bit in the immediately
following position.  These generated 1 bits represent
the moved marker bits.  The result of the
addition also produces some additional bits that are
not involved in the scan operation.
However, these are easily removed as shown in the fifth row,
by applying bitwise logic to mask
off any bits from the digit bitstream; these can never
be marker positions resulting from a scan.
The addition and masking technique allows matching of
the regular expression \verb:[0-9]*: for any reasonable
(conflict-free) set of initial markers specified in $M_0$.

A conflict occurs when a span from one marker would run
into another marker position.  However, such conflicts
do not occur with the normal methods of marker bitstream
formation, in which unique syntactic features of
the input stream are used to specify the initial marker
positions.
In the remainder of this paper, the notation $s(M, C)$
denotes the operation to scan
from an initial set of marker positions $M$ through
the spans of characters belonging to a character class $C$ found at each position:
\[s(M, C) = (M + C)  \wedge \neg C\]
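In the Python prototype model (bit $i$ of an unbounded integer corresponds to character $i$, so a scan advances toward higher bit positions), the primitive is a single expression. The sketch below (helper names ours) checks it against an independently computed answer, namely the positions immediately following each digit span:

```python
def char_class(data: bytes, members: bytes) -> int:
    """Bitstream with a 1 wherever the byte at that position is a member."""
    return sum(1 << i for i, b in enumerate(data) if b in members)

def s(M: int, C: int) -> int:
    """Parallel scan: each marker in M advances through its span of
    class-C positions.  Addition carries each marker past its span;
    the mask clears the carry debris left inside the spans."""
    return (M + C) & ~C

data = b"--1739---654--8--"
D = char_class(data, b"0123456789")
# Markers at the first digit of each span: digit positions whose
# predecessor is not a digit.
M0 = D & ~(D << 1)
# Expected result: non-digit positions whose predecessor is a digit,
# i.e., the position immediately following each span.
follows = (D << 1) & ~D
assert s(M0, D) == follows
```

All spans advance in the same two machine operations, regardless of how many markers are in flight.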
\section{XML Scanning and Parsing}

We now consider how the parallel scanning primitive can
be applied to the following problems in scanning and
parsing of XML structures: (1) parallel scanning of XML decimal character references,
and (2) parallel parsing of XML start tags.
The grammar of these structures is shown in Figure \ref{fig:xmlgrmr}.

DecRef & ::=   &        '\verb:&#:' Digit+ '\verb:;:'  \\
Digit  & ::=   &         \verb:[0-9]:\\
STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
Attribute & ::=   &        Name WS? '=' WS? AttValue \\
AttValue  &           ::=   &        `\verb:":' \verb:[^<"]*: `\verb:":' $|$ ``\verb:':'' \verb:[^<']*: ``\verb:':'' \\
%DQuoted & ::= & \verb:[^<"]*:  \\
%SQuoted & ::= & \verb:[^<']*:
\multicolumn{2}{l}{source data $\vartriangleright$}
                                         & \verb`-&#978;-&9;--&#;--&#13!-`\\
$M_0$ &                                  & \verb`.1......1....1....1.....`\\
$M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
$E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
$M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
$E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
$M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
$E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
$M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
$E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
\caption{Parsing Decimal References}
Figure \ref{fig:decref} shows the parallel parsing of
decimal references together with error checking.
For clarity, the streams are now shown in left-to-right
order as indicated by the $\vartriangleright$ symbol.
The source data includes four instances of potential
decimal references beginning with the \verb:&: character.
Of these, only the first one is legal according to
the decimal reference syntax; the other three instances
are in error.  These references may be parsed in
parallel as follows.  The
starting marker bitstream $M_0$ is formed from the \verb:[&]:
character-class bitstream as shown in the second row.  The next row shows the
result of the marker advance operation $n(M_0)$ to
produce the new marker bitstream $M_1$.  At this point,
a hash mark is required, so the first error bitstream $E_0$ is
formed using a bitwise ``and'' operation combined with negation,
to indicate violations of this condition.
Marker bitstream $M_2$ is then defined as those positions
immediately following any $M_1$ positions not in error.
In the following row, the condition that at least
one digit is required is checked to produce error bitstream $E_1$.
A parallel scan operation is then applied through the
digit sequences as shown in the next row to produce
marker bitstream $M_3$.  The final error bitstream $E_2$ is
produced to identify any references without a
closing semicolon.
In the penultimate row, the final marker bitstream $M_4$ marks the
positions of all fully-checked decimal references, while the
last row defines a unified error bitstream $E$
indicating the positions of all detected errors.
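This pass transcribes almost line for line into the unbounded-integer Python style (bit $i$ corresponds to character $i$; the helper names are ours):

```python
def cc(data: bytes, members: bytes) -> int:
    """Character-class bitstream for the given member bytes."""
    return sum(1 << i for i, b in enumerate(data) if b in members)

def n(S: int) -> int:
    """Upshift: advance every bit one position ('next')."""
    return S << 1

def s(M: int, C: int) -> int:
    """Parallel scan through spans of class C."""
    return (M + C) & ~C

# Source data modeled on Figure decref: one legal reference, three bad.
data = b"-&#978;-&9;--&#;--&#13!-"
D  = cc(data, b"0123456789")

M0 = cc(data, b"&")              # initial markers at each '&'
M1 = n(M0)
E0 = M1 & ~cc(data, b"#")        # error: no '#' after '&'
M2 = n(M1 & ~E0)
E1 = M2 & ~D                     # error: no digit after '#'
M3 = s(M2 & ~E1, D)              # scan through the digit span
E2 = M3 & ~cc(data, b";")        # error: no closing ';'
M4 = M3 & ~E2                    # fully checked decimal references
E  = E0 | E1 | E2                # unified error bitstream

assert M4 == 1 << 6              # only '&#978;' parses; marker at its ';'
assert E == (1 << 9) | (1 << 15) | (1 << 22)
```

The three error positions recovered here correspond to the `&9`, `&#;`, and `&#13!` instances, matching the $E$ row of the figure.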
Initialization of marker streams may be achieved in various ways,
dependent on the task at hand.
In the XML parsing context,
we rely on an important property of well-formed
XML: after an initial filtering pass to identify
XML comments, processing instructions and CDATA
sections, every remaining \verb:<: in the
file must be the initial character of a start,
end or empty element tag, and every remaining \verb:&:
must be the initial character of a general entity
or character reference. These assumptions permit easy creation of
marker bitstreams for XML tags and XML references.

The parsing of XML start tags is a richer problem, involving
the sequential structure of attribute-value pairs as shown in Figure \ref{fig:xmlgrmr}.
Using the bitstream addition technique, our method
is to start with the opening angle bracket of all tags as
the initial marker bitstream for parsing the tags in parallel,
advance through the element name and then use an iterative
process to move through attribute-value pairs.
Figure \ref{fig:stag-ex}
illustrates the parallel parsing of three XML start tags.
The figure omits determination
of error bitstreams, processing of single-quoted attribute values and handling
of empty element tags, for simplicity.  In this figure, the first
four rows show the source data and three character class bitstreams:
$N$ for characters permitted in XML names,
$W$ for whitespace characters,
and $Q$ for characters permitted within a double-quoted attribute value string.
source data $\vartriangleright$ & \verb`--<e a= "137">---<el2 a="17" a2="3379">---<x>--`\\
$N = $ name chars & \verb`11.1.1...111..111.111.1..11..11..1111..111.1.11`\\
$W = $ white space & \verb`....1..1.............1......1..................`\\
$Q = \neg$\verb:["<]: & \verb`11.11111.111.1111.111111.11.1111.1111.1111.1111`\\

$M_0$ & \verb`..1..............1........................1....`\\
$M_1 = n(M_0)$ & \verb`...1..............1........................1...`\\
$M_{0,7} = s(M_1, N)$ & \verb`....1................1......................1..`\\
$M_{0,8} = s(M_{0,7}, W) \wedge \neg$\verb:[>]: & \verb`.....1................1........................`\\

$M_{1,1} = s(M_{0,8}, N)$ & \verb`......1................1.......................`\\
$M_{1,2} = s(M_{1,1}, W) \wedge$\verb:[=]: & \verb`......1................1.......................`\\
$M_{1,3} = n(M_{1,2})$ & \verb`.......1................1......................`\\
$M_{1,4} = s(M_{1,3}, W) \wedge$\verb:["]: & \verb`........1...............1......................`\\
$M_{1,5} = n(M_{1,4})$ & \verb`.........1...............1.....................`\\
$M_{1,6} = s(M_{1,5}, Q) \wedge$\verb:["]: & \verb`............1..............1...................`\\
$M_{1,7} = n(M_{1,6})$ & \verb`.............1..............1..................`\\
$M_{1,8} = s(M_{1,7}, W) \wedge \neg$\verb:[>]: & \verb`.............................1.................`\\

$M_{2,1} = s(M_{1,8}, N)$ & \verb`...............................1...............`\\
$M_{2,2} = s(M_{2,1}, W) \wedge$\verb:[=]: & \verb`...............................1...............`\\
$M_{2,3} = n(M_{2,2})$ & \verb`................................1..............`\\
$M_{2,4} = s(M_{2,3}, W) \wedge$\verb:["]: & \verb`................................1..............`\\
$M_{2,5} = n(M_{2,4})$ & \verb`.................................1.............`\\
$M_{2,6} = s(M_{2,5}, Q) \wedge$\verb:["]: & \verb`.....................................1.........`\\
$M_{2,7} = n(M_{2,6})$ & \verb`......................................1........`\\
$M_{2,8} = s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb`...............................................`
\caption{Start Tag Parsing}
The parsing process is illustrated in the remaining rows of the
figure.  Each successive row shows the set of parsing markers as they
advance in parallel using bitwise logic and addition.
Overall, the sets of marker transitions can be divided into three groups.

The first group
$M_0$ through $M_{0,8}$ shows the initiation of parsing for each of the
tags through the
opening angle brackets and the element names, up to the first
attribute name, if present.  Note that there are no attribute names
in the final tag shown, so the corresponding marker becomes zeroed
out at the closing angle bracket.
Since $M_{0,8}$ is not all $0$s, the parsing continues.

The second group of marker transitions
$M_{1,1}$ through $M_{1,8}$ deal with the parallel parsing of the first attribute-value
pair of the remaining tags.
After these operations, there are no more attributes
in the first tag, so its corresponding marker becomes zeroed out.
However, $M_{1, 8}$ is not all $0$s, as the second tag still has an unparsed attribute-value pair.
Thus, the parsing continues.

The third group of marker transitions $M_{2,1}$ through $M_{2,8}$ deal with the parsing of
the second attribute-value pair of this tag.  The
final transition to $M_{2,8}$ shows the zeroing out of all remaining markers
once two iterations of attribute-value processing have taken place.
Since $M_{2,8}$ is all $0$s, start tag parsing stops.

The implementation of start tag processing uses a while loop that
terminates when the set of active markers becomes zero,
i.e.\ when some $M_{k, 8} = 0$.
Considered as an iteration over unbounded bitstreams, all start tags in the document
are processed in parallel, using a number of iterations equal to the maximum
number of attribute-value pairs in any one tag in the document.
However, in block-by-block processing, the cost of iteration is considerably reduced; the iteration for
each block only requires as many steps as there are attribute-value pairs
overlapping the block.
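The start-tag pass, including the attribute-value loop, can be prototyped in the same unbounded-integer style. The sketch below simplifies the character classes (ASCII letters and digits for names; no single-quoted values or empty-element tags, as in the start tag example above) and counts the loop iterations; all helper and variable names are ours:

```python
import string

def cc(data: bytes, members: bytes) -> int:
    return sum(1 << i for i, b in enumerate(data) if b in members)

def n(S: int) -> int:
    return S << 1

def s(M: int, C: int) -> int:
    return (M + C) & ~C

data = b'<el a="v" b="w">--<x>'
N  = cc(data, string.ascii_letters.encode() + b"0123456789")  # name chars
W  = cc(data, b" \t\r\n")                                     # whitespace
Q  = ~cc(data, b'"<') & ((1 << len(data)) - 1)                # [^<"] value chars
EQ = cc(data, b"=")
DQ = cc(data, b'"')
GT = cc(data, b">")

M = n(cc(data, b"<"))      # just past each opening angle bracket
M = s(M, N)                # past each element name
M = s(M, W) & ~GT          # at a first attribute name; '>' zeroes out
pairs = 0
while M:                   # one pass per attribute-value pair
    pairs += 1
    M = s(M, N)            # past the attribute name
    M = s(M, W) & EQ       # optional whitespace, then '='
    M = n(M)
    M = s(M, W) & DQ       # optional whitespace, then opening '"'
    M = n(M)
    M = s(M, Q) & DQ       # through the value to the closing '"'
    M = n(M)
    M = s(M, W) & ~GT      # at the next attribute name; '>' zeroes out

# The longest tag has two attribute-value pairs, so two iterations.
assert pairs == 2
```

The `<x>` tag with no attributes drops out before the loop starts, just as the third tag of the figure does.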
Following the pattern shown here, the remaining syntactic
features of XML markup can similarly be parsed with
bitstream-based methods.  One complexity is that the
parsing of comments,
CDATA sections and processing instructions must be
performed first to determine those regions of text
within which ordinary XML markup is not parsed (i.e.,
within each of these types of construct).  This is handled
by first performing the parsing of these structures and
then forming a {\em mask bitstream}, that is, a stream that
identifies spans of text to be excluded from parsing
(comment and CDATA interiors, parameter text to processing instructions).
\section{XML Well-Formedness}

In this section, we consider the full application of the parsing techniques
of the previous section to the problem of XML well-formedness checking \cite{TR:XML}.
We look not only at the question of well-formedness, but also at
the identification of error positions in documents that
are not well-formed.
%\subsection{Error and Error-Check Bitstreams}

Most of the requirements of XML well-formedness checking
can be implemented using two particular types of computed
bitstream: {\em error bitstreams}, introduced in the previous section, and {\em error-check bitstreams}.
Recall that an error bitstream is a stream marking the location of definite errors in accordance with
a particular requirement.  For example, the
$E_0$, $E_1$, and $E_2$ bitstreams as computed during parsing of
decimal character references in Figure \ref{fig:decref}
are error bitstreams.  One bits mark definite errors and zero bits mark the
absence of an error.
Thus the complete absence of errors according to the
requirements listed may be determined by forming the
bitwise logical ``or'' of these bitstreams and confirming
that the resulting value is zero.  An error-check bitstream is one
that marks potential errors to be further checked in
some fashion during post-bitstream processing.
An example is the bitstream marking the start positions
of CDATA sections.  This is a useful information stream
computed during bitstream processing to identify opening
\verb:<![: sequences, but it also marks positions at which to
subsequently check for the complete opening
delimiter \verb:<![CDATA[:.
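Post-bitstream processing of such a sparse error-check stream can walk its 1 bits one at a time using the `x & -x` bit-scan idiom on an unbounded integer. A sketch (the `<![` candidate stream is built here by direct string matching, and the function name is ours):

```python
def cdata_open_errors(data: bytes) -> list:
    """Return positions flagged by a '<![' error-check bitstream at
    which the full '<![CDATA[' delimiter is absent."""
    # Error-check bitstream: 1 at each candidate '<![' position.
    check = 0
    for i in range(len(data) - 2):
        if data[i:i + 3] == b"<![":
            check |= 1 << i
    errors = []
    while check:
        low = check & -check              # isolate lowest 1 bit (bit scan)
        pos = low.bit_length() - 1        # its position in the source
        if data[pos:pos + 9] != b"<![CDATA[":
            errors.append(pos)
        check ^= low                      # clear that bit and continue
    return errors

src = b"<a><![CDATA[x]]></a><![BAD["
assert cdata_open_errors(src) == [20]     # second '<![' lacks 'CDATA['
```

Because the stream is sparse, the loop body executes only a handful of times per document, which is why one-at-a-time multibyte checks are competitive here.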
In typical documents, most of these error-check streams will be quite sparse
or even zero.  Many error conditions could
actually be fully implemented using bitstream techniques,
but at the cost of a number of additional logical and shift
operations.  In general, the conditions are
easier and more efficient to check one at a time using
multibyte comparisons on the original source data stream.
With very sparse streams, it is very unlikely that
multiple instances occur within any given block, thus
eliminating the benefit of parallel evaluation of the logic.
The requirement for name checking merits comment.  XML
names may use a wide range of Unicode character values.
It is too expensive to check every instance of an XML name
against the full range of possible values.  However, it is
possible and quite inexpensive to use parallel bitstream
techniques to verify that any ASCII characters within a name
are indeed legal name start characters or name characters.
Furthermore, the characters that may legally follow a
name in XML are confined to the ASCII range.  This makes
it useful to define a name scan character class to include all the legal ASCII characters
for names as well as all non-ASCII characters.
A namecheck character-class bitstream will then be defined to identify non-ASCII
characters found within name scans.  In most documents
this bitstream will be all $0$s; even in documents with substantial
internationalized content, the tag and attribute names used
to define the document schema tend to be confined to the
ASCII repertoire.  In the case that this bitstream is nonempty,
the positions of all 1 bits in this bitstream denote characters
that need to be individually validated.
Attribute names within a single XML start tag or empty
element tag must be unique.  This requirement could be
implemented using one of several different approaches.  Standard
approaches include sequential search, symbol lookup, and Bloom filters
%\subsection{Tag Matching}

Except for empty element tags, XML tags come in pairs with
names that must be matched.  To discharge this requirement,
we form a bitstream consisting of the disjunction of three
bitstreams formed during parsing: the bitstream marking the
positions of start or empty tags (which have a common
initial structure), the bitstream marking tags that end using
the empty tag syntax (``\verb:/>:''), and the bitstream
marking the occurrences of end tags.  In post-bitstream
processing, we iterate through this computed bitstream
and match tags using an iterative stack-based approach.
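This matching step can be sketched as follows, again walking set bits with a bit scan. Here the combined bitstream and the tag names are recovered by direct inspection of the source (in the actual parser they come from the parsing passes), and attribute values containing `>` are ignored for simplicity; the function name is ours:

```python
def tags_balanced(data: bytes) -> bool:
    """Iterate a combined tag bitstream with a stack: start tags push
    their names, end tags must pop an equal name, and empty element
    tags ('.../>') are skipped."""
    combined = sum(1 << i for i, b in enumerate(data) if b == ord("<"))
    stack = []
    while combined:
        low = combined & -combined        # lowest marked tag position
        pos = low.bit_length() - 1
        close = data.index(b">", pos)     # simplification: no '>' in values
        if data[pos + 1] == ord("/"):     # end tag: must match the top
            if not stack or stack.pop() != data[pos + 2:close]:
                return False
        elif data[close - 1] != ord("/"): # start tag (not an empty tag)
            stack.append(data[pos + 1:close].split()[0])
        combined ^= low                   # clear this bit and continue
    return not stack

assert tags_balanced(b'<a><b x="1">t</b><c/></a>')
assert not tags_balanced(b"<a><b></a></b>")
```

The stack discipline is inherently sequential, which is why this requirement is discharged in post-bitstream processing rather than with bitstream logic.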
%\subsection{Document Structure}

An XML document consists of a single root element with recursively
defined structure together with material in the document
prolog and epilog.  Verifying this top-level structure and
the structure of the prolog and epilog material is not
well suited to parallel bitstream techniques, in particular, nor
to any form of parallelism, in general.  In essence, the
prolog and epilog materials occur once per document instance.
Thus the requirements to check this top-level structure
for well-formedness are relatively minor, with an overhead
that is quite low for sufficiently large files.
Overall, parallel bitstream techniques are well-suited to
verification problems such as XML well-formedness checking.
Many of the character validation and syntax checking requirements
can be conveniently and efficiently implemented using error streams.
Other requirements are supported by the computation of
error-check streams for simple post-bitstream processing, or of
composite streams over which iterative stack-based procedures
can be defined for checking recursive syntax.  To assess
the completeness of our analysis, we have confirmed that
our implementations correctly handle all the well-formedness
checks of the W3C XML Conformance Test Suite.
\section{Compilation to Block-Based Processing}
While our Python implementation of the techniques described in the previous section works on unbounded bitstreams, a corresponding
C implementation needs to process an input stream in blocks of size equal to the
SIMD register width of the processor it runs on.
So, to convert Python code into C, the key question becomes how
to transfer information from one block to the next.
The answer lies in the use of {\em carry bits}.
The parallel scanning primitive uses only addition and bitwise logic.
The logic operations do not require information flow
across block boundaries, so the information flow is entirely
accounted for by the carry bits for addition.   Carry bits also
capture the information flow associated with upshift
operations, which move information forward one position
in the file.   In essence, an upshift by one position for
a bitstream is equivalent to the addition of the stream
to itself; the bit shifted out in an upshift is in this
case equivalent to the carry generated by the addition.
The only other information flow requirement in the
calculation of parallel bitstreams occurs with the
bitstream subtractions that are used to calculate span streams.
In this case, the information flow is based on the borrows
generated, which can be handled in the same way as carries.
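The equivalence of upshifting and self-addition, and of the shifted-out bit and the carry, can be checked directly on unbounded Python integers restricted to a single block (the 64-bit block width here is illustrative):

```python
w = 64                             # hypothetical SIMD register width
mask = (1 << w) - 1                # confines values to one w-bit block

s = 0b1011 | (1 << (w - 1))        # a block with its high bit set

upshift = (s << 1) & mask          # upshift by one position within the block
via_add = (s + s) & mask           # addition of the stream to itself
assert upshift == via_add          # the two are equivalent

carry = (s + s) >> w               # carry generated out of the block...
shifted_out = s >> (w - 1)         # ...equals the bit shifted out by the upshift
assert carry == shifted_out == 1
```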
Properly determining, initializing and inserting carry bits
into a block-by-block implementation of parallel bitstream
code is a task too tedious for manual implementation.
We have thus developed compiler technology to automatically
insert declarations, initializations and carry save/restore
operations into the appropriate locations when translating
Python operations on unbounded bitstreams into the
equivalent low-level C code implemented on a block-by-block
basis.  Our current compiler toolkit is capable of inserting
carry logic using a variety of strategies, including
simulated carry bit processing with SIMD registers as
well as carry-flag processing using the processor's general
purpose registers and ALU.   Details are beyond the
scope of this paper, but are described in our on-line
source code repository.
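The essential carry threading that such a compiler must generate can be sketched in Python: two bitstreams are added block by block, with the carry saved out of each block and restored into the next, and the result checked against unbounded addition. The block width and helper names here are illustrative, not our compiler's output.

```python
W = 64                              # hypothetical block (register) width
MASK = (1 << W) - 1

def blocked_add(a_blocks, b_blocks):
    """Add two bitstreams block by block, threading a carry bit
    between blocks -- the information flow that must be inserted
    automatically into block-based code."""
    carry = 0
    out = []
    for a, b in zip(a_blocks, b_blocks):
        total = a + b + carry       # restore carry from the previous block
        out.append(total & MASK)
        carry = total >> W          # save carry for the next block
    return out, carry

def to_blocks(n, count):
    """Split an unbounded bitstream into count W-bit blocks."""
    return [(n >> (W * i)) & MASK for i in range(count)]

a, b = (1 << 100) - 1, 1            # a carry must propagate across a block boundary
sum_blocks, _ = blocked_add(to_blocks(a, 3), to_blocks(b, 3))
assert sum_blocks == to_blocks(a + b, 3)   # matches unbounded addition
```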
\section{Performance Results}
In this section, we compare the performance of our \verb:xmlwf:
implementation using the Parabix 2 technology described above with several
other implementations.
These include the original \verb:xmlwf:
distributed as an example application of the \verb:expat: XML
parser, implementations based on the widely used Xerces
open source parser using both SAX and DOM interfaces,
and an implementation using our prior Parabix 1 technology with
bit scan operations.
Table \ref{XMLDocChars}
shows the document characteristics of the XML instances selected for this performance study,
including both document-oriented and data-oriented XML files.
The jawiki.xml and dewiki.xml files are document-oriented XML instances of Wikimedia books, written in Japanese and German, respectively.  The remaining files are data-oriented.  The roads.gml file is an instance of Geography Markup Language (GML),
a modeling language for geographic information systems as well as an open interchange format for geographic transactions on the Internet.  The po.xml file is an example of purchase order data, while the soap.xml file contains a large SOAP message.
Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.
This metric is reported for each document.
File Name               & dewiki.xml            & jawiki.xml            & roads.gml     & po.xml        & soap.xml \\ \hline
File Type               & document      & document      & data  & data  & data   \\ \hline
File Size (kB)          & 66240                 & 7343                  & 11584         & 76450         & 2717 \\ \hline
Markup Item Count       & 406792                & 74882                 & 280724        & 4634110       & 18004 \\ \hline
Attribute Count         & 18808                 & 3529                  & 160416        & 463397        & 30001\\ \hline
Avg. Attribute Size     & 8                     & 8                     & 6             & 5             & 9\\ \hline
Markup Density          & 0.07                  & 0.13                  & 0.57          & 0.76          & 0.87  \\ \hline
 \caption{XML Document Characteristics}
 \label{XMLDocChars}
Table \ref{parsers-cpb} shows performance measurements for the
various \verb:xmlwf: implementations applied to the
test suite.   Measurements are made on a single core of an
Intel Core 2 system running a stock 64-bit Ubuntu 10.10 operating system,
with all applications compiled with llvm-gcc 4.4.5 at optimization level 3.
Measurements are reported in CPU cycles per input byte of
the XML data files in each case.
The first row shows the performance of the Xerces C parser
using the tree-building DOM interface.
Note that the performance
varies considerably depending on markup density.  Note also that
the DOM tree construction overhead is substantial and unnecessary
for XML well-formedness checking.  Using the event-based SAX interface
to Xerces gives much better results, as shown in the
second row.
The third row shows the best performance of our byte-at-a-time
parsers, using the original \verb:xmlwf: based on expat.
The remaining rows of Table \ref{parsers-cpb} show the performance
of the parallel bitstream implementations.  The first of these shows
the performance of our Parabix 1 implementation using
bit scan instructions.   While it shows a substantial speed-up
over the byte-at-a-time parsers in every case, note also that
the performance advantage increases with increasing markup
density, as expected.   The last two rows show different versions of
\verb:xmlwf: implemented using the Parabix 2 technology
discussed in this paper.   They differ in carry handling
strategy, with the ``simd'' row referring to carry
computations performed with simulated calculation of
propagated and generated carries using SIMD operations, while the
``adc64'' row refers to an implementation directly employing
the processor carry flags and add-with-carry instructions on
64-bit general registers.  In both cases, the overall
performance is quite impressive, with the increased
parallelism of parallel bit scans clearly paying off in
improved performance for dense markup.
Parser Class & Parser & dewiki.xml  & jawiki.xml    & roads.gml  & po.xml & soap.xml  \\ \hline
Byte & Xerces (DOM)    &    37.921   &    40.559   &    72.78   &    105.497   &    125.929  \\ \cline{2-7}
at-a & Xerces (SAX)   &     19.829   &    24.883   &    33.435   &    46.891   &    57.119      \\ \cline{2-7}
Time & expat      &  12.639   &    16.535   &    32.717   &    42.982   &    51.468      \\ \hline
Parallel & Parabix1   &    8.313   &    9.335   &     13.345   &    16.136   &      19.047 \\ \cline{2-7}
Bit & Parabix2 (simd)   &        6.103   &    6.445   &    8.034   &    8.685   &    9.53 \\ \cline{2-7}
Stream & Parabix2 (adc64)       &       5.123   &    5.996   &    6.852   &    7.648   &    8.275 \\ \hline
 \end{tabular}
 \caption{Parser Performance (Cycles Per Byte)}
%gcc (simd\_add)    &   6.174   &       6.405   &       7.948   &       8.565   &       9.172 \\ \hline
%llvm (simd\_add)   &   6.104   &       6.335   &       8.332   &       8.849   &       9.811 \\ \hline
%gcc (adc64)        &   9.23   &        9.921   &       10.394   &      10.705   &      11.751 \\ \hline
%llvm (adc64)       &   5.757   &       6.142   &       6.763   &       7.424   &       7.952 \\ \hline
%gcc (SAHFLAHF)    &    7.951   &       8.539   &       9.984   &       10.219   &      11.388 \\ \hline
%llvm(SAHFLAHF)    &    5.61   &        6.02   &        6.901   &       7.597   &       8.183 \\ \hline
In application to the problem of XML parsing and well-formedness
checking, the method of parallel parsing with bitstream addition
is effective and efficient.   Using only bitstream addition
and bitwise logic, it is possible to handle all of the
character validation, lexical recognition and parsing problems
except for the recursive aspects of start and end tag matching.
Error checking is elegantly supported through the use of error
streams that eliminate separate if-statements to check for
errors with each byte.   The techniques are generally very
efficient, particularly when markup density is high.   However, for some
conditions that occur rarely and/or require complex combinations
of upshifting and logic, it may be better to define simpler
error-check streams that require limited postprocessing using
byte matching techniques.
The techniques have been implemented and assessed for present-day commodity processors employing current SIMD technology.
As processor advances bring improved instruction sets and increases
in the width of SIMD registers, the relative advantages of these
techniques over traditional byte-at-a-time sequential
parsing methods are likely to increase substantially.
Instruction set modifications that provide more convenient
carry propagation for long bitstream arithmetic would be
of particular benefit to this method.
A significant challenge to the application of these techniques
is the difficulty of programming.   The method of prototyping
on unbounded bitstreams has proven to be of significant value
in our work.   Using the prototyping language as input to
a bitstream compiler has also proven effective in generating
high-performance code.   Nevertheless, direct programming
with bitstreams is still a specialized skill; our future
research includes developing yet higher level tools to
generate efficient bitstream implementations from grammars,
regular expressions and other text processing formalisms.