%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
%       Springer Heidelberg 2006/05/04
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
% NB: the document class 'llncs' has its own and detailed documentation, see
\urldef{\mailsa}\path|{alfred.hofmann, ursula.barth, ingrid.haas, frank.holzwarth,|
\urldef{\mailsb}\path|anna.kramer, leonie.kunz, christine.reiss, nicole.sator,|
\urldef{\mailsc}\path|erika.siebert-cole, peter.strasser, lncs}|
\mainmatter  % start of an individual contribution

% first the title is needed
\title{Parallel Scanning with Bitstream Addition: An XML Case Study}

% a short form should be given in case it is too long for the running head
\titlerunning{Parallel Scanning with Bitstream Addition}

% the name(s) of the author(s) follow(s) next
% NB: Chinese authors should write their first names(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
\author{Robert D. Cameron
\and Ehsan Amiri \and Kenneth S. Herdy \and Dan Lin \and Thomas C. Shermer \and Fred P. Popowich}
\authorrunning{Cameron {\em et al.}}
% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
\institute{Simon Fraser University, Surrey, BC, Canada\\
\email{\{cameron, eamiri, ksherdy, lindanl, shermer, popowich\}}
A parallel scanning method using the concept of bitstream addition is
introduced and studied in application to the problem of XML
parsing and well-formedness checking.
% The method parallelizes
% finite-state transitions, using carry propagation to achieve up
% to $W$ transitions with each $W$-bit binary addition operation.
On processors supporting $W$-bit addition operations,
the method can perform up to $W$ finite state transitions per instruction.
The method is based on the concept of parallel bitstream technology,
in which parallel streams of bits are formed such that each stream
comprises bits in one-to-one correspondence with the character
code units of a source data stream.  Parsing routines are initially
prototyped in Python using its native support for unbounded
integers to represent arbitrary-length bitstreams.  A compiler
then translates the Python code into low-level C-based implementations.
These low-level implementations take advantage of
the SIMD (single-instruction multiple-data) capabilities of commodity
processors to yield a dramatic speed-up over
traditional alternatives employing byte-at-a-time parsing.
%\keywords{SIMD text processing, parallel bitstream technology, XML, parsing}
\keywords{SIMD text processing, parallel bitstreams, XML, parsing}
Traditional byte-at-a-time parsing technology is increasingly
mismatched to the capabilities of modern processors.  Current
commodity processors generally possess 64-bit general purpose registers
as well as 128-bit SIMD registers, with 256-bit registers now
appearing.  General purpose processing on graphics processors
can make available 512-bit or wider registers.  Parsing models
based on the traditional loading and processing of 8 bits at a time
would seem to be greatly underutilizing processor resources.

Unfortunately, parsing is hard to parallelize.  Indeed, in their seminal
report outlining the landscape of parallel computing research,
researchers from Berkeley identified the finite state machine
methods underlying parsing and lexical processing as the hardest
of the ``13 dwarves'' to parallelize, concluding at one point that
``nothing helps.'' \cite{Asanovic:EECS-2006-183}  SIMD methods, in particular, would seem to
be ill-suited to parsing, because textual data streams are seldom organized in
convenient 16-byte blocks, tending to consist instead of
variable-length items in generally unpredictable patterns.
Nevertheless, there have been some notable works such as that of
Scarpazza in applying the multicore and SIMD capabilities of the
Cell/BE processor to regular expression matching \cite{Scarpazza:2009}.
Intel has also signalled the importance of accelerated string
processing to its customers through the introduction of new string processing
instructions in the SSE 4.2 instruction set extension, demonstrating
how those features may be used to advantage in activities such as
XML parsing \cite{XMLSSE42}.
Our research has been exploring a promising alternative approach, however, based on
the concept of {\em parallel bitstreams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}.
In this approach, byte streams
are first sliced into eight basis bitstreams, one for each
bit position within the byte.  Bitstream $i$ thus comprises
the $i$th bit of each byte.  Using 128-bit SIMD registers, then,
bitwise logic operations on these basis bitstreams allow
byte classification operations to be carried out in parallel
for 128 bytes at a time.  For example, consider a character class
bitstream \verb:[<]: using a 1 bit to mark the position of each
opening angle bracket in a byte stream.  This stream may be
computed as a logical combination of the basis bitstreams using
only seven bitwise logical operations per 128 bytes.
Based on this approach, our prior work has shown how parallel
bitstreams may be used to accelerate XML parsing by further
taking advantage of processor {\em bit scan} instructions, commonly
found within commodity processors \cite{CameronHerdyLin2008}.
On current Intel or AMD processors, for example, these instructions
allow one to determine the position of the first 1 bit in a group of 64
in a single instruction.  Using these techniques, our Parabix 1
parser offered considerable acceleration of XML
parsing in statistics gathering \cite{CameronHerdyLin2008} as well
as GML to SVG conversion \cite{Herdy2008}.
Other efforts to accelerate XML parsing include the use of custom
XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and
multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}.
In this paper, we further increase the parallelism in our methods
by introducing a new parallel scanning primitive using bitstream
addition.  In essence, multiple 1 bits in a marker stream
identify current scanning positions for multiple instances
of a particular syntactic context within a byte stream.
These multiple marker positions may each be independently
advanced in parallel using addition and masking.
The net result is a new scanning primitive that
allows multiple instances
of syntactic elements to be parsed simultaneously.  For example,
in dense XML markup, one might find several instances of particular
types of markup tags within a given 64-byte block of text; parallel
addition on 64-bit words allows all such instances to be processed at once.
% The remainder of this paper is organized as follows.
% Section 2 reviews the basics of parallel bitstream technology
% and introduces our new parallel scanning primitive.
% Section 3 illustrates how this primitive may be used
% in the lexical processing of XML references including the
% parallel identification of errors.   Section 4 goes on
% to consider the more complex task of XML tag processing
% that goes beyond mere tokenization.
% Building on these methods, Section 5 describes how to
% construct a
% complete solution to the problem of XML parsing and
% well-formedness checking, in order
% to gauge the applicability and power of the techniques.
% Section \ref{sec:compile} then considers
% the translation of high-level operations on unbounded bitstreams into
% equivalent low-level code using SIMD intrinsics in the C programming
% language.
% Performance results are presented in section 7, comparing
% the results of our generated implementations in comparison with
% a number of existing alternatives.
% The paper concludes with
% comments on the current status of the work and directions for
% further research.
\section{The Parallel Bitstream Method}\label{sec:parabit}

source data $\vartriangleleft$ & \verb`----173942---654----1----49731----321--`\\
$B_7$ & \verb`.......................................`\\
$B_6$ & \verb`.......................................`\\
$B_5$ & \verb`111111111111111111111111111111111111111`\\
$B_4$ & \verb`....111111...111....1....11111....111..`\\
$B_3$ & \verb`1111...1..111...1111.1111.1...1111...11`\\
$B_2$ & \verb`1111.1..1.1111111111.11111.1..1111...11`\\
$B_1$ & \verb`.....11..1...1.............11.....11...`\\
$B_0$ & \verb`11111111..111.1.111111111.111111111.111`\\
\verb:[0-9]: & \verb`....111111...111....1....11111....111..`\\
\caption{Basis and Character-Class Bitstreams}
A bitstream is simply a sequence of $0$s and $1$s, where there is one such bit in the bitstream for each character in a source data stream.
For parsing, and other text processing tasks, we need to consider multiple properties of characters at different stages during the parsing process. A bitstream can be associated with each of these properties, and hence there will be multiple (parallel) bitstreams associated with a source data stream of characters \cite{Cameron2009,PPoPP08}.

The starting point for bitstream methods is a set of \emph{basis} bitstreams
and their use in determining \emph{character-class} bitstreams.
The $k$th basis bitstream $B_k$ consists of the $k$th bit (0-based, starting at the LSB)
of each character in the source data stream;
thus each $B_k$ is dependent on the encoding of the source characters (ASCII, UTF-8, UTF-16, etc.).
Given these basis bitstreams, it is then possible to combine them
using bitwise logic in order to compute character-class
bitstreams, that is, streams that identify the positions at which characters belonging
to a particular class occur.  For example, the character class bitstream
$D=$\verb:[0-9]: marks with $1$s the positions at which decimal digits
occur.  These bitstreams are illustrated in Figure \ref{fig:inputstreams},
for an example source data stream consisting of digits and hyphens.
This figure also illustrates some of our conventions for figures: the left triangle $\vartriangleleft$ after
``source data'' indicates that all streams are read from right to left
(i.e., they are in little-endian notation).  We also use hyphens
in the input stream to represent any character that is not relevant to a character
class under consideration, so that relevant characters stand out.
Furthermore, the $0$ bits in the bitstreams are represented by periods,
so that the $1$ bits stand out.
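Since our parsing routines are prototyped in Python with unbounded integers standing in for arbitrary-length bitstreams, the transposition and character-class steps can be sketched the same way. In the sketch below (our own illustrative helper names, not drawn from any implementation), bit $i$ of each integer corresponds to character position $i$, matching the little-endian convention of the figures; the \verb:[0-9]: class is computed from the basis streams by bitwise logic alone.

```python
def basis_bitstreams(data: bytes):
    """Transpose a byte stream into eight basis bitstreams B[0]..B[7].

    Bit i of B[k] is bit k (0 = least significant) of byte i, so each
    integer is a little-endian bitstream: position 0 is the first byte.
    """
    B = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                B[k] |= 1 << i
    return B

def digit_class(B):
    """[0-9] character-class bitstream from the basis streams alone.

    ASCII digits are 0x30-0x39: bits 7 and 6 clear, bits 5 and 4 set,
    and a low nibble of at most 9 (bit 3 clear, or bits 2 and 1 clear).
    """
    return (~B[7] & ~B[6] & B[5] & B[4]) & (~B[3] | (~B[2] & ~B[1]))
```

For example, \verb|digit_class(basis_bitstreams(b"a1b23"))| yields \verb|0b11010|, marking positions 1, 3 and 4, where the digits occur.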
Transposition of source data to basis bitstreams and calculation
of character-class streams in this way is an overhead on parallel bitstream
applications, in general.  However, using the SIMD
capabilities of current commodity processors, these operations are quite
fast, with an amortized overhead of about 1 CPU cycle per byte for
transposition and less than 1 CPU cycle per byte for all the character
classes needed for XML parsing \cite{CameronHerdyLin2008}.
Improved instruction sets using parallel extract operations or
inductive doubling techniques may further reduce this overhead significantly \cite{CameronLin2009,HilewitzLee2006}.
Beyond the bitwise logic needed for character class determination,
we also need \emph{upshifting} to deal with sequential combination.
The upshift $n(S)$ of a bitstream $S$ is obtained by shifting the bits in $S$ one position forward,
then placing a $0$ bit in the starting position of the bitstream; $n$ is meant to be mnemonic of ``next''.
In $n(S)$, the last bit of $S$ may be eliminated or retained for error-testing purposes.
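In the unbounded-integer prototype, upshift is simply a left shift by one; a bounded variant (our own illustrative sketch) makes explicit the choice of eliminating the final bit:

```python
def n(S: int) -> int:
    """Upshift n(S): advance every bit one position forward.

    With little-endian bitstreams this is a left shift; a 0 enters
    automatically at the starting position."""
    return S << 1

def n_bounded(S: int, length: int) -> int:
    """Upshift over a bounded stream of the given length, eliminating
    the bit shifted past the end (it could instead be retained for
    error-testing purposes)."""
    return (S << 1) & ((1 << length) - 1)
```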
\subsection{A Parallel Scanning Primitive}

In this section, we introduce the principal new feature of the paper,
a parallel scanning method based on bitstream addition.  Key to this
method is the concept of {\em marker} bitstreams.
Marker bitstreams are used to represent positions of interest in the
scanning or parsing of a source data stream.
The appearance of a 1 at a position in a marker bitstream could, for example, denote
the starting position of an XML tag in the data stream.  In general, the set of
bit positions in a marker bitstream may be considered to be the current
parsing positions of multiple parses taking place in parallel throughout
the source data stream.
Figure \ref{fig:scan1} illustrates the basic concept
underlying parallel parsing with bitstream addition.
As with the previous figures, all streams are shown in little-endian
representation, reading from right to left.
The first row shows a source data stream that includes three
spans of digits, $13840$, $1139845$, and $127$, with other nondigit characters shown
as hyphens.  The second row specifies the parsing problem
using a marker bitstream $M_0$ to mark three initial
marker positions at the start of each span of digits.
The parallel parsing task is to move each
of the three markers forward through the corresponding spans of
digits to the immediately following positions.
\multicolumn{2}{l}{source data $\vartriangleleft$}
                          & \verb`--721----5489311-----04831------`\\
$M_0$ &                   & \verb`....1..........1.........1......`\\
$D$   & $= \verb`[0-9]`$  & \verb`..111....1111111.....11111......`\\
$M_1$ & $ = M_0 + D$      & \verb`.1......1...........1...........`
\caption{Bitstream addition}
The third row of Figure \ref{fig:scan1}
shows the derived character-class bitstream $D$ identifying
the positions of all digits in the source stream.
The fourth row then illustrates the key concept: marker movement
is achieved by binary addition of the marker and character
class bitstreams.  As a marker 1 bit is added to
a span of 1s, each 1 in the span becomes 0, generating
a carry to add to the next position to the left.
For each span, the process terminates at the left end
of the span, generating a 1 bit in the immediately
following position.  In this way, binary addition produces the marker bitstream
$M_1$, with each of the three markers
moved independently through its respective span of digits to the immediately following position.
However, the simple addition technique shown in Figure \ref{fig:scan1}
does not account for digits in the source stream that do
not play a role in a particular scanning operation.
Figure \ref{fig:scan2} shows an example and how this
may be resolved.  The source data stream is again shown in row 1,
and the marker bitstream defining the initial marker positions
for the parallel parsing tasks is shown in row 2.
Row 3 again contains the character class bitstream for digits $D$.
Row 4 shows the result of bitstream addition, in which
marker bits are advanced, but additional bits not
involved in the scan operation are included in the result.
However, these are easily removed in row 5, by masking
off any bits from the digit bitstream; these can never
be marker positions resulting from a scan.
The addition and masking technique allows matching of
the regular expression \verb:[0-9]*: for any reasonable
(conflict-free) set of initial markers specified in $M_0$.
\multicolumn{2}{l}{source data $\vartriangleleft$}
                                                        & \verb`--134--31--59127---3--3474--`\\
$M_0$ &                                                 & \verb`....1.........1..........1..`\\
$D$   & $= \verb`[0-9]`$ & \verb`..111..11..11111...1..1111..`\\
$M_1$ & $= M_0 + D$       & \verb`.1.....11.1....1...1.1......`\\
$M_2$ & $= (M_0 + D) \wedge \neg D$ & \verb`.1........1..........1......`
\caption{Parallel Scan Using Addition and Mask}
% The addition and masking technique allows matching of
% the regular expression \verb:[0-9]*: for any reasonable
% (conflict-free) set of initial markers specified in $M_0$.
% A conflict occurs when a span from one marker would run
% into another marker position.   However, such conflicts
% do not occur with the normal methods of marker bitstream
% formation, in which unique syntactic features of
% the input stream are used to specify the initial marker
% positions.
In the remainder of this paper, the notation $s(M, C)$
denotes the operation to scan
from an initial set of marker positions $M$ through
the spans of characters belonging to a character class $C$ found at each position.
\[s(M, C) = (M + C)  \wedge \neg C\]
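In the unbounded-integer prototype, the scan operation is a one-line function (an illustrative sketch using our own helper name):

```python
def s(M: int, C: int) -> int:
    """Scan: s(M, C) = (M + C) & ~C.

    Addition carries each marker bit in M through its span of 1s in C,
    leaving a 1 just past the left end of the span; masking with ~C then
    removes class bits that were not reached by any marker."""
    return (M + C) & ~C
```

For instance, a marker at position 1 with a digit span at positions 1--3 advances to position 4: \verb|s(0b0010, 0b1110)| gives \verb|0b10000|; a second span of class characters not covered by any marker is simply masked away.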
\section{XML Scanning and Parsing}

We now consider how the parallel scanning primitive can
be applied to the following problems in the scanning and
parsing of XML structures: (1) parallel scanning of XML decimal character references,
and (2) parallel parsing of XML start tags.
The grammar of these structures is shown in Figure \ref{fig:xmlgrmr}.

DecRef & ::=   &        '\verb:&#:' Digit+ '\verb:;:'  \\
Digit  & ::=   &         \verb:[0-9]:\\
STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
Attribute & ::=   &        Name WS? '=' WS? AttValue \\
AttValue  &           ::=   &        '\verb:":' \verb:[^<"]*: '\verb:":' $|$ '\verb:':' \verb:[^<']*: '\verb:':' \\
%DQuoted & ::= & \verb:[^<"]*:  \\
%SQuoted & ::= & \verb:[^<']*:
\caption{XML Grammar: Decimal Character References and Start Tags}
\multicolumn{2}{l}{source data $\vartriangleright$}
                                         & \verb`-&#978;-&9;--&#;--&#13!-`\\
$M_0$ &                                  & \verb`.1......1....1....1.....`\\
$M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
$E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
$M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
$E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
$M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
$E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
$M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
$E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
\caption{Parsing Decimal References}
Figure \ref{fig:decref} shows the parallel parsing of
decimal references together with error checking.
The source data includes four instances of potential
decimal references beginning with the \verb:&: character.
Of these, only the first one is legal according to
the decimal reference syntax; the other three instances
are in error.  These references may be parsed in
parallel as follows.  The
starting marker bitstream $M_0$ is formed from the \verb:[&]:
character-class bitstream as shown in the second row.  The next row shows the
result of the marker advance operation $n(M_0)$ to
produce the new marker bitstream $M_1$.  At this point,
a hash mark is required, so the first error bitstream $E_0$ is
formed using a bitwise ``and'' operation combined with negation,
to indicate violations of this condition.
Marker bitstream $M_2$ is then defined as those positions
immediately following any $M_1$ positions not in error.
In the following row, the condition that at least
one digit is required is checked to produce error bitstream $E_1$.
A parallel scan operation is then applied through the
digit sequences as shown in the next row to produce
marker bitstream $M_3$.  The final error bitstream $E_2$ is
produced to identify any references without a
closing semicolon.
In the penultimate row, the final marker bitstream $M_4$ marks the
positions of all fully-checked decimal references, while the
last row defines a unified error bitstream $E$
indicating the positions of all detected errors.
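The derivation of Figure \ref{fig:decref} translates almost line for line into the unbounded-integer prototype. The sketch below is our own illustration (the helper \verb|cc| builds a character-class bitstream directly from text, rather than from basis streams, purely for brevity); it returns the fully-checked marker stream $M_4$ and the unified error stream $E$.

```python
def cc(text: str, chars: str) -> int:
    """Character-class bitstream: bit i is set iff text[i] is in chars."""
    m = 0
    for i, ch in enumerate(text):
        if ch in chars:
            m |= 1 << i
    return m

def parse_decimal_refs(text: str):
    """Parallel parse of '&#' Digit+ ';' following M0..M4 and E0..E2."""
    mask = (1 << len(text)) - 1
    n = lambda S: (S << 1) & mask            # bounded upshift
    s = lambda M, C: ((M + C) & ~C) & mask   # parallel scan
    D = cc(text, "0123456789")
    M0 = cc(text, "&")                       # all potential references
    M1 = n(M0)
    E0 = M1 & ~cc(text, "#")                 # '#' must follow '&'
    M2 = n(M1 & ~E0)
    E1 = M2 & ~D                             # at least one digit required
    M3 = s(M2 & ~E1, D)
    E2 = M3 & ~cc(text, ";")                 # ';' must close the reference
    M4 = M3 & ~E2
    return M4, E0 | E1 | E2
```

On the source \verb|"-&#978;-&9;-"| this yields a match marker at the closing semicolon of \verb|&#978;| and an error bit at the position following the second \verb|&|.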
One question that may arise is: how are marker bitstreams initialized?  In general,
this is an important problem, and dependent on the task at hand.
In the XML parsing context,
we rely on an important property of well-formed
XML: after an initial filtering pass to identify
XML comments, processing instructions and CDATA
sections, every remaining \verb:<: in the
file must be the initial character of a start,
end or empty element tag, and every remaining \verb:&:
must be the initial character of a general entity
or character reference. These properties permit easy creation of
marker bitstreams for XML tags and XML references.
The parsing of XML start tags is a richer problem, involving
the sequential structure of attribute-value pairs as shown in Figure \ref{fig:xmlgrmr}.
Using the bitstream addition technique, our method
is to start with the opening angle brackets of all tags as
the initial marker bitstream for parsing the tags in parallel,
advance through the element names and then use an iterative
process to move through attribute-value pairs.
Figure \ref{fig:stag-ex}
illustrates the parallel parsing of three XML start tags.
The figure omits determination
of error bitstreams, processing of single-quoted attribute values and handling
of empty element tags, for simplicity.  In this figure, the first
four rows show the source data and three character class bitstreams:
$N$ for characters permitted in XML names,
$W$ for whitespace characters,
and $Q$ for characters permitted within a double-quoted attribute value string.
source data $\vartriangleright$ & \verb`--<e a= "137">---<el2 a="17" a2="3379">---<x>--`\\
$N = $ name chars & \verb`11.1.1...111..111.111.1..11..11..1111..111.1.11`\\
$W = $ white space & \verb`....1..1.............1......1..................`\\
$Q = \neg$\verb:["<]: & \verb`11.11111.111.1111.111111.11.1111.1111.1111.1111`\\
$M_0$ & \verb`..1..............1........................1....`\\
$M_1 = n(M_0)$ & \verb`...1..............1........................1...`\\
$M_{0,7} = s(M_1, N)$ & \verb`....1................1......................1..`\\
$M_{0,8} = s(M_{0,7}, W) \wedge \neg$\verb:[>]: & \verb`.....1................1........................`\\
$M_{1,1} = s(M_{0,8}, N)$ & \verb`......1................1.......................`\\
$M_{1,2} = s(M_{1,1}, W) \wedge$\verb:[=]: & \verb`......1................1.......................`\\
$M_{1,3} = n(M_{1,2})$ & \verb`.......1................1......................`\\
$M_{1,4} = s(M_{1,3}, W) \wedge$\verb:["]: & \verb`........1...............1......................`\\
$M_{1,5} = n(M_{1,4})$ & \verb`.........1...............1.....................`\\
$M_{1,6} = s(M_{1,5}, Q) \wedge$\verb:["]: & \verb`............1..............1...................`\\
$M_{1,7} = n(M_{1,6})$ & \verb`.............1..............1..................`\\
$M_{1,8} = s(M_{1,7}, W) \wedge \neg$\verb:[>]: & \verb`.............................1.................`\\
$M_{2,1} = s(M_{1,8}, N)$ & \verb`...............................1...............`\\
$M_{2,2} = s(M_{2,1}, W) \wedge$\verb:[=]: & \verb`...............................1...............`\\
$M_{2,3} = n(M_{2,2})$ & \verb`................................1..............`\\
$M_{2,4} = s(M_{2,3}, W) \wedge$\verb:["]: & \verb`................................1..............`\\
$M_{2,5} = n(M_{2,4})$ & \verb`.................................1.............`\\
$M_{2,6} = s(M_{2,5}, Q) \wedge$\verb:["]: & \verb`.....................................1.........`\\
$M_{2,7} = n(M_{2,6})$ & \verb`......................................1........`\\
$M_{2,8} = s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb`...............................................`
\caption{Start Tag Parsing}
The parsing process is illustrated in the remaining rows of the
figure.  Each successive row shows the set of parsing markers as they
advance in parallel using bitwise logic and addition.
Overall, the sets of marker transitions can be divided into three groups.

The first group,
$M_0$ through $M_{0,8}$, shows the initiation of parsing for each of the
tags through the
opening angle brackets and the element names, up to the first
attribute name, if present.  Note that there are no attribute names
in the final tag shown, so the corresponding marker becomes zeroed
out at the closing angle bracket.
Since $M_{0,8}$ is not all $0$s, the parsing continues.

The second group of marker transitions,
$M_{1,1}$ through $M_{1,8}$, deals with the parallel parsing of the first attribute-value
pair of the remaining tags.
After these operations, there are no more attributes
in the first tag, so its corresponding marker becomes zeroed out.
However, $M_{1,8}$ is not all $0$s, as the second tag still has an unparsed attribute-value pair.
Thus, the parsing continues.

The third group of marker transitions, $M_{2,1}$ through $M_{2,8}$, deals with the parsing of
the second attribute-value pair of this tag.  The
final transition to $M_{2,8}$ shows the zeroing out of all remaining markers
once two iterations of attribute-value processing have taken place.
Since $M_{2,8}$ is all $0$s, start tag parsing stops.
The implementation of start tag processing uses a while loop that
terminates when the set of active markers becomes zero,
i.e.\ when $M_{k,8} = 0$ for some $k$.
Viewed as an iteration over unbounded bitstreams, all start tags in the document
are processed in parallel, using a number of iterations equal to the maximum
number of attribute-value pairs in any one tag in the document.
However, in block-by-block processing, the cost of iteration is considerably reduced; the iteration for
each block only requires as many steps as there are attribute-value pairs
overlapping the block.
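As a concrete sketch of this loop in the unbounded-integer prototype (our own simplified code: double-quoted values only, no error streams, and character classes built directly from the text for illustration), the iteration below mirrors the transitions $M_{k,1}$ through $M_{k,8}$ of Figure \ref{fig:stag-ex} and returns the number of iterations performed:

```python
import string

def cc(text, chars):
    """Character-class bitstream: bit i is set iff text[i] is in chars."""
    m = 0
    for i, ch in enumerate(text):
        if ch in chars:
            m |= 1 << i
    return m

def parse_start_tags(text):
    """Parse all start tags in parallel; return the iteration count,
    i.e. the maximum number of attribute-value pairs in any one tag."""
    mask = (1 << len(text)) - 1
    n = lambda S: (S << 1) & mask
    s = lambda M, C: ((M + C) & ~C) & mask
    N = cc(text, string.ascii_letters + string.digits + "_-.:")  # name chars
    W = cc(text, " \t\r\n")                                      # whitespace
    Q = mask & ~cc(text, '"<')             # chars allowed in a quoted value
    eq, dq, gt = cc(text, "="), cc(text, '"'), cc(text, ">")
    M = s(n(cc(text, "<")), N)             # through '<' and the element name
    M = s(M, W) & ~gt                      # attribute-free tags drop out
    iters = 0
    while M:                               # one pass per attribute-value pair
        M = s(M, N)                        # attribute name
        M = s(M, W) & eq                   # optional space, then '='
        M = s(n(M), W) & dq                # optional space, opening '"'
        M = s(n(M), Q) & dq                # value, then closing '"'
        M = s(n(M), W) & ~gt               # continue, or zero out at '>'
        iters += 1
    return iters
```

On the three-tag example of the figure, this loop performs two iterations before all markers are zeroed out, matching $M_{2,8} = 0$.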
%\subsection{Name Scans}
%To illustrate the scanning of the name found in an XML start tag,
%let us consider a sequence that might be found in an HTML file,
%\verb:<div id="myid">:,
%which is shown as the source data stream in Figure \ref{fig:stag-scan}.
%source data & \verb:<div id="myid">:\\
%$M_0$ & \verb`1..............`\\
%$C_0$ & \verb`.111.11...1111.`\\
%$M_1 = n(M_0)$ & \verb`.1.............`\\
%$M_2 = s(M_1, C_0) \wedge \neg C_0$ & \verb`....1.........`\\
%lastline & \verb`...............`
%\caption{Scanning Names}
%If we set our initial marker bitstream according to the procedure outlined in our discussion of marker bitstream initialization, we
%obtain the bitstream $M_0$.
%According to the grammar in Figure \ref{fig:stag-grmr}, we can then look for a \verb:Name: in an \verb:STag: after we have found a
%\verb:<:.
%So, $M_1$ is the marker bitstream for the starting position of our name.
%Although we do not know the length of the name, the $C_0$ bit vector can easily be set to $1$ for the characters that can be contained
%in a name.
%We can then use the scan function in a manner similar to how it was used in Figure \ref{fig:scan2} to scan through the entire name to
%identify its end position.
Following the pattern shown here, the remaining syntactic
features of XML markup can similarly be parsed with
bitstream based methods.  One complexity is that the
parsing of comments,
CDATA sections and processing instructions must be
performed first to determine those regions of text
within which ordinary XML markup is not parsed (i.e.,
within each of these types of construct).  This is handled
by first performing the parsing of these structures and
then forming a {\em mask bitstream}, that is, a stream that
identifies spans of text to be excluded from parsing
(comment and CDATA interiors, parameter text to processing instructions).
\section{XML Well-Formedness}

In this section, we consider the full application of the parsing techniques
of the previous section to the problem of XML well-formedness checking \cite{TR:XML}.
% This application is useful as a well-defined and commercially significant
% example to assess the overall applicability of parallel bitstream techniques.
% To what extent can the well-formedness requirements of XML be
% completely discharged using parallel bitstream techniques?
% Are those techniques worthwhile in every instance, or
% do better alternatives exist for certain requirements?
% For those requirements that cannot be fully implemented
% using parallel bitstream technology alone, what
% preprocessing support can be offered by parallel bitstream
% technology to the discharge of these requirements in other ways?
% We address each of these questions in this section,
% and look not only at the question of well-formedness, but also at
We look not only at the question of well-formedness, but also at
the identification of error positions in documents that
are not well-formed.
%\subsection{Error and Error-Check Bitstreams}

Most of the requirements of XML well-formedness checking
can be implemented using two particular types of computed
bitstream: {\em error bitstreams}, introduced in the previous section, and {\em error-check bitstreams}.
Recall that an error bitstream is a stream marking the location of definite errors in accordance with
a particular requirement.  For example, the
$E_0$, $E_1$, and $E_2$ bitstreams as computed during parsing of
decimal character references in Figure \ref{fig:decref}
are error bitstreams.  One bits mark definite errors and zero bits mark the
absence of an error.
% absence of error according to the requirement.
Thus the complete absence of errors according to the
requirements listed may be determined by forming the
bitwise logical ``or'' of these bitstreams and confirming
that the resulting value is zero. An error-check bitstream is one
that marks potential errors to be further checked in
some fashion during post-bitstream processing.
An example is the bitstream marking the start positions
of CDATA sections.  This stream is computed
during bitstream processing to identify opening
\verb:<![: sequences, but each marked position must be
subsequently checked for the complete opening
delimiter \verb:<![CDATA[:.
In typical documents, most of these error-check streams will be quite sparse
% or even zero.   Many of the error conditions could
or even zero.  Many error conditions could
actually be fully implemented using bitstream techniques,
but at the cost of a number of additional logical and shift
operations.  In general, these conditions are
easier and more efficient to check one at a time using
multibyte comparisons on the original source data stream.
With very sparse streams, it is very unlikely that
multiple instances occur within any given block, thus
eliminating the benefit of parallel evaluation of the logic.
The requirement for name checking merits comment.   XML
names may use a wide range of Unicode character values.
It is too expensive to check every instance of an XML name
against the full range of possible values.   However, it is
possible and quite inexpensive to use parallel bitstream
techniques to verify that any ASCII characters within a name
are indeed legal name start characters or name characters.
Furthermore, the characters that may legally follow a
name in XML are confined to the ASCII range.  This makes
it useful to define a name scan character class to include all the legal ASCII characters
for names as well as all non-ASCII characters.
A namecheck character class bitstream is then defined to identify non-ASCII
characters found within name scans.   In most documents
this bitstream will be all $0$s; even in documents with substantial
internationalized content, the tag and attribute names used
to define the document schema tend to be confined to the
ASCII repertoire.   In the case that this bitstream is nonempty,
the positions of all 1 bits in it denote characters
that need to be individually validated.
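A sketch of the namecheck computation, again with Python integers standing in
for unbounded bitstreams (an illustrative reconstruction, not the paper's
code).  Note that the name span is obtained by the subtraction technique for
span streams:

```python
# Sketch only: bit i of each integer marks byte position i of the UTF-8 input.

NAME_ASCII = set(b"-._:0123456789"
                 b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

def char_class(data: bytes, pred) -> int:
    s = 0
    for i, b in enumerate(data):
        if pred(b):
            s |= 1 << i
    return s

data = "<döc a='1'/>".encode("utf-8")
# Name scan class: legal ASCII name characters plus every non-ASCII byte.
name_scan = char_class(data, lambda b: b in NAME_ASCII or b >= 0x80)
non_ascii = char_class(data, lambda b: b >= 0x80)

start = char_class(data, lambda b: b == ord("<")) << 1  # first name byte
after = (start + name_scan) & ~name_scan                # scan through the name
name_span = after - start                               # span via subtraction
namecheck = name_span & non_ascii     # non-ASCII name bytes to validate later
```

Here \verb:namecheck: has 1 bits only at the two UTF-8 bytes of the non-ASCII
character in the tag name; a fully ASCII document yields an all-$0$ stream.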
Attribute names within a single XML start tag or empty
element tag must be unique.  This requirement could be
implemented using one of several standard approaches,
including sequential search, symbol lookup, and Bloom filters.
% In general, the use of error-check bitstreams is a straightforward,
% convenient and reasonably efficient mechanism for
% checking the well-formedness requirements.
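For illustration, a minimal post-bitstream uniqueness check using symbol
lookup, one of the standard approaches named above.  The helper and its
interface are hypothetical, not part of the described implementation:

```python
# Sketch only: attribute names for one tag, as extracted in post-bitstream
# processing, are checked for duplicates via a hash-based symbol lookup.

def unique_attribute_names(names) -> bool:
    seen = set()
    for n in names:
        if n in seen:        # duplicate attribute name: not well-formed
            return False
        seen.add(n)
    return True
```

For example, \verb:[b"id", b"class", b"id"]: fails the check while
\verb:[b"id", b"class"]: passes.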
%\subsection{Tag Matching}

Except for empty element tags, XML tags come in pairs with
names that must be matched.   To discharge this requirement,
we form a bitstream consisting of the disjunction of three
bitstreams formed during parsing: the bitstream marking the
positions of start or empty tags (which have a common
initial structure), the bitstream marking tags that end using
the empty tag syntax (``\verb:/>:''), and the bitstream
marking the occurrences of end tags.   In post-bitstream
processing, we iterate through this computed bitstream
and match tags using an iterative stack-based approach.
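The post-bitstream matching pass can be sketched as follows, assuming
\verb:tag_marks: is the disjunction just described (bit $i$ marking byte
position $i$) and that names are re-extracted from the source bytes.  This is
an illustrative reconstruction, not the paper's implementation:

```python
import re

# Name bytes: anything but whitespace, '/' or '>' (simplified for the sketch).
NAME = re.compile(rb"[^\s/>]+")

def bit_positions(stream: int):
    """Yield set-bit positions, earliest document position first."""
    while stream:
        low = stream & -stream
        yield low.bit_length() - 1
        stream ^= low

def tags_match(data: bytes, tag_marks: int) -> bool:
    """Iterative stack-based matching over the combined tag bitstream."""
    stack = []
    for pos in bit_positions(tag_marks):
        if data.startswith(b"</", pos):            # end tag: must match top
            name = NAME.match(data, pos + 2).group()
            if not stack or stack.pop() != name:
                return False
        elif data.startswith(b"/>", pos):          # empty-element close: pop
            if not stack:
                return False
            stack.pop()
        else:                                      # start or empty tag open
            stack.append(NAME.match(data, pos + 1).group())
    return not stack
```

For \verb:b"<a><b/></a>": with marks at positions 0, 3, 5 and 7 the stack
empties and the check succeeds; a mismatched pair such as
\verb:b"<a></b>": fails.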
%\subsection{Document Structure}

An XML document consists of a single root element with recursively
defined structure together with material in the document
prolog and epilog.  Verifying this top-level structure and
the structure of the prolog and epilog material is not
well suited to parallel bitstream techniques in particular, nor
to any form of parallelism in general.  In essence, the
prolog and epilog materials occur once per document instance.
Thus the requirements to check this top-level structure
for well-formedness are relatively minor, with an overhead
that is quite low for files of sufficient size.
Overall, parallel bitstream techniques are quite well suited to
verification problems such as XML well-formedness checking.
Many of the character validation and syntax checking requirements
can be conveniently and efficiently implemented using error streams.
Other requirements are also supported by the computation of
error-check streams for simple post-bitstream processing or
composite streams over which iterative stack-based procedures
can be defined for checking recursive syntax.
\section{Compilation to Block-Based Processing}

While a Python implementation of the techniques described in the previous
section works on unbounded bitstreams, a corresponding
C implementation needs to process an input stream in blocks of size equal to the
SIMD register width of the processor it runs on.
To convert Python code into C, the key question becomes how
to transfer information from one block to the next.
The answer lies in the use of {\em carry bits}, the collection of carries resulting from bitwise additions.
In fact, in the methods we have outlined, all the
information flow between blocks for parallel bitstream
calculations can be modeled using carry bits.   The parallel
scanning primitive uses only addition and bitwise logic.
Since the logic operations do not require information flow
across block boundaries, the information flow is entirely
accounted for by the carry.   Carry bits can also be used to
capture the information flow associated with upshift
operations, which move information forward one position
in the file.   In essence, an upshift by one position for
a bitstream is equivalent to the addition of the stream
to itself; the bit shifted out in an upshift is in this
case equivalent to the carry generated by the addition.
The only other information flow requirement in the
calculation of parallel bitstreams occurs with the
bitstream subtractions that are used to calculate span streams.
In this case, the information flow is based on borrows
generated, which can be handled in the same way as carries.
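The carry principle can be sketched directly, assuming 64-bit blocks: adding
two streams block-by-block while propagating each block's carry-out as the
next block's carry-in reproduces the unbounded addition, so an upshift
(a stream added to itself) crosses block boundaries correctly.

```python
# Sketch only: block-by-block addition of unbounded bitstreams with an
# explicit carry bit, mimicking what the block-based C code must do.

BLOCK = 64
MASK = (1 << BLOCK) - 1

def blocked_add(a: int, b: int, nblocks: int) -> int:
    result, carry = 0, 0
    for i in range(nblocks):
        lo = i * BLOCK
        s = ((a >> lo) & MASK) + ((b >> lo) & MASK) + carry
        carry = s >> BLOCK            # carry out of this block ...
        result |= (s & MASK) << lo    # ... is the carry in to the next
    return result

a = (1 << 63) | 5                     # a mark at the last bit of block 0
assert blocked_add(a, a, 2) == a + a  # upshift by one crosses the boundary
```

Span-stream subtractions are handled analogously, with a borrow bit in place
of the carry.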
Properly determining, initializing and inserting carry bits
into a block-by-block implementation of parallel bitstream
code is a task too tedious for manual implementation.
We have thus developed compiler technology to automatically
insert declarations, initializations and carry save/restore
operations into appropriate locations when translating
Python operations on unbounded bitstreams into the
equivalent low-level C code implemented on a block-by-block
basis.  Our current compiler toolkit is capable of inserting
carry logic using a variety of strategies, including
simulated carry bit processing with SIMD registers as
well as carry-flag processing using the processor general
purpose registers and ALU.   Details are beyond the
scope of this paper, but are described in the on-line
source code repository.
\section{Performance Results}

In this section, we compare the performance of our \verb:xmlwf:
implementation using the Parabix2 technology described above with several
other implementations.
These include the original \verb:xmlwf:
distributed as an example application of the \verb:expat: XML
parser, implementations based on the widely used Xerces
open source parser using both SAX and DOM interfaces,
and an implementation using our prior Parabix1 technology with
bit scan operations.
Table \ref{XMLDocChars}
shows the document characteristics of the XML instances selected for this performance study,
including both document-oriented and data-oriented XML files.
The jawiki.xml and dewiki.xml files are document-oriented XML instances of Wikimedia books, written in Japanese and German, respectively.  The remaining files are data-oriented.  The roads.gml file is an instance of Geography Markup Language (GML),
a modeling language for geographic information systems as well as an open interchange format for geographic transactions on the Internet \cite{GML04}.  The po.xml file is an example of purchase order data, while the soap.xml file contains a large SOAP message.
Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.
This metric is reported for each document.
\begin{table}
\begin{center}
\begin{tabular}{|l|r|r|r|r|r|}
\hline
File Name           & dewiki.xml & jawiki.xml & roads.gml & po.xml  & soap.xml \\ \hline
File Type           & document   & document   & data      & data    & data \\ \hline
File Size (kB)      & 66240      & 7343       & 11584     & 76450   & 2717 \\ \hline
Markup Item Count   & 406792     & 74882      & 280724    & 4634110 & 18004 \\ \hline
Attribute Count     & 18808      & 3529       & 160416    & 463397  & 30001 \\ \hline
Avg. Attribute Size & 8          & 8          & 6         & 5       & 9 \\ \hline
Markup Density      & 0.07       & 0.13       & 0.57      & 0.76    & 0.87 \\ \hline
\end{tabular}
\end{center}
\caption{XML Document Characteristics}
\label{XMLDocChars}
\end{table}
Table \ref{parsers-cpb} shows performance measurements for the
various \verb:xmlwf: implementations applied to the
test suite.   Measurements are made on a single core of an
Intel Core 2 system running a stock 64-bit Ubuntu 10.10 operating system,
with all applications compiled with llvm-gcc 4.4.5 optimization level 3.
Measurements are reported in CPU cycles per input byte of
the XML data files in each case.
The first row shows the performance of the Xerces C parser
using the tree-building DOM interface.
Note that the performance
varies considerably depending on markup density.  Note also that
the DOM tree construction overhead is substantial and unnecessary
for XML well-formedness checking.  Using the event-based SAX interface
to Xerces gives much better results as shown in the
second row.
The third row shows the best performance of our byte-at-a-time
parsers, using the original \verb:xmlwf: based on expat.
The remaining rows of Table \ref{parsers-cpb} show the performance
of parallel bitstream implementations.  The first of these shows
the performance of our Parabix1 implementation using
bit scan instructions.   While showing a substantial speed-up
over the byte-at-a-time parsers in every case, note also that
the performance advantage increases with increasing markup
density, as expected.   The last two rows show different versions of
the \verb:xmlwf: implementation based on the Parabix2 technology
as discussed in this paper.   They differ in the carry handling
strategy, with the ``simd\_add'' row referring to carry
computations performed with simulated calculation of
propagated and generated carries using SIMD operations, while the
``adc64'' row refers to an implementation directly employing
the processor carry flags and add-with-carry instructions on
64-bit general registers.  In both cases, the overall
performance is quite impressive, with the increased
parallelism of parallel bit scans clearly paying off in
improved performance for dense markup.
\begin{table}
\begin{center}
\begin{tabular}{|l|l|r|r|r|r|r|}
\hline
Parser Class & Parser & dewiki.xml & jawiki.xml & roads.gml & po.xml & soap.xml \\ \hline
\multirow{3}{*}{Byte-at-a-time} & Xerces (DOM) & 37.921 & 40.559 & 72.78 & 105.497 & 125.929 \\ \cline{2-7}
& Xerces (SAX) & 19.829 & 24.883 & 33.435 & 46.891 & 57.119 \\ \cline{2-7}
& expat & 12.639 & 16.535 & 32.717 & 42.982 & 51.468 \\ \hline
\multirow{3}{*}{Parallel Bit Stream} & Parabix1 & 8.313 & 9.335 & 13.345 & 16.136 & 19.047 \\ \cline{2-7}
& Parabix2 (simd\_add) & 6.103 & 6.445 & 8.034 & 8.685 & 9.53 \\ \cline{2-7}
& Parabix2 (adc64) & 5.123 & 5.996 & 6.852 & 7.648 & 8.275 \\ \hline
\end{tabular}
\end{center}
\caption{Parser Performance (Cycles Per Byte)}
\label{parsers-cpb}
\end{table}
%gcc (simd\_add)    &   6.174   &       6.405   &       7.948   &       8.565   &       9.172 \\ \hline
%llvm (simd\_add)   &   6.104   &       6.335   &       8.332   &       8.849   &       9.811 \\ \hline
%gcc (adc64)        &   9.23   &        9.921   &       10.394   &      10.705   &      11.751 \\ \hline
%llvm (adc64)       &   5.757   &       6.142   &       6.763   &       7.424   &       7.952 \\ \hline
%gcc (SAHFLAHF)    &    7.951   &       8.539   &       9.984   &       10.219   &      11.388 \\ \hline
%llvm(SAHFLAHF)    &    5.61   &        6.02   &        6.901   &       7.597   &       8.183 \\ \hline
\section{Conclusion}
In application to the problem of XML parsing and well-formedness
checking, the method of parallel parsing with bitstream addition
is effective and efficient.   Using only bitstream addition
and bitwise logic, it is possible to handle all of the
character validation, lexical recognition and parsing problems
except for the recursive aspects of start and end tag matching.
Error checking is elegantly supported through the use of error
streams that eliminate separate if-statements to check for
errors with each byte.   The techniques are generally very
efficient, particularly when markup density is high.   However, for some
conditions that occur rarely and/or require complex combinations
of upshifting and logic, it may be better to define simpler
error-check streams that require limited postprocessing using
byte matching techniques.
The techniques have been implemented and assessed for present-day commodity processors employing current SIMD technology.
As processor advances bring improved instruction sets and increases
in the width of SIMD registers, the relative advantages of the
techniques over traditional byte-at-a-time sequential
parsing methods are likely to increase substantially.
Instruction set modifications that provide more convenient carry
propagation for long bitstream arithmetic would be of particular
benefit to this method.
A significant challenge to the application of these techniques
is the difficulty of programming.   The method of prototyping
on unbounded bitstreams has proven to be of significant value
in our work.   Using the prototyping language as input to
a bitstream compiler has also proven effective in generating
high-performance code.   Nevertheless, direct programming
with bitstreams is still a specialized skill; our future
research includes developing yet higher-level tools to
generate efficient bitstream implementations from grammars,
regular expressions and other text processing formalisms.