source: docs/Balisage13/Bal2013came0601/Bal2013came0601.html @ 3041

Last change on this file since 3041 was 3041, checked in by ksherdy, 6 years ago

Replaced verbatim with code tags, replaced math escape with appropriate tags, added unicode escapes, added personal blurbs.

File size: 52.8 KB
1<html lang="en">
3<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
5<link rel="stylesheet" href="balisage-plain.css" type="text/css">
6<meta name="keywords" content="">
9<div id="balisage-header"><h1 style="text-align: right; font-family: serif; margin:0.25em">
10<i>Balisage:</i> <small>The Markup Conference</small>
12<div lang="en" class="article">
13<div class="titlepage">
14<h2 class="article-title" id="idp26800"></h2>
15<div class="author">
16<h3 class="author">Nigel Medforth</h3>
17<div class="affiliation">
18<p class="jobtitle">Developer</p>
19<p class="orgname">International Characters Inc.</p>
21<div class="affiliation">
22<p class="jobtitle">Graduate Student, School of Computing Science</p>
23<p class="orgname">Simon Fraser University </p>
25<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
27<div class="author">
28<h3 class="author">Dan Lin</h3>
29<div class="affiliation">
30<p class="jobtitle"></p>
31<p class="orgname"></p>
33<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
35<div class="author">
36<h3 class="author">Kenneth Herdy</h3>
37<div class="affiliation">
38<p class="jobtitle">Graduate Student, School of Computing Science</p>
39<p class="orgname">Simon Fraser University </p>
41<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
43<div class="author">
44<h3 class="author">Rob Cameron</h3>
45<div class="affiliation">
46<p class="jobtitle">Professor of Computing Science</p>
47<p class="orgname">Simon Fraser University</p>
49<div class="affiliation">
50<p class="jobtitle">Chief Technology Officer</p>
51<p class="orgname">International Characters, Inc.</p>
53<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
55<div class="author">
56<h3 class="author">Arrvindh Shriraman</h3>
57<div class="affiliation">
58<p class="jobtitle"></p>
59<p class="orgname"></p>
61<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
63<div class="abstract">
64<p class="title"><b>Abstract</b></p>
65<p id="idp28432">Prior research on the acceleration of XML processing
66using SIMD and multi-core parallelism has lead to
67a number of interesting research prototypes.  This work
68investigates the extent to which the techniques underlying
69these prototypes could result in systematic performance
70benefits when fully integrated into a commercial XML parser.
71The widely used Xerces-C++ parser of the Apache Software
72Foundation was chosen as the foundation for the study.
73A systematic restructuring of the parser was undertaken,
74while maintaining the existing API for application programmers.
75Using SIMD techniques alone, an increase in parsing speed
76of at least 50% was observed in a range of applications.
77When coupled with pipeline parallelism on dual core processors,
78improvements of 2x and beyond were realized.
83<div class="toc">
84<p><b>Table of Contents</b></p>
86<dt><span class="section"><a href="#idp15032" class="toc">Introduction</a></span></dt>
87<dt><span class="section"><a href="#idp15928" class="toc">Background</a></span></dt>
89<dt><span class="section"><a href="#idp16248" class="toc">Xerces C++ Structure</a></span></dt>
90<dt><span class="section"><a href="#idp22392" class="toc">The Parabix Framework</a></span></dt>
91<dt><span class="section"><a href="#idp178296" class="toc">Sequential vs. Parallel Paradigm</a></span></dt>
93<dt><span class="section"><a href="#idp181248" class="toc">Architecture</a></span></dt>
95<dt><span class="section"><a href="#idp181568" class="toc">Overview</a></span></dt>
96<dt><span class="section"><a href="#idp195064" class="toc">Character Set Adapters</a></span></dt>
97<dt><span class="section"><a href="#idm9304" class="toc">Combined Parallel Filtering</a></span></dt>
98<dt><span class="section"><a href="#idp221352" class="toc">Content Stream</a></span></dt>
99<dt><span class="section"><a href="#idp228528" class="toc">Namespace Handling</a></span></dt>
100<dt><span class="section"><a href="#idp240160" class="toc">Error Handling</a></span></dt>
102<dt><span class="section"><a href="#idp247200" class="toc">Multithreading with Pipeline Parallelism</a></span></dt>
103<dt><span class="section"><a href="#idp257920" class="toc">Performance</a></span></dt>
105<dt><span class="section"><a href="#idp259800" class="toc">Xerces C++ SAXCount</a></span></dt>
106<dt><span class="section"><a href="#idp264184" class="toc">GML2SVG</a></span></dt>
108<dt><span class="section"><a href="#idp264760" class="toc">Conclusion and Future Work</a></span></dt>
111<div class="section" id="idp15032">
112<h2 class="title" style="clear: both">Introduction</h2>
113<p id="idp15352"></p>
114<p id="idp15480"></p>
115<p id="idp15608"></p>
116<p id="idp15736"></p>
118<div class="section" id="idp15928">
119<h2 class="title" style="clear: both">Background</h2>
120<div class="section" id="idp16248">
121<h3 class="title" style="clear: both">Xerces C++ Structure</h3>
122<p id="idp16568">
123The Xerces C++ parser
128features comprehensive support for a variety of character encodings
129both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
130multiple XML vocabularies through the XML namespace
131mechanism, as well as complete implementations
132of structure and data validation through multiple grammars
133declared using either legacy DTDs (document type
134definitions) or modern XML Schema facilities.
135Xerces also supports several APIs for accessing
136parser services, including event-based parsing
137using either pull parsing or SAX/SAX2 push-style
138parsing as well as a DOM tree-based parsing interface.
140<p id="idp18112">
144Xerces, like all traditional parsers, processes XML documents sequentially a byte-at-a-time from the
145first to the last byte of input data. Each byte passes through several processing layers and is
146classified and eventually validated within the context of the document state.
147This introduces implicit dependencies between the various tasks within the application that make it
148difficult to optimize for performance.
149As a complex software system, no one feature dominates the overall parsing performance.
150Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
151Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
152parallelization in isolation would only produce a minute improvement in performance.
153Unfortunately, early investigation into these functions found
154that incorporating speculation-free thread-level parallelization was impossible
155and they were already performing well in their given tasks;
156thus only trivial enhancements were attainable.
157In order to obtain a systematic acceleration of Xerces,
158it should be expected that a comprehensive restructuring
159is required, involving all aspects of the parser.
161<p id="idp21496">
168<div class="section" id="idp22392">
169<h3 class="title" style="clear: both">The Parabix Framework</h3>
170<p id="idp22712">
171The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
172(and other forms of text processing.) The key idea is to exploit the availability of wide
173SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
174of input data by using one register bit per single input byte.
175To facilitate this, the input data is first transposed into a set of basis bit streams.
178Boolean-logic operations\footnote{∧, \√ and ¬ denote the boolean AND, OR and NOT operators.}
179are used to classify the input bits into a set of <span class="ital">character-class bit streams</span>, which identify key
180characters (or groups of characters) with a <code class="code">1</code>.
181For example, one of the fundamental characters in XML is a left-angle bracket.
182A character is an <code class="code">'&lt;' if and only if ¬(b<sub>0</sub> âˆš b<sub>1</sub>) ∧ (b<sub>2</sub> âˆ§
183b<sub>3</sub>) ∧ (b<sub>4</sub> âˆ§ b<sub>5</sub>) ∧ ¬ (b<sub>6</sub> âˆš b<sub>7</sub>) = 1</code>.
184Similarly, a character is numeric,
185<code class="code">[0-9] if and only if ¬(b<sub>0</sub> âˆš b<sub>1</sub>) ∧ (b<sub>2</sub> âˆ§ b<sub>3</sub>) ∧ ¬(b<sub>4</sub> âˆ§ (b<sub>5</sub> âˆš b<sub>6</sub>))</code>.
186An important observation here is that ranges of characters may
187require fewer operations than individual characters and
189multiple classes can share the classification cost.
191<p id="idp170240">
194<p id="idp171696">
195Consider, for example, the XML source data stream shown in the first line of .
196The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
197parsing, with each bit of each stream in one-to-one correspondence to the source character code units
198of the input stream.
199For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
200The first bit stream shown is that for the opening
201angle brackets that represent tag openers in XML.
202The second and third streams show a partition of the
203tag openers into start tag marks and end tag marks
204depending on the character immediately following the
205opener (i.e., <code class="code">"/"</code>) or not.  The remaining three
206lines show streams that can be computed in subsequent
207parsing (using the technique
208of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
209attribute names and attribute values of tags. 
211<p id="idp175032">
212Two intuitions may help explain how the Parabix approach can lead
213to improved XML parsing performance. The first is that
214the use of the full register width offers a considerable
215information advantage over sequential byte-at-a-time
216parsing.  That is, sequential processing of bytes
217uses just 8 bits of each register, greatly limiting the
218processor resources that are effectively being used at any one time.
219The second is that byte-at-a-time loop scanning loops are actually
220often just computing a single bit of information per iteration:
221is the scan complete yet?
222Rather than computing these individual decision-bits, an approach that computes
223many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
224should provide substantial benefit.
226<p id="idp175992">
227Previous studies have shown that the Parabix approach improves many aspects of XML processing,
228including transcoding \cite{Cameron2008}, character classification and validation,
229tag parsing and well-formedness checking. 
230The first Parabix parser used processor bit scan instructions to considerably accelerate
231sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
232Recent work has incorporated a method of parallel
233scanning using bitstream addition \cite{cameron-EuroPar2011}, as
234well as combining SIMD methods with 4-stage pipeline parallelism to further improve
235throughput \cite{HPCA2012}.
236Although these research prototypes handled the full syntax of schema-less XML documents,
237they lacked the functionality required by full XML parsers.
239<p id="idp177744">
240Commercial XML processors support transcoding of multiple character sets and can parse and
241validate against multiple document vocabularies.
242Additionally, they provide API facilities beyond those found in research prototypes,
243including the widely used SAX, SAX2 and DOM interfaces.
246<div class="section" id="idp178296">
247<h3 class="title" style="clear: both">Sequential vs. Parallel Paradigm</h3>
248<p id="idp178616">
249Xerces—like all traditional XML parsers—processes XML documents sequentially.
250Each character is examined to distinguish between the
251XML-specific markup, such as a left angle bracket <code class="code">&lt;</code>, and the
252content held within the document. 
253As the parser progresses through the document, it alternates between markup scanning,
254validation and content processing modes.
256<p id="idp179688">
257In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
258  Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
259  current state and the events associated with transitions of states.}.
260Each state transition indicates the processing context of subsequent characters.
261Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
263<p id="idp180352">
264Parabix-style XML parsers utilize a concept of layered processing.
265A block of source text is transformed into a set of lexical bitstreams,
266which undergo a series of operations that can be grouped into logical layers,
267e.g., transposition, character classification, and lexical analysis.
268Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
269To meet the API requirements of the document-ordered Xerces output,
270the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
274<div class="section" id="idp181248">
275<h2 class="title" style="clear: both">Architecture</h2>
276<div class="section" id="idp181568">
277<h3 class="title" style="clear: both">Overview</h3>
278<p id="idp182016">
279icXML is more than an optimized version of Xerces. Many components were grouped, restructured and
280rearchitected with pipeline parallelism in mind.
281In this section, we highlight the core differences between the two systems.
282As shown in Figure \ref{fig:xerces-arch}, Xerces
283is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
284The <span class="ital">Transcoder</span> converts source data into UTF-16 before Xerces parses it as XML;
285the majority of the character set encoding validation is performed as a byproduct of this process.
286The <span class="ital">Reader</span> is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
287It tracks the current line/column position,
289performs line-break normalization and validates context-specific character set issues,
290such as tokenization of qualified-names.
291The <span class="ital">Scanner</span> pulls data through the reader and constructs the intermediate representation (IR)
292of the document; it deals with all issues related to entity expansion, validates
293the XML well-formedness constraints and any character set encoding issues that cannot
294be completely handled by the reader or transcoder (e.g., surrogate characters, validation
295and normalization of character references, etc.)
296The <span class="ital">Namespace Binder</span> is a core piece of the element stack.
297It handles namespace scoping issues between different XML vocabularies.
298This allows the scanner to properly select the correct schema grammar structures.
299The <span class="ital">Validator</span> takes the IR produced by the Scanner (and
300potentially annotated by the Namespace Binder) and assesses whether the final output matches
301the user-defined DTD and schema grammar(s) before passing it to the end-user.
303<p id="idp185640">
306<p id="idp185896">
307In icXML functions are grouped into logical components.
308As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the Parabix Subsystem and (2) the Markup Processor.
309All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bitstreams.
310The <span class="ital">Character Set Adapter</span>, discussed in Section \ref{arch:character-set-adapter},
311mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
312set of lexical bitstreams, similar to those shown in Figure \ref{fig:parabix1}.
313These lexical bitstreams are later transformed into UTF-16 in the Content Stream Generator,
314after additional processing is performed.
315The first precursor to producing UTF-16 is the <span class="ital">Parallel Markup Parser</span> phase.
316It takes the lexical streams and produces a set of marker bitstreams in which a 1-bit identifies
317significant positions within the input data. One bitstream for each of the critical piece of information is created, such as
318the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
319Intra-element well-formedness validation is performed as an artifact of this process.
320Like Xerces, icXML must provide the Line and Column position of each error.
321The <span class="ital">Line-Column Tracker</span> uses the lexical information to keep track of the document position(s) through the use of an
322optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
323From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
325<p id="idp188856">
326A typical XML file contains few unique element and attribute names—but each of them will occur frequently.
327icXML stores these as distinct data structures, called symbols, each with their own global identifier (GID).
328Using the symbol marker streams produced by the Parallel Markup Parser, the <span class="ital">Symbol Resolver</span> scans through
329the raw data to produce a sequence of GIDs, called the <span class="ital">symbol stream</span>.
331<p id="idp190496">
332The final components of the Parabix Subsystem are the <span class="ital">Content Preparation Unit</span> and <span class="ital">Content Stream Generator</span>.
333The former takes the (transposed) basis bitstreams and selectively filters them, according to the
334information provided by the Parallel Markup Parser, and the latter transforms the
335filtered streams into the tagged UTF-16 <span class="ital">content stream</span>, discussed in Section \ref{section:arch:contentstream}.
337<p id="idp191984">
338Combined, the symbol and content stream form icXML's compressed IR of the XML document.
339The <span class="ital">Markup Processor</span>~parses the IR to validate and produce the sequential output for the end user.
340The <span class="ital">Final WF checker</span> performs inter-element well-formedness validation that would be too costly
341to perform in bit space, such as ensuring every start tag has a matching end tag.
342Xerces's namespace binding functionality is replaced by the <span class="ital">Namespace Processor</span>. Unlike Xerces,
343it is a discrete phase that produces a series of URI identifiers (URI IDs), the <span class="ital">URI stream</span>, which are
344associated with each symbol occurrence.
345This is discussed in Section \ref{section:arch:namespacehandling}.
346Finally, the <span class="ital">Validation</span> layer implements the Xerces's validator.
347However, preprocessing associated with each symbol greatly reduces the work of this stage.
349<p id="idp194728">
353<div class="section" id="idp195064">
354<h3 class="title" style="clear: both">Character Set Adapters</h3>
355<p id="idp195712">
356In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and
357provide the end-consumer with a single encoding format.
358In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
359because of the need to decode and classify each byte of input, mapping variable-length UTF-8
360byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   
361In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
362transcoding imposes at least a cost of buffer copying.
364<p id="idp196496">
365In icXML, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
366Given a specified input encoding, a CSA is responsible for checking that
367input code units represent valid characters, mapping the characters of the encoding into
368the appropriate bitstreams for XML parsing actions (i.e., producing the lexical item
369streams), as well as supporting ultimate transcoding requirements.   All of this work
370is performed using the parallel bitstream representation of the source input.
372<p id="idp197216">
373An important observation is that many character sets are an
374extension to the legacy 7-bit ASCII character set.  This includes the
375various ISO Latin character sets, UTF-8, UTF-16 and many others.
376Furthermore, all significant characters for parsing XML are confined to the
377ASCII repertoire.   Thus, a single common set of lexical item calculations
378serves to compute lexical item streams for all such ASCII-based character sets.
380<p id="idp197848">
381A second observation is that—regardless of which character set is used—quite
382often all of the characters in a particular block of input will be within the ASCII range.
383This is a very simple test to perform using the bitstream representation, simply confirming that the
384bit 0 stream is zero for the entire block.   For blocks satisfying this test,
385all logic dealing with non-ASCII characters can simply be skipped.
386Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
387UTF-16 form are each set to zero in this case.
389<p id="idp199400">
390A third observation is that repeated transcoding of the names of XML
391elements, attributes and so on can be avoided by using a look-up mechanism.
392That is, the first occurrence of each symbol is stored in a look-up
393table mapping the input encoding to a numeric symbol ID.   Transcoding
394of the symbol is applied at this time.  Subsequent look-up operations
395can avoid transcoding by simply retrieving the stored representation.
396As symbol look up is required to apply various XML validation rules,
397there is achieves the effect of transcoding each occurrence without
398additional cost.
400<p id="idp200184">
401The cost of individual character transcoding is avoided whenever a block of input is
402confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
403Furthermore, when transcoding is required, the parallel bitstream representation
404supports efficient transcoding operations.   
405In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bitstreams
406can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
407and all but the final bytes of multi-byte sequences can be marked for deletion as
408discussed in the following subsection.
409In other cases, transcoding within a block only need be applied for non-ASCII
410bytes, which are conveniently identified by iterating through the bit 0 stream
411using bit scan operations.
414<div class="section" id="idm9304">
415<h3 class="title" style="clear: both">Combined Parallel Filtering</h3>
416<p id="idm8952">
417As just mentioned, UTF-8 to UTF-16 transcoding involves marking
418all but the last bytes of multi-byte UTF-8 sequences as
419positions for deletion.   For example, the two
420Chinese characters <code class="code">䜠奜</code>
421are represented as two three-byte UTF-8 sequences <code class="code">E4 BD A0</code>
422and <code class="code">E5 A5 BD</code> while the UTF-16 representation must be
423compressed down to the two code units <code class="code">4F60</code> and <code class="code">597D</code>.
424In the bit parallel representation, this corresponds to a reduction
425from six bit positions representing UTF-8 code units (bytes)
426down to just two bit positions representing UTF-16 code units
427(double bytes).   This compression may be achieved by
428arranging to calculate the correct UTF-16 bits at the
429final position of each sequence and creating a deletion
430mask to mark the first two bytes of each 3-byte sequence
431for deletion.   In this case, the portion of the mask
432corresponding to these input bytes is the bit sequence
433<code class="code">110110</code>.  Using this approach, transcoding may then be
434completed by applying parallel deletion and inverse transposition of the
435UTF-16 bitstreams\cite{Cameron2008}.
437<p id="idm6424">
448<p id="idm3968">
449Rather than immediately paying the
450costs of deletion and transposition just for transcoding,
451however, icXML defers these steps so that the deletion
452masks for several stages of processing may be combined.
453In particular, this includes core XML requirements
454to normalize line breaks and to replace character
455reference and entity references by their corresponding
456text.   In the case of line break normalization,
457all forms of line breaks, including bare carriage
458returns (CR), line feeds (LF) and CR-LF combinations
459must be normalized to a single LF character in
460each case.   In icXML, this is achieved by
461first marking CR positions, performing two
462bit parallel operations to transform the marked
463CRs into LFs, and then marking for deletion any
464LF that is found immediately after the marked CR
465as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
468<p id="idm1464">
469In essence, the deletion masks for transcoding and
470for line break normalization each represent a bitwise
471filter; these filters can be combined using bitwise-or
472so that the parallel deletion algorithm need only be
473applied once.
475<p id="idp217752">
476A further application of combined filtering
477is the processing of XML character and entity
478references.   Consider, for example, the references <code class="code">&amp;</code> or <code class="code">&lt;</code>.
479which must be replaced in XML processing with 
480the single <code class="code">&amp;</code> and <code class="code">&lt;</code> characters, respectively.
481The approach in icXML is to mark all but the first character
482positions of each reference for deletion, leaving a
483single character position unmodified.  Thus, for the
484references <code class="code">&amp;</code> or <code class="code">&lt;</code> the
485masks <code class="code">01111</code> and <code class="code">011111</code> are formed and
486combined into the overall deletion mask.   After the
487deletion and inverse transposition operations are finally
488applied, a post-processing step inserts the proper character
489at these positions.   One note about this process is
490that it is speculative; references are assumed to generally
491be replaced by a single UTF-16 code unit.   In the case,
492that this is not true, it is addressed in post-processing.
494<p id="idp220384">
495The final step of combined filtering occurs during
496the process of reducing markup data to tag bytes
497preceding each significant XML transition as described
498in section~\ref{section:arch:contentstream}.  Overall, icXML
499avoids separate buffer copying operations for each of the
500these filtering steps, paying the cost of parallel
501deletion and inverse transposition only once. 
502Currently, icXML employs the parallel-prefix compress algorithm
503of Steele~\cite{HackersDelight}  Performance
504is independent of the number of positions deleted.
505Future versions of icXML are expected to
506take advantage of the parallel extract operation~\cite{HilewitzLee2006}
507that Intel is now providing in its Haswell architecture.
510<div class="section" id="idp221352">
511<h3 class="title" style="clear: both">Content Stream</h3>
512<p id="idp221696">
513A relatively-unique concept for icXML is the use of a filtered content stream.
514Rather that parsing an XML document in its original format, the input is transformed
515into one that is easier for the parser to iterate through and produce the sequential
517In , the source data
519is transformed into
522through the parallel filtering algorithm, described in section \ref{sec:parfilter}.
524<p id="idp223408">
525Combined with the symbol stream, the parser traverses the content stream to effectively
526reconstructs the input document in its output form.
527The initial <span class="ital">0</span> indicates an empty content string. The following <code class="code">&gt;</code>
528indicates that a start tag without any attributes is the first element in this text and
529the first unused symbol, <code class="code">document</code>, is the element name.
530Succeeding that is the content string <code class="code">fee</code>, which is null-terminated in accordance
531with the Xerces API specification. Unlike Xerces, no memory-copy operations
532are required to produce these strings, which as Figure~\ref{fig:xerces-profile} shows
533accounts for 6.83% of Xerces's execution time.
534Additionally, it is cheap to locate the terminal character of each string:
535using the String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
536null character in the content stream in parallel, which in turn means the parser can
537directly jump to the end of every string without scanning for it.
539<p id="idp226064">
540Following <code class="code">'fee'</code> is a <code class="code">=</code>, which marks the existence of an attribute.
541Because all of the intra-element was performed in the Parabix Subsystem, this must be a legal attribute.
542Since attributes can only occur within start tags and must be accompanied by a textual value,
543the next symbol in the symbol stream must be the element name of a start tag,
544and the following one must be the name of the attribute and the string that follows the <code class="code">=</code> must be its value.
545However, the subsequent <code class="code">=</code> is not treated as an independent attribute because the parser has yet to
546read a <code class="code">&gt;</code>, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and
547it (along with the string value) is added to the element.
548Eventually the parser reaches a <code class="code">/</code>, which marks the existence of an end tag. Every end tag requires an
549element name, which means they require a symbol. Inter-element validation whenever an empty tag is detected to
550ensure that the appropriate scope-nesting rules have been applied.
553<div class="section" id="idp228528">
554<h3 class="title" style="clear: both">Namespace Handling</h3>
555<p id="idp229176">
556In XML, namespaces prevents naming conflicts when multiple vocabularies are used together.
557It is especially important when a vocabulary application-dependant meaning, such as when
558XML or SVG documents are embedded within XHTML files.
559Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify
560specific names or resources.
561On line 1 of Figure \ref{fig:namespace1}, the <code class="code">xmlns</code> attribute instructs the XML
562processor to bind the prefix <code class="code">p</code> to the URI '<code class="code"></code>' and the default (empty)
563prefix to <code class="code"></code>. Thus to the XML processor, the <code class="code">title</code> on line 2 and
564<code class="code">price</code> on line 4 both read as <code class="code">"":title</code> and <code class="code">"":price</code> 
565respectively, whereas on line 3 and 5, <code class="code">p:name</code> and <code class="code">price</code> are seen as
566<code class="code">"":name</code> and <code class="code">"":price</code>. Even though the actual element name
567<code class="code">price</code>, due to namespace scoping rules they are viewed as two uniquely-named items
568because the current vocabulary is determined by the namespace(s) that are in-scope.
570<p id="idp233096">
573<p id="idp233368">
574In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID.
575These persist for the lifetime of the application through the use of a global URI pool.
576Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
577in the document. Because a namespace declaration affects the entire element, it must be processed prior to
578grammar validation. This is a costly process considering that a typical namespaced XML document only comes
579in one of two forms:
580(1) those that declare a set of namespaces upfront and never change them, and
581(2) those that repeatedly modify the namespaces in predictable patterns.
583<p id="idp233560">
584For that reason, icXML contains an independent namespace stack and utilizes bit vectors to cheaply perform
587When a prefix is declared (e.g., <code class="code">xmlns:p=""</code>), a namespace binding is created that maps
588the prefix (which are assigned Prefix IDs in the symbol resolution process) to the URI.
589Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
590NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
591prefix binding set of <code class="code">p</code> and <code class="code">xmlns</code> would be <code class="code">01</code> and <code class="code">11</code> respectively.
592To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
593maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
594NSID can be found using a bit-scan intrinsic.
595A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
597<p id="idp237336">
600<p id="idp237608">
606<p id="idp239176">
607To ensure that scoping rules are adhered to,
608whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored
609within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
610currently visible namespaces is XORed with the vector at the top of the stack.
611This allows any number of changes to be performed at each scope-level with a constant time.
615<div class="section" id="idp240160">
616<h3 class="title" style="clear: both">Error Handling</h3>
617<p id="idp240504">
619Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
620As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
621if not, the current state determines the type and severity of the error.
622icXML emits errors in the similar manner—but how it discovers them is substantially different.
623Recall that in Figure \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and Markup Processor,
624each with its own system for detecting and producing error messages.
626<p id="idp241640">
627Within the Parabix Subsystem, all computations are performed in parallel, a block at a time.
628Errors are derived as artifacts of bitstream calculations, with a 1-bit marking the byte-position of an error within a block,
629and the type of error is determined by the equation that discovered it.
630The difficulty of error processing in this section is that in Xerces the line and column number must be given
631with every error production. Two major issues exist because of this:
632(1) line position adheres to XML white-normalization rules; as such, some sequences of characters, e.g., a carriage return
633followed by a line feed, are counted as a single new line character.
634(2) column position is counted in characters, not bytes or code units;
635thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
636Note that typical XML documents are error-free but the calculation of the
637line/column position is a constant overhead in Xerces.
638To reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence of the error and
639performs the minimal amount of book-keeping necessary to facilitate it.
640icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
641within the Line Column Tracker (LCT).
642One of the CSA's major responsibilities is transcoding an input text.
643During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
644and validated.
645A <span class="ital">line-feed bitstream</span>, which marks the positions of the normalized new lines characters, is a natural derivative of
646this process.
647Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
649Column position is more difficult to calculate.
650It is possible to scan backwards through the bitstream of new line characters to determine the distance (in code-units)
651between the position between which an error was detected and the last line feed. However, this distance may exceed
652than the actual character position for the reasons discussed in (2).
653To handle this, the CSA generates a <span class="ital">skip mask</span> bitstream by ORing together many relevant bitstreams,
654such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
655normalization process.
656When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
657column number.
659<p id="idp245496">
660The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces.
661However, reporting the correct line/column is a much more difficult problem.
662The Markup Processor parses the content stream, which is a series of tagged UTF-16 strings.
663Each string is normalized in accordance with the XML specification.
664All symbol data and unnecessary whitespace is eliminated from the stream;
665thus its impossible to derive the current location using only the content stream.
666To calculate the location, the Markup Processor borrows three additional pieces of information from the Parabix Subsystem:
667the line-feed, skip mask, and a <span class="ital">deletion mask stream</span>, which is a bitstream denoting the (code-unit) position of every
668datum that was suppressed from the source during the production of the content stream.
669Armed with these, it is possible to calculate the actual line/column using
670the same system as the Parabix Subsystem until the sum of the negated deletion mask stream is equal to the current position.
674<div class="section" id="idp247200">
675<h2 class="title" style="clear: both">Multithreading with Pipeline Parallelism</h2>
676<p id="idp247568">
677As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application.
678These are "embarrassingly sequential."\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
679However, icXML is designed to organize processing into logical layers.   
680In particular, layers within the Parabix Subsystem are designed to operate
681over significant segments of input data before passing their outputs on for
682subsequent processing.  This fits well into the general model of pipeline
683parallelism, in which each thread is in charge of a single module or group
684of modules.
686<p id="idp249104">
687The most straightforward division of work in icXML is to separate
688the Parabix Subsystem and the Markup Processor into distinct logical layers into two separate stages.
689The resultant application, <span class="ital">icXML-p</span>, is a course-grained software-pipeline application.
690In this case, the Parabix Subsystem thread <code class="code">T<sub>1</sub></code> reads 16k of XML input <code class="code">I</code> at a time and produces the
691content, symbol and URI streams, then stores them in a pre-allocated shared data structure <code class="code">S</code>.
692The Markup Processor thread <code class="code">T<sub>2</sub></code> consumes <code class="code">S</code>, performs well-formedness and grammar-based validation,
693and the provides parsed XML data to the application through the Xerces API. 
694The shared data structure is implemented using a ring buffer,
695where every entry contains an independent set of data streams.
696In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
697A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
698In Figure \ref{threads_timeline1} the processing time of <code class="code">T<sub>1</sub></code> is longer than <code class="code">T<sub>2</sub></code>;
699thus <code class="code">T<sub>2</sub></code> always waits for <code class="code">T<sub>1</sub></code> to write to the shared memory.
700Figure \ref{threads_timeline2} illustrates the scenario in which <code class="code">T<sub>1</sub></code> is faster
701and must wait for <code class="code">T<sub>2</sub></code> to finish reading the shared data before it can reuse the memory space.
703<p id="idp254152">
706<p id="idp254424">
707Overall, our design is intended to benefit a range of applications.
708Conceptually, we consider two design points.
709The first, the parsing performed by the Parabix Subsystem dominates at 67% of the overall cost,
710with the cost of application processing (including the driver logic within the Markup Processor) at 33%.   
711The second is almost the opposite scenario, the cost of application processing dominates at 60%,
712while the cost of XML parsing represents an overhead of 40%.
714<p id="idp254616">
715Our design is predicated on a goal of using the Parabix
716framework to achieve a 50% to 100% improvement in the parsing engine itself.   
717In a best case scenario,
718a 100% improvement of the Parabix Subsystem for the design point in which
719XML parsing dominates at 67% of the total application cost.
720In this case, the single-threaded icXML should achieve a 1.5x speedup over Xerces
721so that the total application cost reduces to 67% of the original. 
722However, in icXML-p, our ideal scenario gives us two well-balanced threads
723each performing about 33% of the original work.   
724In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.
726<p id="idp256344">
727At the other extreme of our design range, we consider an application
728in which core parsing cost is 40%.  Assuming the 2x speedup of
729the Parabix Subsystem over the corresponding Xerces core, single-threaded
730icXML delivers a 25% speedup.  However, the most significant
731aspect of our two-stage multi-threaded design then becomes the
732ability to hide the entire latency of parsing within the serial time
733required by the application.  In this case, we achieve
734an overall speedup in processing time by 1.67x.
736<p id="idp257048">
737Although the structure of the Parabix Subsystem allows division of the work into
738several pipeline stages and has been demonstrated to be effective
739for four pipeline stages in a research prototype \cite{HPCA2012},
740our analysis here suggests that the further pipelining of work within
741the Parabix Subsystem is not worthwhile if the cost of application logic is little as
74233% of the end-to-end cost using Xerces.  To achieve benefits of
743further parallelization with multi-core technology, there would
744need to be reductions in the cost of application logic that
745could match reductions in core parsing cost.
748<div class="section" id="idp257920">
749<h2 class="title" style="clear: both">Performance</h2>
750<p id="idp258256">
751We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications:
752the Xerces C++ SAXCount sample application,
753and a real world GML to SVG transformation application.
754We investigated XML parser performance using an Intel Core i7 quad-core
755(Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core),
75632+32 kB (per core) L1 cache,
757256 kB (per core) L2 cache,
7588 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
760<p id="idp258920">
761We analyzed the execution profiles of each XML parser
762using the performance counters found in the processor.
763We chose several key hardware events that provide insight into the profile of each
764application and indicate if the processor is doing useful work. 
765The set of events included in our study are:
766processor cycles, branch instructions, branch mispredictions,
767and cache misses. The Performance Application Programming Interface
768(PAPI) Version 5.5.0 \cite{papi} toolkit
769was installed on the test system to facilitate the
770collection of hardware performance monitoring
771statistics. In addition, we used the Linux perf \cite{perf} utility
772to collect per core hardware events.
774<div class="section" id="idp259800">
775<h3 class="title" style="clear: both">Xerces C++ SAXCount</h3>
776<p id="idp260144">
777Xerces comes with sample applications that demonstrate salient features of the parser.
778SAXCount is the simplest such application:
779it counts the elements, attributes and characters of a given XML file using the (event based) SAX API
780and prints out the totals.
782<p id="idp260608">
785<p id="idp260880">
786Table \ref{XMLDocChars} shows the document characteristics of the XML input
787files selected for the Xerces C++ SAXCount benchmark. The jaw.xml
788represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequence
789required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented
790XML documents and consist entirely of single byte encoded ASCII characters.
792<p id="idp261072">
793A key predictor of the overall parsing performance of an XML file is markup density\footnote{
794  Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}.
795This metric has substantial influence on the performance of traditional recursive descent XML parsers
796because it directly corresponds to the number of state transitions that occur when parsing a document.
797We use a mixture of document-oriented and
798data-oriented XML files to analyze performance over a spectrum
799of markup densities.
801<p id="idp262824">
802Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML in terms of
803CPU cycles per byte for the SAXCount application.
804The speedup for icXML over Xerces is 1.3x to 1.8x.
805With two threads on the multicore machine, icXML-p can achieve speedup up to 2.7x.
806Xerces is substantially slowed by dense markup
807but icXML is less affected through a reduction in branches and the use of parallel-processing techniques.
808icXML-p performs better as markup-density increases because the work performed by each stage is
809well balanced in this application.
811<p id="idp263592">
815<div class="section" id="idp264184">
816<h3 class="title" style="clear: both">GML2SVG</h3>
817<p id="idp264504"></p>
820<div class="section" id="idp264760">
821<h2 class="title" style="clear: both">Conclusion and Future Work</h2>
822<p id="idp265112">
823This paper is the first case study documenting the significant
824performance benefits that may be realized through the integration
825of parallel bitstream technology into existing widely-used software libraries.
826In the case of the Xerces-C++ XML parser, the
827combined integration of SIMD and multicore parallelism was
828shown capable of dramatic producing dramatic increases in
829throughput and reductions in branch mispredictions and cache misses.
830The modified parser, going under the name icXML is designed
831to provide the full functionality of the original Xerces library
832with complete compatibility of APIs.  Although substantial
833re-engineering was required to realize the
834performance potential of parallel technologies, this
835is an important case study demonstrating the general
836feasibility of these techniques.
838<p id="idp266936">
839The further development of icXML to move beyond 2-stage
840pipeline parallelism is ongoing, with realistic prospects for
841four reasonably balanced stages within the library.  For
842applications such as GML2SVG which are dominated by time
843spent on XML parsing, such a multistage pipelined parsing
844library should offer substantial benefits. 
846<p id="idp267472">
847The example of XML parsing may be considered prototypical
848of finite-state machines applications which have sometimes
849been considered "embarassingly sequential" and so
850difficult to parallelize that "nothing works."  So the
851case study presented here should be considered an important
852data point in making the case that parallelization can
853indeed be helpful across a broad array of application types.
855<p id="idp268584">
856To overcome the software engineering challenges in applying
857parallel bitstream technology to existing software systems,
858it is clear that better library and tool support is needed.
859The techniques used in the implementation of icXML and
860documented in this paper could well be generalized for
861applications in other contexts and automated through
862the creation of compiler technology specifically supporting
863parallel bitstream programming.
866<div class="bibliography" id="idp269544">
867<h2 class="title" style="clear:both">Bibliography</h2>
868<p class="bibliomixed" id="XMLChip09">[Leventhal and Lemoine 2009] Leventhal, Michael and
869         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
870         Processing XML Efficiently 2009, Montréal.</p>
871<p class="bibliomixed" id="Datapower09">[Salz, Achilles and Maze 2009] Salz, Richard,
872         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
873         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
874         Efficiently 2009, Montréal.</p>
875<p class="bibliomixed" id="PPoPP08">[Cameron 2007] Cameron, Robert D. 2007. A Case Study
876         in SIMD Text Processing with Parallel Bit Streams UTF-8 to UTF-16 Transcoding. Proceedings
877         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
878         Lake City, Utah. On the Web at <a href="" class="link" target="_new"></a>.</p>
879<p class="bibliomixed" id="CASCON08">[Cameron, Herdy and Lin 2008] Cameron, Robert D.,
880         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
881         Technology. Proceedings of CASCON 2008. 13th ACM SIGPLAN Symposium on Principles and
882         Practice of Parallel Programming 2008, Toronto.</p>
883<p class="bibliomixed" id="SVGOpen08">[Herdy, Burggraf and Cameron 2008] Herdy, Kenneth
884         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
885         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
886         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
887         Nuremburg. On the Web at
888            <a href="" class="link" target="_new"></a>.</p>
889<p class="bibliomixed" id="Ross06">[Ross 2006] Ross, Kenneth A. 2006. Efficient hash
890         probes on modern processors. Proceedings of ICDE, 2006. ICDE 2006, Atlanta. On the Web at
891            <a href="" class="link" target="_new"></a>.</p>
892<p class="bibliomixed" id="ASPLOS09">[Cameron and Lin 2009] Cameron, Robert D. and Dan
893         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
894         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</p>
895<p class="bibliomixed" id="Wu08">[Wu et al. 2008] Wu, Yu, Qi Zhang, Zhiqiang Yu and
896         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
897         Proceedings of Balisage 2008, Montréal. On the Web at
898            <a href="" class="link" target="_new"></a>.</p>
899<p class="bibliomixed" id="u8u16">[Cameron 2008] u8u16 - A High-Speed UTF-8 to UTF-16
900         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
901         Science Simon Fraser University, June 21 2007.</p>
902<p class="bibliomixed" id="XML10">[XML 1.0] Extensible Markup Language (XML) 1.0 (Fifth
903         Edition) W3C Recommendation 26 November 2008. On the Web at
904            <a href="" class="link" target="_new"></a>.</p>
905<p class="bibliomixed" id="Unicode">[Unicode] The Unicode Consortium. 2009. On the Web at
906            <a href="" class="link" target="_new"></a>.</p>
907<p class="bibliomixed" id="Pex06">[Hilewitz and Lee 2006]  Hilewitz, Y. and Ruby B. Lee.
908         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
909         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
910         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</p>
911<p class="bibliomixed" id="InfoSet">[XML Infoset] XML Information Set (Second Edition) W3C
912         Recommendation 4 February 2004. On the Web at
913         <a href="" class="link" target="_new"></a>.</p>
914<p class="bibliomixed" id="Saxon">[Saxon] SAXON The XSLT and XQuery Processor. On the Web
915         at <a href="" class="link" target="_new"></a>.</p>
916<p class="bibliomixed" id="Kay08">[Kay 2008]  Kay, Michael Y. 2008. Ten Reasons Why Saxon
917         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</p>
918<p class="bibliomixed" id="AElfred">[Ælfred]  The Ælfred XML Parser. On the Web at
919            <a href="" class="link" target="_new"></a>.</p>
920<p class="bibliomixed" id="JNI">[Hitchens 2002] Hitchens, Ron. Java NIO. O'Reilly, 2002.</p>
921<p class="bibliomixed" id="Expat">[Expat] The Expat XML Parser.
922            <a href="" class="link" target="_new"></a>.</p>
925<div id="balisage-footer"><h3 style="font-family: serif; margin:0.25em">
926<i>Balisage:</i> <small>The Markup Conference</small>
Note: See TracBrowser for help on using the repository browser.