source: docs/Balisage13/Bal2013came0601/Bal2013came0601.html @ 3039

Last change on this file since 3039 was 3039, checked in by ksherdy, 6 years ago

Initial translation. Special characters, figures, tables, bib, to go.

File size: 50.9 KB
1<html lang="en">
3<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
5<link rel="stylesheet" href="balisage-plain.css" type="text/css">
6<meta name="keywords" content="">
9<div id="balisage-header"><h1 style="text-align: right; font-family: serif; margin:0.25em">
10<i>Balisage:</i> <small>The Markup Conference</small>
12<div lang="en" class="article">
13<div class="titlepage">
14<h2 class="article-title" id="idp197760"></h2>
15<div class="author">
16<h3 class="author">Nigel Medforth</h3>
17<div class="affiliation">
18<p class="jobtitle"></p>
19<p class="orgname"></p>
21<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
23<div class="author">
24<h3 class="author">Dan Lin</h3>
25<div class="affiliation">
26<p class="jobtitle"></p>
27<p class="orgname"></p>
29<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
31<div class="author">
32<h3 class="author">Kenneth Herdy</h3>
33<div class="affiliation">
34<p class="jobtitle">Graduate Student, School of Computing Science</p>
35<p class="orgname">Simon Fraser University </p>
37<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
39<div class="author">
40<h3 class="author">Rob Cameron</h3>
41<div class="affiliation">
42<p class="jobtitle">Professor of Computing Science</p>
43<p class="orgname">Simon Fraser University</p>
45<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
47<div class="author">
48<h3 class="author">Arrvindh Shriraman</h3>
49<div class="affiliation">
50<p class="jobtitle"></p>
51<p class="orgname"></p>
53<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
55<div class="abstract">
56<p class="title"><b>Abstract</b></p>
57<p id="idp286432">Prior research on the acceleration of XML processing
58using SIMD and multi-core parallelism has lead to
59a number of interesting research prototypes.  This work
60investigates the extent to which the techniques underlying
61these prototypes could result in systematic performance
62benefits when fully integrated into a commercial XML parser.
63The widely used Xerces-C++ parser of the Apache Software
64Foundation was chosen as the foundation for the study.
65A systematic restructuring of the parser was undertaken,
66while maintaining the existing API for application programmers.
67Using SIMD techniques alone, an increase in parsing speed
68of at least 50% was observed in a range of applications.
69When coupled with pipeline parallelism on dual core processors,
70improvements of 2x and beyond were realized.
75<div class="toc">
76<p><b>Table of Contents</b></p>
78<dt><span class="section"><a href="#idp273984" class="toc">Introduction</a></span></dt>
79<dt><span class="section"><a href="#idp274904" class="toc">Background</a></span></dt>
81<dt><span class="section"><a href="#idp275240" class="toc">Xerces C++ Structure</a></span></dt>
82<dt><span class="section"><a href="#idp280344" class="toc">The Parabix Framework</a></span></dt>
83<dt><span class="section"><a href="#idp416432" class="toc">Sequential vs. Parallel Paradigm</a></span></dt>
85<dt><span class="section"><a href="#idp419384" class="toc">Architecture</a></span></dt>
87<dt><span class="section"><a href="#idp419704" class="toc">Overview</a></span></dt>
88<dt><span class="section"><a href="#idp429312" class="toc">Character Set Adapters</a></span></dt>
89<dt><span class="section"><a href="#idp435504" class="toc">Combined Parallel Filtering</a></span></dt>
90<dt><span class="section"><a href="#idp249920" class="toc">Content Stream</a></span></dt>
91<dt><span class="section"><a href="#idp254672" class="toc">Namespace Handling</a></span></dt>
92<dt><span class="section"><a href="#idp469568" class="toc">Error Handling</a></span></dt>
94<dt><span class="section"><a href="#idp476680" class="toc">Multithreading with Pipeline Parallelism</a></span></dt>
95<dt><span class="section"><a href="#idp483472" class="toc">Performance</a></span></dt>
97<dt><span class="section"><a href="#idp485352" class="toc">Xerces C++ SAXCount</a></span></dt>
98<dt><span class="section"><a href="#idp490336" class="toc">GML2SVG</a></span></dt>
100<dt><span class="section"><a href="#idp490912" class="toc">Conclusion and Future Work</a></span></dt>
103<div class="section" id="idp273984">
104<h2 class="title" style="clear: both">Introduction</h2>
105<p id="idp274328"></p>
106<p id="idp274456"></p>
107<p id="idp274584"></p>
108<p id="idp274712"></p>
110<div class="section" id="idp274904">
111<h2 class="title" style="clear: both">Background</h2>
112<div class="section" id="idp275240">
113<h3 class="title" style="clear: both">Xerces C++ Structure</h3>
114<p id="idp275584">
115The Xerces C++ parser
120features comprehensive support for a variety of character encodings
121both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
122multiple XML vocabularies through the XML namespace
123mechanism, as well as complete implementations
124of structure and data validation through multiple grammars
125declared using either legacy DTDs (document type
126definitions) or modern XML Schema facilities.
127Xerces also supports several APIs for accessing
128parser services, including event-based parsing
129using either pull parsing or SAX/SAX2 push-style
130parsing as well as a DOM tree-based parsing interface.
132<p id="idp277232">
136Xerces, like all traditional parsers, processes XML documents sequentially a byte-at-a-time from the
137first to the last byte of input data. Each byte passes through several processing layers and is
138classified and eventually validated within the context of the document state.
139This introduces implicit dependencies between the various tasks within the application that make it
140difficult to optimize for performance.
141As a complex software system, no one feature dominates the overall parsing performance.
142Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
143Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
144parallelization in isolation would only produce a minute improvement in performance.
145Unfortunately, early investigation into these functions found
146that incorporating speculation-free thread-level parallelization was impossible
147and they were already performing well in their given tasks;
148thus only trivial enhancements were attainable.
149In order to obtain a systematic acceleration of Xerces,
150it should be expected that a comprehensive restructuring
151is required, involving all aspects of the parser.
153<p id="idp279408">
160<div class="section" id="idp280344">
161<h3 class="title" style="clear: both">The Parabix Framework</h3>
162<p id="idp280664">
163The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
164(and other forms of text processing.) The key idea is to exploit the availability of wide
165SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
166of input data by using one register bit per single input byte.
167To facilitate this, the input data is first transposed into a set of basis bit streams.
170Boolean-logic operations\footnote{$∧$, $\√$ and $¬$ denote the boolean AND, OR and NOT operators.}
171are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
172characters (or groups of characters) with a $1$.
173For example, one of the fundamental characters in XML is a left-angle bracket.
174A character is a<code class="code">&lt; if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
175b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1</code>.
176Similarly, a character is numeric
177<code class="code">[0-9] if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))</code>.
178An important observation here is that ranges of characters may
179require fewer operations than individual characters and
181multiple classes can share the classification cost.
183<p id="idp409632">
186<p id="idp411072">
187Consider, for example, the XML source data stream shown in the first line of .
188The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
189parsing, with each bit of each stream in one-to-one correspondence to the source character code units
190of the input stream.
191For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
192The first bit stream shown is that for the opening
193angle brackets that represent tag openers in XML.
194The second and third streams show a partition of the
195tag openers into start tag marks and end tag marks
196depending on the character immediately following the
197opener (i.e., ``\verb:/:'') or not.  The remaining three
198lines show streams that can be computed in subsequent
199parsing (using the technique
200of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
201attribute names and attribute values of tags. 
203<p id="idp413944">
204Two intuitions may help explain how the Parabix approach can lead
205to improved XML parsing performance. The first is that
206the use of the full register width offers a considerable
207information advantage over sequential byte-at-a-time
208parsing.  That is, sequential processing of bytes
209uses just 8 bits of each register, greatly limiting the
210processor resources that are effectively being used at any one time.
211The second is that byte-at-a-time loop scanning loops are actually
212often just computing a single bit of information per iteration:
213is the scan complete yet?
214Rather than computing these individual decision-bits, an approach that computes
215many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
216should provide substantial benefit.
218<p id="idp414904">
219Previous studies have shown that the Parabix approach improves many aspects of XML processing,
220including transcoding \cite{Cameron2008}, character classification and validation,
221tag parsing and well-formedness checking. 
222The first Parabix parser used processor bit scan instructions to considerably accelerate
223sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
224Recent work has incorporated a method of parallel
225scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as
226well as combining SIMD methods with 4-stage pipeline parallelism to further improve
227throughput \cite{HPCA2012}.
228Although these research prototypes handled the full syntax of schema-less XML documents,
229they lacked the functionality required by full XML parsers.
231<p id="idp415880">
232Commercial XML processors support transcoding of multiple character sets and can parse and
233validate against multiple document vocabularies.
234Additionally, they provide API facilities beyond those found in research prototypes,
235including the widely used SAX, SAX2 and DOM interfaces.
238<div class="section" id="idp416432">
239<h3 class="title" style="clear: both">Sequential vs. Parallel Paradigm</h3>
240<p id="idp416752">
241Xerces—like all traditional XML parsers—processes XML documents sequentially.
242Each character is examined to distinguish between the
243XML-specific markup, such as a left angle bracket <code class="code">&lt;</code>, and the
244content held within the document. 
245As the parser progresses through the document, it alternates between markup scanning,
246validation and content processing modes.
248<p id="idp417824">
249In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
250  Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
251  current state and the events associated with transitions of states.}.
252Each state transition indicates the processing context of subsequent characters.
253Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
255<p id="idp418488">
256Parabix-style XML parsers utilize a concept of layered processing.
257A block of source text is transformed into a set of lexical \bitstream{}s,
258which undergo a series of operations that can be grouped into logical layers,
259e.g., transposition, character classification, and lexical analysis.
260Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
261To meet the API requirements of the document-ordered Xerces output,
262the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
266<div class="section" id="idp419384">
267<h2 class="title" style="clear: both">Architecture</h2>
268<div class="section" id="idp419704">
269<h3 class="title" style="clear: both">Overview</h3>
270<p id="idp420152">
271\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
272rearchitected with pipeline parallelism in mind.
273In this section, we highlight the core differences between the two systems.
274As shown in Figure \ref{fig:xerces-arch}, Xerces
275is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
276The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
277the majority of the character set encoding validation is performed as a byproduct of this process.
278The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
279It tracks the current line/column position,
281performs line-break normalization and validates context-specific character set issues,
282such as tokenization of qualified-names.
283The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
284of the document; it deals with all issues related to entity expansion, validates
285the XML well-formedness constraints and any character set encoding issues that cannot
286be completely handled by the reader or transcoder (e.g., surrogate characters, validation
287and normalization of character references, etc.)
288The {\it Namespace Binder} is a core piece of the element stack.
289It handles namespace scoping issues between different XML vocabularies.
290This allows the scanner to properly select the correct schema grammar structures.
291The {\it Validator} takes the IR produced by the Scanner (and
292potentially annotated by the Namespace Binder) and assesses whether the final output matches
293the user-defined DTD and schema grammar(s) before passing it to the end-user.
295<p id="idp422304">
298<p id="idp422560">
299In \icXML{} functions are grouped into logical components.
300As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
301All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
302The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
303mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
304set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
305These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
306after additional processing is performed.
307The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
308It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
309significant positions within the input data. One \bitstream{} for each of the critical piece of information is created, such as
310the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
311Intra-element well-formedness validation is performed as an artifact of this process.
312Like Xerces, \icXML{} must provide the Line and Column position of each error.
313The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
314optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
315From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
317<p id="idp426184">
318A typical XML file contains few unique element and attribute names—but each of them will occur frequently.
319\icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID).
320Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
321the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
323<p id="idp427352">
324The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
325The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
326information provided by the Parallel Markup Parser, and the latter transforms the
327filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
329<p id="idp427944">
330Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document.
331The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
332The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
333to perform in bit space, such as ensuring every start tag has a matching end tag.
334Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
335it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are
336associated with each symbol occurrence.
337This is discussed in Section \ref{section:arch:namespacehandling}.
338Finally, the {\it Validation} layer implements the Xerces's validator.
339However, preprocessing associated with each symbol greatly reduces the work of this stage.
341<p id="idp428992">
345<div class="section" id="idp429312">
346<h3 class="title" style="clear: both">Character Set Adapters</h3>
347<p id="idp429936">
348In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and
349provide the end-consumer with a single encoding format.
350In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
351because of the need to decode and classify each byte of input, mapping variable-length UTF-8
352byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   
353In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
354transcoding imposes at least a cost of buffer copying.
356<p id="idp430720">
357In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
358Given a specified input encoding, a CSA is responsible for checking that
359input code units represent valid characters, mapping the characters of the encoding into
360the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item
361streams), as well as supporting ultimate transcoding requirements.   All of this work
362is performed using the parallel \bitstream{} representation of the source input.
364<p id="idp431448">
365An important observation is that many character sets are an
366extension to the legacy 7-bit ASCII character set.  This includes the
367various ISO Latin character sets, UTF-8, UTF-16 and many others.
368Furthermore, all significant characters for parsing XML are confined to the
369ASCII repertoire.   Thus, a single common set of lexical item calculations
370serves to compute lexical item streams for all such ASCII-based character sets.
372<p id="idp432080">
373A second observation is that—regardless of which character set is used—quite
374often all of the characters in a particular block of input will be within the ASCII range.
375This is a very simple test to perform using the \bitstream{} representation, simply confirming that the
376bit 0 stream is zero for the entire block.   For blocks satisfying this test,
377all logic dealing with non-ASCII characters can simply be skipped.
378Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the
379UTF-16 form are each set to zero in this case.
381<p id="idp433648">
382A third observation is that repeated transcoding of the names of XML
383elements, attributes and so on can be avoided by using a look-up mechanism.
384That is, the first occurrence of each symbol is stored in a look-up
385table mapping the input encoding to a numeric symbol ID.   Transcoding
386of the symbol is applied at this time.  Subsequent look-up operations
387can avoid transcoding by simply retrieving the stored representation.
388As symbol look up is required to apply various XML validation rules,
389there is achieves the effect of transcoding each occurrence without
390additional cost.
392<p id="idp434432">
393The cost of individual character transcoding is avoided whenever a block of input is
394confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
395Furthermore, when transcoding is required, the parallel \bitstream{} representation
396supports efficient transcoding operations.   
397In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s
398can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
399and all but the final bytes of multi-byte sequences can be marked for deletion as
400discussed in the following subsection.
401In other cases, transcoding within a block only need be applied for non-ASCII
402bytes, which are conveniently identified by iterating through the bit 0 stream
403using bit scan operations.
406<div class="section" id="idp435504">
407<h3 class="title" style="clear: both">Combined Parallel Filtering</h3>
408<p id="idp435824">
409As just mentioned, UTF-8 to UTF-16 transcoding involves marking
410all but the last bytes of multi-byte UTF-8 sequences as
411positions for deletion.   For example, the two
412Chinese characters \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}
413are represented as two three-byte UTF-8 sequences \verb'E4 BD A0'
414and \verb'E5 A5 BD' while the UTF-16 representation must be
415compressed down to the two code units \verb'4F60' and \verb'597D'.
416In the bit parallel representation, this corresponds to a reduction
417from six bit positions representing UTF-8 code units (bytes)
418down to just two bit positions representing UTF-16 code units
419(double bytes).   This compression may be achieved by
420arranging to calculate the correct UTF-16 bits at the
421final position of each sequence and creating a deletion
422mask to mark the first two bytes of each 3-byte sequence
423for deletion.   In this case, the portion of the mask
424corresponding to these input bytes is the bit sequence
425\verb'110110'.  Using this approach, transcoding may then be
426completed by applying parallel deletion and inverse transposition of the
427UTF-16 \bitstream{}s\cite{Cameron2008}.
429<p id="idp438672">
440<p id="idp441152">
441Rather than immediately paying the
442costs of deletion and transposition just for transcoding,
443however, \icXML{} defers these steps so that the deletion
444masks for several stages of processing may be combined.
445In particular, this includes core XML requirements
446to normalize line breaks and to replace character
447reference and entity references by their corresponding
448text.   In the case of line break normalization,
449all forms of line breaks, including bare carriage
450returns (CR), line feeds (LF) and CR-LF combinations
451must be normalized to a single LF character in
452each case.   In \icXML{}, this is achieved by
453first marking CR positions, performing two
454bit parallel operations to transform the marked
455CRs into LFs, and then marking for deletion any
456LF that is found immediately after the marked CR
457as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
460<p id="idp441472">
461In essence, the deletion masks for transcoding and
462for line break normalization each represent a bitwise
463filter; these filters can be combined using bitwise-or
464so that the parallel deletion algorithm need only be
465applied once.
467<p id="idp443472">
468A further application of combined filtering
469is the processing of XML character and entity
470references.   Consider, for example, the references <code class="code">&amp;</code> or <code class="code">&lt;</code>.
471which must be replaced in XML processing with 
472the single <code class="code">&amp;</code> and <code class="code">&lt;</code> characters, respectively.
473The approach in \icXML{} is to mark all but the first character
474positions of each reference for deletion, leaving a
475single character position unmodified.  Thus, for the
476references <code class="code">&amp;</code> or <code class="code">&lt;</code> the
477masks <code class="code">01111</code> and <code class="code">011111</code> are formed and
478combined into the overall deletion mask.   After the
479deletion and inverse transposition operations are finally
480applied, a post-processing step inserts the proper character
481at these positions.   One note about this process is
482that it is speculative; references are assumed to generally
483be replaced by a single UTF-16 code unit.   In the case,
484that this is not true, it is addressed in post-processing.
486<p id="idp446576">
487The final step of combined filtering occurs during
488the process of reducing markup data to tag bytes
489preceding each significant XML transition as described
490in section~\ref{section:arch:contentstream}.  Overall, \icXML{}
491avoids separate buffer copying operations for each of the
492these filtering steps, paying the cost of parallel
493deletion and inverse transposition only once. 
494Currently, \icXML{} employs the parallel-prefix compress algorithm
495of Steele~\cite{HackersDelight}  Performance
496is independent of the number of positions deleted.
497Future versions of \icXML{} are expected to
498take advantage of the parallel extract operation~\cite{HilewitzLee2006}
499that Intel is now providing in its Haswell architecture.
502<div class="section" id="idp249920">
503<h3 class="title" style="clear: both">Content Stream</h3>
504<p id="idp250240">
505A relatively-unique concept for \icXML{} is the use of a filtered content stream.
506Rather that parsing an XML document in its original format, the input is transformed
507into one that is easier for the parser to iterate through and produce the sequential
509In , the source data
511is transformed into
514through the parallel filtering algorithm, described in section \ref{sec:parfilter}.
516<p id="idp251840">
517Combined with the symbol stream, the parser traverses the content stream to effectively
518reconstructs the input document in its output form.
519The initial {\tt\it 0} indicates an empty content string. The following \verb|&gt;|
520indicates that a start tag without any attributes is the first element in this text and
521the first unused symbol, <code class="code">document</code>, is the element name.
522Succeeding that is the content string <code class="code">fee</code>, which is null-terminated in accordance
523with the Xerces API specification. Unlike Xerces, no memory-copy operations
524are required to produce these strings, which as Figure~\ref{fig:xerces-profile} shows
525accounts for 6.83% of Xerces's execution time.
526Additionally, it is cheap to locate the terminal character of each string:
527using the String End \bitstream{}, the \PS{} can effectively calculate the offset of each
528null character in the content stream in parallel, which in turn means the parser can
529directly jump to the end of every string without scanning for it.
531<p id="idp253376">
532Following ``\verb`fee`'' is a \verb`=`, which marks the existence of an attribute.
533Because all of the intra-element was performed in the \PS{}, this must be a legal attribute.
534Since attributes can only occur within start tags and must be accompanied by a textual value,
535the next symbol in the symbol stream must be the element name of a start tag,
536and the following one must be the name of the attribute and the string that follows the \verb`=` must be its value.
537However, the subsequent \verb`=` is not treated as an independent attribute because the parser has yet to
538read a \verb`&gt;`, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and
539it (along with the string value) is added to the element.
540Eventually the parser reaches a \verb`/`, which marks the existence of an end tag. Every end tag requires an
541element name, which means they require a symbol. Inter-element validation whenever an empty tag is detected to
542ensure that the appropriate scope-nesting rules have been applied.
545<div class="section" id="idp254672">
546<h3 class="title" style="clear: both">Namespace Handling</h3>
547<p id="idp255320">
548In XML, namespaces prevents naming conflicts when multiple vocabularies are used together.
549It is especially important when a vocabulary application-dependant meaning, such as when
550XML or SVG documents are embedded within XHTML files.
551Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify
552specific names or resources.
553On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML
554processor to bind the prefix <code class="code">p</code> to the URI '<code class="code"></code>' and the default (empty)
555prefix to <code class="code"></code>. Thus to the XML processor, the \verb|title| on line 2 and
556\verb|price| on line 4 both read as \verb|"":title| and \verb|"":price|
557respectively, whereas on line 3 and 5, \verb|p:name| and \verb|price| are seen as
558\verb|"":name| and \verb|"":price|. Even though the actual element name
559\verb|price|, due to namespace scoping rules they are viewed as two uniquely-named items
560because the current vocabulary is determined by the namespace(s) that are in-scope.
562<p id="idp257760">
565<p id="idp463944">
566In both Xerces and \icXML{}, every URI has a one-to-one mapping to a URI ID.
567These persist for the lifetime of the application through the use of a global URI pool.
568Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
569in the document. Because a namespace declaration affects the entire element, it must be processed prior to
570grammar validation. This is a costly process considering that a typical namespaced XML document only comes
571in one of two forms:
572(1) those that declare a set of namespaces upfront and never change them, and
573(2) those that repeatedly modify the namespaces in predictable patterns.
575<p id="idp464136">
576For that reason, \icXML{} contains an independent namespace stack and utilizes bit vectors to cheaply perform
579When a prefix is declared (e.g., \verb|xmlns:p=""|), a namespace binding is created that maps
580the prefix (which are assigned Prefix IDs in the symbol resolution process) to the URI.
581Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
582NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
583prefix binding set of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively.
584To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
585maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
586NSID can be found using a bit-scan intrinsic.
587A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
589<p id="idp466872">
592<p id="idp467128">
598<p id="idp468584">
599To ensure that scoping rules are adhered to,
600whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored
601within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
602currently visible namespaces is XORed with the vector at the top of the stack.
603This allows any number of changes to be performed at each scope-level with a constant time.
607<div class="section" id="idp469568">
608<h3 class="title" style="clear: both">Error Handling</h3>
609<p id="idp469912">
611Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
612As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
613if not, the current state determines the type and severity of the error.
614\icXML{} emits errors in the similar manner—but how it discovers them is substantially different.
615Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and \MP{},
616each with its own system for detecting and producing error messages.
618<p id="idp471056">
619Within the \PS{}, all computations are performed in parallel, a block at a time.
620Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte-position of an error within a block,
621and the type of error is determined by the equation that discovered it.
622The difficulty of error processing in this section is that in Xerces the line and column number must be given
623with every error production. Two major issues exist because of this:
624(1) line position adheres to XML white-normalization rules; as such, some sequences of characters, e.g., a carriage return
625followed by a line feed, are counted as a single new line character.
626(2) column position is counted in characters, not bytes or code units;
627thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
628Note that typical XML documents are error-free but the calculation of the
629line/column position is a constant overhead in Xerces.
630To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and
631performs the minimal amount of book-keeping necessary to facilitate it.
632\icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
633within the Line Column Tracker (LCT).
634One of the CSA's major responsibilities is transcoding an input text.
635During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
636and validated.
637A {\it line-feed \bitstream{}}, which marks the positions of the normalized new lines characters, is a natural derivative of
638this process.
639Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
641Column position is more difficult to calculate.
642It is possible to scan backwards through the \bitstream{} of new line characters to determine the distance (in code-units)
643between the position between which an error was detected and the last line feed. However, this distance may exceed
644than the actual character position for the reasons discussed in (2).
645To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s,
646such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
647normalization process.
648When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
649column number.
651<p id="idp474368">
652The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces.
653However, reporting the correct line/column is a much more difficult problem.
654The \MP{} parses the content stream, which is a series of tagged UTF-16 strings.
655Each string is normalized in accordance with the XML specification.
656All symbol data and unnecessary whitespace is eliminated from the stream;
657thus its impossible to derive the current location using only the content stream.
658To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
659the line-feed, skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every
660datum that was suppressed from the source during the production of the content stream.
661Armed with these, it is possible to calculate the actual line/column using
662the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
666<div class="section" id="idp476680">
667<h2 class="title" style="clear: both">Multithreading with Pipeline Parallelism</h2>
668<p id="idp477000">
669As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application.
670These are ``embarrassingly sequential.''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
671However, \icXML{} is designed to organize processing into logical layers.   
672In particular, layers within the \PS{} are designed to operate
673over significant segments of input data before passing their outputs on for
674subsequent processing.  This fits well into the general model of pipeline
675parallelism, in which each thread is in charge of a single module or group
676of modules.
678<p id="idp477784">
679The most straightforward division of work in \icXML{} is to separate
680the \PS{} and the \MP{} into distinct logical layers into two separate stages.
681The resultant application, {\it\icXMLp{}}, is a course-grained software-pipeline application.
682In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the
683content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$.
684The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation,
685and the provides parsed XML data to the application through the Xerces API. 
686The shared data structure is implemented using a ring buffer,
687where every entry contains an independent set of data streams.
688In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
689A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
690In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than $T_2$;
691thus $T_2$ always waits for $T_1$ to write to the shared memory.
692Figure \ref{threads_timeline2} illustrates the scenario in which $T_1$ is faster
693and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
695<p id="idp479272">
698<p id="idp479528">
699Overall, our design is intended to benefit a range of applications.
700Conceptually, we consider two design points.
701The first, the parsing performed by the \PS{} dominates at 67% of the overall cost,
702with the cost of application processing (including the driver logic within the \MP{}) at 33%.   
703The second is almost the opposite scenario, the cost of application processing dominates at 60%,
704while the cost of XML parsing represents an overhead of 40%.
706<p id="idp479720">
707Our design is predicated on a goal of using the Parabix
708framework to achieve a 50% to 100% improvement in the parsing engine itself.   
709In a best case scenario,
710a 100% improvement of the \PS{} for the design point in which
711XML parsing dominates at 67% of the total application cost.
712In this case, the single-threaded \icXML{} should achieve a 1.5x speedup over Xerces
713so that the total application cost reduces to 67% of the original. 
714However, in \icXMLp{}, our ideal scenario gives us two well-balanced threads
715each performing about 33% of the original work.   
716In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.
718<p id="idp481424">
719At the other extreme of our design range, we consider an application
720in which core parsing cost is 40%.   Assuming the 2x speedup of
721the \PS{} over the corresponding Xerces core, single-threaded
722\icXML{} delivers a 25% speedup.   However, the most significant
723aspect of our two-stage multi-threaded design then becomes the
724ability to hide the entire latency of parsing within the serial time
725required by the application.   In this case, we achieve
726an overall speedup in processing time by 1.67x.
728<p id="idp482624">
729Although the structure of the \PS{} allows division of the work into
730several pipeline stages and has been demonstrated to be effective
731for four pipeline stages in a research prototype \cite{HPCA2012},
732our analysis here suggests that the further pipelining of work within
733the \PS{} is not worthwhile if the cost of application logic is little as
73433% of the end-to-end cost using Xerces.  To achieve benefits of
735further parallelization with multi-core technology, there would
736need to be reductions in the cost of application logic that
737could match reductions in core parsing cost.
740<div class="section" id="idp483472">
741<h2 class="title" style="clear: both">Performance</h2>
742<p id="idp483808">
743We evaluate \xerces{}, \icXML{}, \icXMLp{} against two benchmarking applications:
744the Xerces C++ SAXCount sample application,
745and a real world GML to SVG transformation application.
746We investigated XML parser performance using an Intel Core i7 quad-core
747(Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core),
74832+32 kB (per core) L1 cache,
749256 kB (per core) L2 cache,
7508 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
752<p id="idp484472">
753We analyzed the execution profiles of each XML parser
754using the performance counters found in the processor.
755We chose several key hardware events that provide insight into the profile of each
756application and indicate if the processor is doing useful work. 
757The set of events included in our study are:
758processor cycles, branch instructions, branch mispredictions,
759and cache misses. The Performance Application Programming Interface
760(PAPI) Version 5.5.0 \cite{papi} toolkit
761was installed on the test system to facilitate the
762collection of hardware performance monitoring
763statistics. In addition, we used the Linux perf \cite{perf} utility
764to collect per core hardware events.
766<div class="section" id="idp485352">
767<h3 class="title" style="clear: both">Xerces C++ SAXCount</h3>
768<p id="idp485696">
769Xerces comes with sample applications that demonstrate salient features of the parser.
770SAXCount is the simplest such application:
771it counts the elements, attributes and characters of a given XML file using the (event based) SAX API
772and prints out the totals.
774<p id="idp486160">
777<p id="idp486432">
778Table \ref{XMLDocChars} shows the document characteristics of the XML input
779files selected for the Xerces C++ SAXCount benchmark. The jaw.xml
780represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequence
781required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented
782XML documents and consist entirely of single byte encoded ASCII characters.
784<p id="idp486624">
785A key predictor of the overall parsing performance of an XML file is markup density\footnote{
786  Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}.
787This metric has substantial influence on the performance of traditional recursive descent XML parsers
788because it directly corresponds to the number of state transitions that occur when parsing a document.
789We use a mixture of document-oriented and
790data-oriented XML files to analyze performance over a spectrum
791of markup densities.
793<p id="idp488376">
794Figure \ref{perf_SAX} compares the performance of Xerces, \icXML{} and pipelined \icXML{} in terms of
795CPU cycles per byte for the SAXCount application.
796The speedup for \icXML{} over Xerces is 1.3x to 1.8x.
797With two threads on the multicore machine, \icXMLp{} can achieve speedup up to 2.7x.
798Xerces is substantially slowed by dense markup
799but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques.
800\icXMLp{} performs better as markup-density increases because the work performed by each stage is
801well balanced in this application.
803<p id="idp489744">
807<div class="section" id="idp490336">
808<h3 class="title" style="clear: both">GML2SVG</h3>
809<p id="idp490656"></p>
812<div class="section" id="idp490912">
813<h2 class="title" style="clear: both">Conclusion and Future Work</h2>
814<p id="idp491264">
815This paper is the first case study documenting the significant
816performance benefits that may be realized through the integration
817of parallel \bitstream{} technology into existing widely-used software libraries.
818In the case of the Xerces-C++ XML parser, the
819combined integration of SIMD and multicore parallelism was
820shown capable of dramatic producing dramatic increases in
821throughput and reductions in branch mispredictions and cache misses.
822The modified parser, going under the name \icXML{} is designed
823to provide the full functionality of the original Xerces library
824with complete compatibility of APIs.  Although substantial
825re-engineering was required to realize the
826performance potential of parallel technologies, this
827is an important case study demonstrating the general
828feasibility of these techniques.
830<p id="idp492280">
831The further development of \icXML{} to move beyond 2-stage
832pipeline parallelism is ongoing, with realistic prospects for
833four reasonably balanced stages within the library.  For
834applications such as GML2SVG which are dominated by time
835spent on XML parsing, such a multistage pipelined parsing
836library should offer substantial benefits. 
838<p id="idp492824">
839The example of XML parsing may be considered prototypical
840of finite-state machines applications which have sometimes
841been considered ``embarassingly sequential'' and so
842difficult to parallelize that ``nothing works.''  So the
843case study presented here should be considered an important
844data point in making the case that parallelization can
845indeed be helpful across a broad array of application types.
847<p id="idp493424">
848To overcome the software engineering challenges in applying
849parallel \bitstream{} technology to existing software systems,
850it is clear that better library and tool support is needed.
851The techniques used in the implementation of \icXML{} and
852documented in this paper could well be generalized for
853applications in other contexts and automated through
854the creation of compiler technology specifically supporting
855parallel \bitstream{} programming.
858<div class="bibliography" id="idp494392">
859<h2 class="title" style="clear:both">Bibliography</h2>
860<p class="bibliomixed" id="XMLChip09">[Leventhal and Lemoine 2009] Leventhal, Michael and
861         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
862         Processing XML Efficiently 2009, Montréal.</p>
863<p class="bibliomixed" id="Datapower09">[Salz, Achilles and Maze 2009] Salz, Richard,
864         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
865         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
866         Efficiently 2009, Montréal.</p>
867<p class="bibliomixed" id="PPoPP08">[Cameron 2007] Cameron, Robert D. 2007. A Case Study
868         in SIMD Text Processing with Parallel Bit Streams UTF-8 to UTF-16 Transcoding. Proceedings
869         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
870         Lake City, Utah. On the Web at <a href="" class="link" target="_new"></a>.</p>
871<p class="bibliomixed" id="CASCON08">[Cameron, Herdy and Lin 2008] Cameron, Robert D.,
872         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
873         Technology. Proceedings of CASCON 2008. 13th ACM SIGPLAN Symposium on Principles and
874         Practice of Parallel Programming 2008, Toronto.</p>
875<p class="bibliomixed" id="SVGOpen08">[Herdy, Burggraf and Cameron 2008] Herdy, Kenneth
876         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
877         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
878         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
879         Nuremburg. On the Web at
880            <a href="" class="link" target="_new"></a>.</p>
881<p class="bibliomixed" id="Ross06">[Ross 2006] Ross, Kenneth A. 2006. Efficient hash
882         probes on modern processors. Proceedings of ICDE, 2006. ICDE 2006, Atlanta. On the Web at
883            <a href="" class="link" target="_new"></a>.</p>
884<p class="bibliomixed" id="ASPLOS09">[Cameron and Lin 2009] Cameron, Robert D. and Dan
885         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
886         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</p>
887<p class="bibliomixed" id="Wu08">[Wu et al. 2008] Wu, Yu, Qi Zhang, Zhiqiang Yu and
888         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
889         Proceedings of Balisage 2008, Montréal. On the Web at
890            <a href="" class="link" target="_new"></a>.</p>
891<p class="bibliomixed" id="u8u16">[Cameron 2008] u8u16 - A High-Speed UTF-8 to UTF-16
892         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
893         Science Simon Fraser University, June 21 2007.</p>
894<p class="bibliomixed" id="XML10">[XML 1.0] Extensible Markup Language (XML) 1.0 (Fifth
895         Edition) W3C Recommendation 26 November 2008. On the Web at
896            <a href="" class="link" target="_new"></a>.</p>
897<p class="bibliomixed" id="Unicode">[Unicode] The Unicode Consortium. 2009. On the Web at
898            <a href="" class="link" target="_new"></a>.</p>
899<p class="bibliomixed" id="Pex06">[Hilewitz and Lee 2006]  Hilewitz, Y. and Ruby B. Lee.
900         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
901         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
902         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</p>
903<p class="bibliomixed" id="InfoSet">[XML Infoset] XML Information Set (Second Edition) W3C
904         Recommendation 4 February 2004. On the Web at
905         <a href="" class="link" target="_new"></a>.</p>
906<p class="bibliomixed" id="Saxon">[Saxon] SAXON The XSLT and XQuery Processor. On the Web
907         at <a href="" class="link" target="_new"></a>.</p>
908<p class="bibliomixed" id="Kay08">[Kay 2008]  Kay, Michael Y. 2008. Ten Reasons Why Saxon
909         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</p>
910<p class="bibliomixed" id="AElfred">[Ælfred]  The Ælfred XML Parser. On the Web at
911            <a href="" class="link" target="_new"></a>.</p>
912<p class="bibliomixed" id="JNI">[Hitchens 2002] Hitchens, Ron. Java NIO. O'Reilly, 2002.</p>
913<p class="bibliomixed" id="Expat">[Expat] The Expat XML Parser.
914            <a href="" class="link" target="_new"></a>.</p>
917<div id="balisage-footer"><h3 style="font-family: serif; margin:0.25em">
918<i>Balisage:</i> <small>The Markup Conference</small>
Note: See TracBrowser for help on using the repository browser.