source: docs/Balisage13/Bal2013came0601/Bal2013came0601.xml @ 3039

1<?xml version="1.0" encoding="UTF-8"?>
2<!-- MODIFIED DTD LOCATION -->
3<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
4<article xmlns="http://docbook.org/ns/docbook" version="5.0-subset Balisage-1.3"
5  xml:id="HR-23632987-8973">
6   <title></title>
7   <info>
8<!--
9      <confgroup>
10         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
11            Space, Time, or Bandwidth</conftitle>
12         <confdates>August 10 2009</confdates>
13      </confgroup>
14-->
15      <abstract>
16         <para>Prior research on the acceleration of XML processing
17using SIMD and multi-core parallelism has led to
18a number of interesting research prototypes.  This work
19investigates the extent to which the techniques underlying
20these prototypes could result in systematic performance
21benefits when fully integrated into a commercial XML parser.
22The widely used Xerces-C++ parser of the Apache Software
23Foundation was chosen as the foundation for the study.
24A systematic restructuring of the parser was undertaken,
25while maintaining the existing API for application programmers.
26Using SIMD techniques alone, an increase in parsing speed
27of at least 50% was observed in a range of applications.
28When coupled with pipeline parallelism on dual core processors,
29improvements of 2x and beyond were realized.
30</para>
31      </abstract>
32      <author>
33         <personname>
34            <firstname>Nigel</firstname>
35            <surname>Medforth</surname>
36         </personname>
37         <personblurb>
38            <para></para>
39         </personblurb>
40         <affiliation>
41            <jobtitle></jobtitle>
42            <orgname></orgname>
43         </affiliation>
44         <email></email>
45      </author>
46      <author>
47         <personname>
48            <firstname>Dan</firstname>
49            <surname>Lin</surname>
50         </personname>
51         <personblurb>
52            <para></para>
53         </personblurb>
54         <affiliation>
55            <jobtitle></jobtitle>
56            <orgname></orgname>
57         </affiliation>
58         <email></email>
59      </author>
60      <author>
61         <personname>
62            <firstname>Kenneth</firstname>
63            <surname>Herdy</surname>
64         </personname>
65         <personblurb>
66            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
67               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
68               of Science in Computing Science with a Certificate in Spatial Information Systems at
69               Simon Fraser University in 2005.
70                                                </para>
71            <para> Ken is currently pursuing graduate studies in Computing Science at Simon Fraser
72               University with industrial scholarship support from the Natural Sciences and
73               Engineering Research Council of Canada, the Mathematics of Information Technology and
74               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
75               of the principal techniques that may be used to improve XML processing performance in
76               the context of the Geography Markup Language (GML).
77                                                </para>
78         </personblurb>
79         <affiliation>
80            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
81            <orgname>Simon Fraser University </orgname>
82         </affiliation>
83         <email>ksherdy@sfu.ca</email>
84      </author>
85      <author>
86         <personname>
87            <firstname>Rob</firstname>
88            <surname>Cameron</surname>
89         </personname>
90         <personblurb>
91            <para>Dr. Rob Cameron is Professor and Director of Computing Science at Simon Fraser
92               University. With a broad spectrum of research interests related to programming
93               languages, software engineering and sociotechnical design of public computing
94               infrastructure, he has recently been focusing on high performance text processing
95               using parallel bit stream technology and its applications to XML. He is also a
96               patentleft evangelist, advocating university-based technology transfer models
97               dedicated to free use in open source. </para>
98         </personblurb>
99         <affiliation>
100            <jobtitle>Professor of Computing Science</jobtitle>
101            <orgname>Simon Fraser University</orgname>
102         </affiliation>
103         <email>cameron@cs.sfu.ca</email>
104      </author>
105      <author>
106         <personname>
107            <firstname>Arrvindh</firstname>
108            <surname>Shriraman</surname>
109         </personname>
110         <personblurb>
111            <para></para>
112         </personblurb>
113         <affiliation>
114            <jobtitle></jobtitle>
115            <orgname></orgname>
116         </affiliation>
117         <email></email>
118      </author>
119<!--
120      <legalnotice>
121         <para>Copyright &#x000A9; 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
122            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
123            Works 2.5 Canada License.</para>
124      </legalnotice>
125-->
126      <keywordset role="author">
127         <keyword/>
128      </keywordset>
129
130   </info>
131   <section>
132      <title>Introduction</title>
133      <para></para>
134      <para></para>
135      <para></para>
136      <para></para>
137   </section>
138
139   <section>
140      <title>Background</title>
141      <section>
142         <title>Xerces C++ Structure</title>
143<para>
144The Xerces C++ parser
145<!-- is a widely-used standards-conformant -->
146<!-- XML parser produced as open-source software -->
147<!-- by the Apache Software Foundation. -->
148<!-- It -->
149features comprehensive support for a variety of character encodings
150both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
151multiple XML vocabularies through the XML namespace
152mechanism, as well as complete implementations
153of structure and data validation through multiple grammars
154declared using either legacy DTDs (document type
155definitions) or modern XML Schema facilities.
156Xerces also supports several APIs for accessing
157parser services, including event-based parsing
158using either pull parsing or SAX/SAX2 push-style
159parsing as well as a DOM tree-based parsing interface.
160</para>
161<para>
162<!--What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?-->
163<!--Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars-->
164<!--Should we show a val-grind summary of a few files in a linechart form?-->
165Xerces, like all traditional parsers, processes XML documents sequentially, a byte at a time, from the
166first to the last byte of input data. Each byte passes through several processing layers and is
167classified and eventually validated within the context of the document state.
168This introduces implicit dependencies between the various tasks within the application that make it
169difficult to optimize for performance.
170Because Xerces is a complex software system, no one feature dominates the overall parsing performance.
171Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
172Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
173parallelization in isolation would only produce a minute improvement in performance.
174Unfortunately, early investigation into these functions found
175that incorporating speculation-free thread-level parallelization was impossible
176and they were already performing well in their given tasks;
177thus only trivial enhancements were attainable.
178In order to obtain a systematic acceleration of Xerces,
179it should be expected that a comprehensive restructuring
180is required, involving all aspects of the parser.
181</para>
182<para>
183<!-- In order to obtain systematic acceleration of the Xerces parser,-->
184<!-- it should be expected that a comprehensive restructuring-->
185<!-- is required, involving all aspects of the parser.-->
186<!-- FIGURE
187\begin{figure}[h]
188\begin{tabular}{r|l}
189Time (\%) & Function Name \\
190\hline
19113.29   &       XMLUTF8Transcoder::transcodeFrom \\
1927.45    &       IGXMLScanner::scanCharData \\
1936.83    &       memcpy \\
1945.83    &       XMLReader::getNCName \\
1954.67    &       IGXMLScanner::buildAttList \\
1964.54    &       RefHashTableOf\verb|<>|::findBucketElem \\
1974.20    &       IGXMLScanner::scanStartTagNS \\
1983.75    &       ElemStack::mapPrefixToURI \\
1993.58    &       ReaderMgr::getNextChar \\
2003.20    &       IGXMLScanner::basicAttrValueScan \\
201\end{tabular}
202\caption{Execution Time of Top 10 Xerces Functions}
203\label {fig:xerces-profile}
204\end{figure}
205-->
206</para>
207      </section>
208      <section>
209         <title>The Parabix Framework</title>
210<para>
211The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
212(and other forms of text processing). The key idea is to exploit the availability of wide
213SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
214of input data by using one register bit per single input byte.
215To facilitate this, the input data is first transposed into a set of basis bit streams.
216<!--FIGURE REF In Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
217is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
218-->
219<!-- The bits used to construct $\tt b_7$ have been highlighted in this example. -->
220Boolean-logic operations\footnote{&#8743;, &#8744; and &#172; denote the boolean AND, OR and NOT operators.}
221are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
222characters (or groups of characters) with a $1$.
223For example, one of the fundamental characters in XML is a left-angle bracket.
224A character is a <code>&lt;</code> if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
225b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
226Similarly, a character is numeric,
227<code>[0-9]</code>, if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6)) = 1$.
228An important observation here is that ranges of characters may
229require fewer operations than individual characters and
230<!-- the classification cost could be amortized over many character classes.-->
231multiple classes can share the classification cost.
232</para>
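<para>
As an illustration only, the transposition and classification steps described above can be modelled
in Python, with arbitrary-precision integers standing in for SIMD registers. The function names
(<code>transpose</code>, <code>classify</code>, <code>positions</code>) are hypothetical and the
bit-by-bit transposition loop is a simplification; a real Parabix implementation transposes whole
blocks with SIMD operations. The sketch also shows how the two classes share the common
subexpression for the high nibble.
</para>
<programlisting language="python"><![CDATA[
```python
def transpose(data):
    """Split bytes into 8 basis bit streams, b[0] = most significant bit.

    Bit i of every stream corresponds to input byte i. Python ints stand in
    for SIMD registers, so a "block" can be any length in this sketch.
    """
    b = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                b[k] |= 1 << pos
    return b


def classify(data):
    """Return character-class bit streams for '<' and for [0-9]."""
    b = transpose(data)
    width_mask = (1 << len(data)) - 1

    def inv(stream):
        # Bitwise NOT restricted to the positions that exist in this block.
        return ~stream & width_mask

    # Shared subexpression: both classes require a high nibble of 0011.
    high_nibble_3 = inv(b[0] | b[1]) & (b[2] & b[3])
    left_angle = high_nibble_3 & (b[4] & b[5]) & inv(b[6] | b[7])
    digit = high_nibble_3 & inv(b[4] & (b[5] | b[6]))
    return {"left_angle": left_angle, "digit": digit}


def positions(stream):
    """List the positions of the 1 bits in a stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]
```
]]></programlisting>
<para>
For the four-byte input <code>b7&lt;A</code>, the <code>left_angle</code> stream has its only 1 bit
at position 2 and the <code>digit</code> stream has its only 1 bit at position 1.
</para>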
233<para>
234<!-- FIGURE
235\begin{figure}[h]
236\begin{center}
237\begin{tabular}{r c c c c }
238String & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
239ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
240\hline
241\end{tabular}
242\end{center}
243\begin{center}
244\begin{tabular}{r |c |c |c |c |c |c |c |c |}
245 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
246 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
247 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
248 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
249 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
250\end{tabular}
251\end{center}
252\caption{8-bit ASCII Basis Bit Streams}
253\label{fig:BitStreamsExample}
254\end{figure}
255-->
256</para>
257<!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
258<!-- bit streams can be transformed into lexical bit streams, where the presense of -->
259<!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
260<!-- process, intra-element well-formedness validation is performed on each block -->
261<!-- of text. -->
262<para>
263Consider, for example, the XML source data stream shown in the first line of <!-- FIGURE REF Figure \ref{fig:parabix1} -->.
264The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
265parsing, with each bit of each stream in one-to-one correspondence to the source character code units
266of the input stream.
267For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
268The first bit stream shown is that for the opening
269angle brackets that represent tag openers in XML.
270The second and third streams show a partition of the
271tag openers into start tag marks and end tag marks
272depending on whether the character immediately following the
273opener is a <code>/</code> or not.  The remaining three
274lines show streams that can be computed in subsequent
275parsing (using the technique
276of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
277attribute names and attribute values of tags. 
278</para>
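<para>
The derivation of these marker streams can be sketched as follows. This is a simplified model, not
the Parabix code: streams are Python integers whose bit i corresponds to character i, the helper
names are hypothetical, and the set of name characters is reduced to ASCII letters and digits.
The <code>scan_thru</code> helper uses a single addition to move each marker past a run of
name characters, in the manner of the bit stream addition technique cited above.
</para>
<programlisting language="python"><![CDATA[
```python
# Simplifying assumption: names consist of ASCII letters and digits only.
NAME_CHARS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def char_stream(text, chars):
    """Bit stream with a 1 at every position whose character is in chars."""
    return sum(1 << i for i, c in enumerate(text) if c in chars)


def advance(stream):
    """Move every marker one position forward in the text."""
    return stream << 1


def scan_thru(markers, through):
    """Move each marker to the first position after its run of 'through' bits.

    The carry of the addition ripples across each run in one step.
    """
    return (markers + through) & ~through


def render(stream, width):
    """Show a stream with 1 for set bits and underscores for zero bits."""
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(width))


def tag_streams(text):
    openers = char_stream(text, "<")
    slashes = char_stream(text, "/")
    names = char_stream(text, NAME_CHARS)
    after_opener = advance(openers)
    start_marks = after_opener & ~slashes   # '<' not followed by '/'
    end_marks = after_opener & slashes      # '<' followed by '/'
    name_ends = scan_thru(start_marks, names)
    # Subtraction fills every position from a start mark up to its name end.
    element_names = name_ends - start_marks
    return openers, start_marks, end_marks, element_names
```
]]></programlisting>
<para>
Calling <code>render</code> on each returned stream prints it in the underscore notation used in
this section, one character per source position.
</para>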
279<para>
280Two intuitions may help explain how the Parabix approach can lead
281to improved XML parsing performance. The first is that
282the use of the full register width offers a considerable
283information advantage over sequential byte-at-a-time
284parsing.  That is, sequential processing of bytes
285uses just 8 bits of each register, greatly limiting the
286processor resources that are effectively being used at any one time.
287The second is that byte-at-a-time scanning loops are actually
288often just computing a single bit of information per iteration:
289is the scan complete yet?
290Rather than computing these individual decision-bits, an approach that computes
291many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
292should provide substantial benefit.
293</para>
294<para>
295Previous studies have shown that the Parabix approach improves many aspects of XML processing,
296including transcoding \cite{Cameron2008}, character classification and validation,
297tag parsing and well-formedness checking. 
298The first Parabix parser used processor bit scan instructions to considerably accelerate
299sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
300Recent work has incorporated a method of parallel
301scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as
302well as combining SIMD methods with 4-stage pipeline parallelism to further improve
303throughput \cite{HPCA2012}.
304Although these research prototypes handled the full syntax of schema-less XML documents,
305they lacked the functionality required by full XML parsers.
306</para>
307<para>
308Commercial XML processors support transcoding of multiple character sets and can parse and
309validate against multiple document vocabularies.
310Additionally, they provide API facilities beyond those found in research prototypes,
311including the widely used SAX, SAX2 and DOM interfaces.
312</para>
313      </section>
314      <section>
315         <title>Sequential vs. Parallel Paradigm</title>
316<para>
317Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents sequentially.
318Each character is examined to distinguish between the
319XML-specific markup, such as a left angle bracket <code>&lt;</code>, and the
320content held within the document. 
321As the parser progresses through the document, it alternates between markup scanning,
322validation and content processing modes.
323</para>
324<para>
325In other words, Xerces belongs to an equivalence class of applications termed FSM applications\footnote{
326  Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
327  current state and the events associated with transitions of states.}.
328Each state transition indicates the processing context of subsequent characters.
329Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
330</para>
331<para>
332Parabix-style XML parsers utilize a concept of layered processing.
333A block of source text is transformed into a set of lexical \bitstream{}s,
334which undergo a series of operations that can be grouped into logical layers,
335e.g., transposition, character classification, and lexical analysis.
336Each layer is pipeline parallel and requires neither speculation nor pre-parsing stages \cite{HPCA2012}.
337To meet the API requirements of the document-ordered Xerces output,
338the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
339</para>
340      </section>                       
341     </section>                 
342                <section>
343                        <title>Architecture</title>             
344                        <section>
345                       <title>Overview</title>
346<!--\def \CSG{Content Stream Generator}-->
347<para>
348\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
349rearchitected with pipeline parallelism in mind.
350In this section, we highlight the core differences between the two systems.
351As shown in Figure \ref{fig:xerces-arch}, Xerces
352is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
353The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
354the majority of the character set encoding validation is performed as a byproduct of this process.
355The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
356It tracks the current line/column position,
357<!--(which is reported in the unlikely event that the input contains an error), -->
358performs line-break normalization and validates context-specific character set issues,
359such as tokenization of qualified-names.
360The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
361of the document; it deals with all issues related to entity expansion, validates
362the XML well-formedness constraints and any character set encoding issues that cannot
363be completely handled by the reader or transcoder (e.g., surrogate characters, validation
364and normalization of character references, etc.).
365The {\it Namespace Binder} is a core piece of the element stack.
366It handles namespace scoping issues between different XML vocabularies.
367This allows the scanner to properly select the correct schema grammar structures.
368The {\it Validator} takes the IR produced by the Scanner (and
369potentially annotated by the Namespace Binder) and assesses whether the final output matches
370the user-defined DTD and schema grammar(s) before passing it to the end-user.
371</para>
372<para>
373<!-- FIGURE
374\begin{figure}[h]
375\begin{center}
376\includegraphics[height=0.45\textheight,keepaspectratio]{plots/xerces.pdf}
377\caption{Xerces Architecture}
378\label{fig:xerces-arch}
379\end{center}
380\end{figure}
381-->
382</para>
383<para>
384In \icXML{}, functions are grouped into logical components.
385As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
386All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
387The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
388mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
389set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
390These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
391after additional processing is performed.
392The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
393It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
394significant positions within the input data. One \bitstream{} is created for each critical piece of information, such as
395the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
396Intra-element well-formedness validation is performed as an artifact of this process.
397Like Xerces, \icXML{} must provide the Line and Column position of each error.
398The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
399optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
400From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
401</para>
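<para>
The role of population count in line and column tracking can be illustrated with a small sketch.
The optimized algorithm used by icXML is not reproduced here; the code below only assumes that a
line-feed bit stream is available (bit i set when character i is a line feed) and that positions
are counted in code units. The name <code>line_and_column</code> is hypothetical.
</para>
<programlisting language="python"><![CDATA[
```python
def popcount(stream):
    """Number of 1 bits in the stream."""
    return bin(stream).count("1")


def line_and_column(lf_stream, pos):
    """Return the 1-based (line, column) of the code unit at index pos."""
    # Keep only the line feeds that occur strictly before pos.
    before = lf_stream & ((1 << pos) - 1)
    line = 1 + popcount(before)
    # bit_length() is one more than the index of the last preceding line
    # feed, i.e. the index of the first code unit on the current line.
    column = pos - before.bit_length() + 1
    return line, column
```
]]></programlisting>
<para>
For the text <code>ab\ncd\nef</code>, the line-feed stream has bits 2 and 5 set, and the code unit
at index 7 is reported at line 3, column 2.
</para>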
402<para>
403A typical XML file contains few unique element and attribute names&#8212;but each of them will occur frequently.
404\icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID).
405Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
406the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
407</para>
408<para>
409The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
410The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
411information provided by the Parallel Markup Parser, and the latter transforms the
412filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
413</para>
414<para>
415Combined, the symbol and content streams form \icXML{}'s compressed IR of the XML document.
416The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
417The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
418to perform in bit space, such as ensuring every start tag has a matching end tag.
419Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
420it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are
421associated with each symbol occurrence.
422This is discussed in Section \ref{section:arch:namespacehandling}.
423Finally, the {\it Validation} layer implements Xerces's validator.
424However, preprocessing associated with each symbol greatly reduces the work of this stage.
425</para>
426<para>
427<!-- FIGURE
428\begin{figure}[h]
429\begin{center}
430\includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf}
431\end{center}
432\caption{\icXML{} Architecture}
433\label{fig:icxml-arch}
434\end{figure}
435-->
436</para>
437            </section>
438                        <section>
439                       <title>Character Set Adapters</title>
440<para>
441In Xerces, all input is transcoded into UTF-16 to simplify parsing within Xerces itself and
442provide the end-consumer with a single encoding format.
443In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
444because of the need to decode and classify each byte of input, mapping variable-length UTF-8
445byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   
446In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
447transcoding imposes at least a cost of buffer copying.
448</para>
449<para>
450In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
451Given a specified input encoding, a CSA is responsible for checking that
452input code units represent valid characters, mapping the characters of the encoding into
453the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item
454streams), as well as supporting ultimate transcoding requirements.   All of this work
455is performed using the parallel \bitstream{} representation of the source input.
456</para>
457<para>
458An important observation is that many character sets are an
459extension to the legacy 7-bit ASCII character set.  This includes the
460various ISO Latin character sets, UTF-8, UTF-16 and many others.
461Furthermore, all significant characters for parsing XML are confined to the
462ASCII repertoire.   Thus, a single common set of lexical item calculations
463serves to compute lexical item streams for all such ASCII-based character sets.
464</para>
465<para>
466A second observation is that&#8212;regardless of which character set is used&#8212;quite
467often all of the characters in a particular block of input will be within the ASCII range.
468This is a very simple test to perform using the \bitstream{} representation, simply confirming that the
469bit 0 stream is zero for the entire block.   For blocks satisfying this test,
470all logic dealing with non-ASCII characters can simply be skipped.
471Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the
472UTF-16 form are each set to zero in this case.
473</para>
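<para>
A minimal sketch of this fast path follows. It assumes little-endian UTF-16 output and, for
brevity, treats the whole input as one block and delegates the non-ASCII case to Python's codec
machinery; the names <code>bit0_stream</code> and <code>transcode_block</code> are hypothetical
and do not correspond to icXML functions.
</para>
<programlisting language="python"><![CDATA[
```python
def bit0_stream(block):
    """Bit stream of the most significant bit (bit 0) of every byte."""
    return sum(1 << i for i, byte in enumerate(block) if byte & 0x80)


def transcode_block(block, encoding):
    """Transcode one block to UTF-16LE, skipping non-ASCII logic when possible."""
    if bit0_stream(block) == 0:
        # Pure ASCII: every UTF-16 code unit is the input byte with a
        # zero high byte, so no per-character decoding is needed.
        return b"".join(bytes([byte, 0]) for byte in block)
    # General case (stand-in for the bit-parallel transcoder).
    return block.decode(encoding).encode("utf-16-le")
```
]]></programlisting>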
474<para>
475A third observation is that repeated transcoding of the names of XML
476elements, attributes and so on can be avoided by using a look-up mechanism.
477That is, the first occurrence of each symbol is stored in a look-up
478table mapping the input encoding to a numeric symbol ID.   Transcoding
479of the symbol is applied at this time.  Subsequent look-up operations
480can avoid transcoding by simply retrieving the stored representation.
481As symbol look up is required to apply various XML validation rules,
482this achieves the effect of transcoding each occurrence without
483additional cost.
484</para>
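<para>
The look-up mechanism can be pictured as a table keyed on the raw, untranscoded bytes of each name.
The sketch below is an assumption about the general shape of such a table, not the icXML data
structure; the class name <code>SymbolTable</code> and its fields are hypothetical, and UTF-16LE is
assumed as the stored form.
</para>
<programlisting language="python"><![CDATA[
```python
class SymbolTable:
    """Maps raw name bytes to a numeric symbol ID, transcoding each name once."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.gid_by_raw = {}      # raw input bytes -> symbol ID
        self.utf16_by_gid = []    # symbol ID -> name already transcoded to UTF-16
        self.transcode_count = 0  # how many times transcoding was performed

    def resolve(self, raw_name):
        gid = self.gid_by_raw.get(raw_name)
        if gid is None:
            # First occurrence: transcode now and remember the result.
            gid = len(self.utf16_by_gid)
            self.gid_by_raw[raw_name] = gid
            name = raw_name.decode(self.encoding)
            self.utf16_by_gid.append(name.encode("utf-16-le"))
            self.transcode_count += 1
        return gid
```
]]></programlisting>
<para>
Resolving the names <code>item</code>, <code>id</code>, <code>item</code>, <code>id</code>,
<code>item</code> in that order yields the ID sequence 0, 1, 0, 1, 0 while transcoding is performed
only twice.
</para>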
485<para>
486The cost of individual character transcoding is avoided whenever a block of input is
487confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
488Furthermore, when transcoding is required, the parallel \bitstream{} representation
489supports efficient transcoding operations.   
490In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s
491can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
492and all but the final bytes of multi-byte sequences can be marked for deletion as
493discussed in the following subsection.
494In other cases, transcoding within a block only need be applied for non-ASCII
495bytes, which are conveniently identified by iterating through the bit 0 stream
496using bit scan operations.
497</para>
498            </section>
499                        <section>
500                       <title>Combined Parallel Filtering</title>
501<para>
502As just mentioned, UTF-8 to UTF-16 transcoding involves marking
503all but the last bytes of multi-byte UTF-8 sequences as
504positions for deletion.   For example, the two
505Chinese characters 你好
506are represented as two three-byte UTF-8 sequences <code>E4 BD A0</code>
507and <code>E5 A5 BD</code> while the UTF-16 representation must be
508compressed down to the two code units <code>4F60</code> and <code>597D</code>.
509In the bit parallel representation, this corresponds to a reduction
510from six bit positions representing UTF-8 code units (bytes)
511down to just two bit positions representing UTF-16 code units
512(double bytes).   This compression may be achieved by
513arranging to calculate the correct UTF-16 bits at the
514final position of each sequence and creating a deletion
515mask to mark the first two bytes of each 3-byte sequence
516for deletion.   In this case, the portion of the mask
517corresponding to these input bytes is the bit sequence
518<code>110110</code>.  Using this approach, transcoding may then be
519completed by applying parallel deletion and inverse transposition of the
520UTF-16 \bitstream{}s\cite{Cameron2008}.
521</para>
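<para>
The deletion mask for this example can be derived from the stream of UTF-8 continuation bytes, as
the following sketch suggests. It assumes well-formed UTF-8 input and models bit streams as Python
integers with bit i corresponding to byte i; the function names are hypothetical. Only the mask is
computed here, not the UTF-16 bit calculation or the parallel deletion itself.
</para>
<programlisting language="python"><![CDATA[
```python
def utf8_deletion_mask(data):
    """Mark every byte of a multi-byte UTF-8 sequence except the last one."""
    # Continuation bytes have the form 10xxxxxx.
    continuation = sum(
        1 << i for i, byte in enumerate(data) if (byte & 0xC0) == 0x80
    )
    # Byte i is not the last byte of its sequence exactly when byte i+1
    # is a continuation byte, so shift the stream back by one position.
    return continuation >> 1


def render(stream, width):
    """Show a stream as a string of 0s and 1s, first byte on the left."""
    return "".join("1" if (stream >> i) & 1 else "0" for i in range(width))
```
]]></programlisting>
<para>
For the six bytes <code>E4 BD A0 E5 A5 BD</code> the rendered mask is <code>110110</code>, leaving
two surviving positions, one for each UTF-16 code unit.
</para>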
522<para>
523<!-- FIGURE
524\begin{figure*}[tbh]
525\begin{center}
526\begin{tabular}{rr}\\
527Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
528-->
529<!-- Tag Openers & \verb`1____________1____________________________1____________1__________`\\-->
530<!-- Start Tag Marks & \verb`_1____________1___________________________________________________`\\-->
531<!-- End Tag Marks & \verb`___________________________________________1____________1_________`\\-->
532<!-- Empty Tag Marks & \verb`__________________________________________________________________`\\-->
533<!-- Element Names & \verb`_11111111_____1111111_____________________________________________`\\-->
534<!-- Attribute Names & \verb`______________________11_______11_________________________________`\\-->
535<!-- Attribute Values & \verb`__________________________111________111__________________________`\\-->
536<!-- FIGURE
537String Ends & \verb`1____________1_______________1__________1_1____________1__________`\\
538Markup Identifiers & \verb`_________1______________1_________1______1_1____________1_________`\\
539Deletion Mask & \verb`_11111111_____1111111111_1____1111_11_______11111111_____111111111`\\
540Undeleted Data & \verb``{\tt\it 0}\verb`________>fee`{\tt\it 0}\verb`__________=_fie`{\tt\it 0}\verb`____=__foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/________fum`{\tt\it 0}\verb`/_________`
541\end{tabular}
542\end{center}
543\caption{XML Source Data and Derived Parallel Bit Streams}
544\label{fig:parabix2}
545\end{figure*}
546-->
547</para>
548<para>
549Rather than immediately paying the
550costs of deletion and transposition just for transcoding,
551however, \icXML{} defers these steps so that the deletion
552masks for several stages of processing may be combined.
553In particular, this includes core XML requirements
554to normalize line breaks and to replace character
555reference and entity references by their corresponding
556text.   In the case of line break normalization,
557all forms of line breaks, including bare carriage
558returns (CR), line feeds (LF) and CR-LF combinations
559must be normalized to a single LF character in
560each case.   In \icXML{}, this is achieved by
561first marking CR positions, performing two
562bit parallel operations to transform the marked
563CRs into LFs, and then marking for deletion any
564LF that is found immediately after the marked CR
565as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
566<!-- FIGURE
567\begin{figure}
568\begin{verbatim}
569# XML 1.0 line-break normalization rules.
570if lex.CR:
571# Modify CR (#x0D) to LF (#x0A)
572  u16lo.bit_5 ^= lex.CR
573  u16lo.bit_6 ^= lex.CR
574  u16lo.bit_7 ^= lex.CR
575  CRLF = pablo.Advance(lex.CR) & lex.LF
576  callouts.delmask |= CRLF
577# Adjust LF streams for line/column tracker
578  lex.LF |= lex.CR
579  lex.LF ^= CRLF
580\end{verbatim}
581\caption{Line Break Normalization Logic}\label{fig:LBnormalization}
582\end{figure}
583-->
584</para>
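<para>
The normalization rule can be illustrated with a minimal Python sketch in which each
bit stream is modelled as an integer whose bit <emphasis>i</emphasis> corresponds to byte <emphasis>i</emphasis> of the input.
This is an assumption made for illustration only: icXML itself operates on transposed bit planes in
SIMD registers, and the function names below are hypothetical.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def char_stream(data: bytes, byte: int) -> int:
    """Bit i of the result is set when data[i] == byte."""
    return sum(1 << i for i, b in enumerate(data) if b == byte)


def normalize_line_breaks(data: bytes):
    cr = char_stream(data, 0x0D)
    lf = char_stream(data, 0x0A)
    crlf = (cr << 1) & lf            # Advance(CR) & LF: an LF directly after a CR
    delmask = crlf                   # such LFs are only marked, not yet removed
    # CR (0x0D) and LF (0x0A) differ in their three low bits: 0x0D ^ 0x0A == 0x07.
    marked = bytes(b ^ 0x07 if (cr >> i) & 1 else b for i, b in enumerate(data))
    lf_tracker = (lf | cr) ^ crlf    # one bit per normalized line break
    # Applying the mask here is for demonstration; icXML defers this step.
    kept = bytes(b for i, b in enumerate(marked) if not (delmask >> i) & 1)
    return kept, delmask, lf_tracker
```
]]></programlisting>
<para>
For the input <code>a CR LF b CR c LF d</code>, the sketch marks only the LF at position 2 for deletion
and reports three line breaks in the tracker stream.
</para>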
585<para>
586In essence, the deletion masks for transcoding and
587for line break normalization each represent a bitwise
588filter; these filters can be combined using bitwise-or
589so that the parallel deletion algorithm need only be
590applied once.
591</para>
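<para>
A small Python sketch of this idea follows. The two filters are simplified stand-ins:
which byte position a real transcoder marks for deletion is an implementation detail, and this
sketch arbitrarily marks the lead byte of each two-byte UTF-8 sequence. All names are hypothetical.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def prefix_mask(data: bytes) -> int:
    """Hypothetical transcoding filter: lead byte of each two-byte UTF-8 sequence."""
    return sum(1 << i for i, b in enumerate(data) if 0xC2 <= b <= 0xDF)


def crlf_mask(data: bytes) -> int:
    """Line-break filter: every LF that directly follows a CR."""
    cr = sum(1 << i for i, b in enumerate(data) if b == 0x0D)
    lf = sum(1 << i for i, b in enumerate(data) if b == 0x0A)
    return (cr << 1) & lf


def delete(positions, delmask):
    """Apply a deletion mask once, keeping only the unmarked positions."""
    return [p for p in positions if not (delmask >> p) & 1]


data = "\u00e9\r\nz".encode("utf-8")              # bytes C3 A9 0D 0A 7A
combined = prefix_mask(data) | crlf_mask(data)    # bitwise-or of both filters
kept = delete(range(len(data)), combined)         # a single deletion pass
```
]]></programlisting>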
592<para>
A further application of combined filtering
is the processing of XML character and entity
references.   Consider, for example, the references <code>&amp;amp;</code> and <code>&amp;#x3C;</code>,
which must be replaced in XML processing with
the single <code>&amp;</code> and <code>&lt;</code> characters, respectively.
The approach in icXML is to mark all but the first character
position of each reference for deletion, leaving a
single character position unmodified.  Thus, for the
references <code>&amp;amp;</code> and <code>&amp;#x3C;</code>, the
masks <code>01111</code> and <code>011111</code> are formed and
combined into the overall deletion mask.   After the
deletion and inverse transposition operations are finally
applied, a post-processing step inserts the proper character
at these positions.   Note that this process is
speculative: references are assumed to be replaced,
in general, by a single UTF-16 code unit.   Cases in which
this assumption does not hold are handled in post-processing.
610</para>
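<para>
The following Python sketch shows the marking scheme on a short string. It is a simplification under
stated assumptions: references are found with a regular expression rather than with bit streams, only the
five predefined entities are known, and a replacement that needs two UTF-16 code units (the non-speculative
case) is not treated specially. The names <code>REF</code> and <code>replace_references</code> are hypothetical.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
import re

REF = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def replace_references(text: str):
    mask = 0
    replacement = {}                      # position of '&' -> replacement character
    for m in REF.finditer(text):
        body = m.group(1)
        if body.startswith("#x"):
            ch = chr(int(body[2:], 16))
        elif body.startswith("#"):
            ch = chr(int(body[1:]))
        else:
            ch = PREDEFINED[body]         # KeyError for entities this sketch does not know
        replacement[m.start()] = ch
        # Mark every position of the reference except the first for deletion.
        length = m.end() - m.start()
        mask |= ((1 << (length - 1)) - 1) << (m.start() + 1)
    # Deletion followed by the post-processing step that drops in the character.
    kept = [replacement.get(i, c) for i, c in enumerate(text) if not (mask >> i) & 1]
    return "".join(kept), mask
```
]]></programlisting>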
611<para>
The final step of combined filtering occurs during
the process of reducing markup data to tag bytes
preceding each significant XML transition, as described
in section \ref{section:arch:contentstream}.  Overall, icXML
avoids separate buffer copying operations for each of
these filtering steps, paying the cost of parallel
deletion and inverse transposition only once.
Currently, icXML employs the parallel-prefix compress algorithm
of Steele \cite{HackersDelight}.  Its performance
is independent of the number of positions deleted.
Future versions of icXML are expected to
take advantage of the parallel extract operation \cite{HilewitzLee2006}
that Intel is now providing in its Haswell architecture.
625</para>
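<para>
For reference, a 32-bit Python transcription of the parallel-prefix compress routine as it is commonly
presented is sketched below; it is not taken from the icXML sources. The mask <code>m</code> selects the bits to
<emphasis>keep</emphasis>, so a deletion mask would be complemented before the call. The loop always runs
five rounds regardless of the mask, which is why the cost does not depend on how many positions are deleted.
The naive version is included only as a cross-check.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
MASK32 = 0xFFFFFFFF


def compress(x: int, m: int) -> int:
    """Gather the bits of x selected by m into the low-order end (32-bit words)."""
    x &= m
    mk = (~m << 1) & MASK32                 # counts the 0s to the right of each bit
    for i in range(5):                      # log2(32) rounds
        mp = mk ^ ((mk << 1) & MASK32)      # parallel prefix (XOR scan)
        mp ^= (mp << 2) & MASK32
        mp ^= (mp << 4) & MASK32
        mp ^= (mp << 8) & MASK32
        mp ^= (mp << 16) & MASK32
        mv = mp & m                         # bits that move in this round
        m = (m ^ mv) | (mv >> (1 << i))     # compress the mask itself
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))       # compress the data
        mk &= ~mp
    return x


def compress_naive(x: int, m: int) -> int:
    """Bit-at-a-time reference used only to check the routine above."""
    out, k = 0, 0
    for i in range(32):
        if (m >> i) & 1:
            out |= ((x >> i) & 1) << k
            k += 1
    return out
```
]]></programlisting>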
626            </section>
627                        <section>
628                       <title>Content Stream</title>
629<para>
A relatively unique concept in icXML is the use of a filtered content stream.
Rather than parsing an XML document in its original format, icXML transforms the input
into one that is easier for the parser to iterate through when producing the sequential
output.
In Figure \ref{fig:parabix2}, the source data
<!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
is transformed into
<code>0&gt;fee0=fie0=foe0&gt;0/fum0/</code>,
where each <code>0</code> denotes a null character,
through the parallel filtering algorithm described in section \ref{sec:parfilter}.
640</para>
641<para>
Combined with the symbol stream, the parser traverses the content stream to effectively
reconstruct the input document in its output form.
The initial <code>0</code> indicates an empty content string. The following <code>&gt;</code>
indicates that a start tag without any attributes is the first element in this text and
that the first unused symbol, <code>document</code>, is the element name.
Succeeding that is the content string <code>fee</code>, which is null-terminated in accordance
with the Xerces API specification. Unlike in Xerces, no memory-copy operations
are required to produce these strings; as Figure \ref{fig:xerces-profile} shows, such copying
accounts for 6.83% of Xerces's execution time.
Additionally, it is cheap to locate the terminal character of each string:
using the String End bit stream, the Parabix Subsystem can effectively calculate the offset of each
null character in the content stream in parallel, which in turn means the parser can
jump directly to the end of every string without scanning for it.
655</para>
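<para>
One way to picture the offset calculation is that each string end at source position
<emphasis>p</emphasis> lands at <emphasis>p</emphasis> minus the number of deleted positions before it.
The Python sketch below applies this to a short hand-made pair of rows written in the style of the
bit stream figure; the rows and the function names are illustrative assumptions, and the loop is sequential
where icXML would work a block at a time.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def row_to_stream(row: str) -> int:
    """Bit i of the result is set when row[i] == '1' (position 0 is leftmost)."""
    return sum(1 << i for i, c in enumerate(row) if c == "1")


def content_offsets(string_ends: int, delmask: int):
    """Offset in the content stream of every string-end (null) position."""
    offsets = []
    p = 0
    while string_ends >> p:
        if (string_ends >> p) & 1:
            deleted_before = bin(delmask & ((1 << p) - 1)).count("1")
            offsets.append(p - deleted_before)
        p += 1
    return offsets


# Hand-made example rows covering 12 source positions.
STRING_ENDS = row_to_stream("1____1_____1")
DEL_MASK = row_to_stream("_11___111___")
```
]]></programlisting>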
656<para>
Following <code>fee</code> is a <code>=</code>, which marks the existence of an attribute.
Because all of the intra-element validation was performed in the Parabix Subsystem, this must be a legal attribute.
Since attributes can only occur within start tags and must be accompanied by a textual value,
the next symbol in the symbol stream must be the element name of a start tag,
the following one must be the name of the attribute, and the string that follows the <code>=</code> must be its value.
However, the subsequent <code>=</code> does not begin a new element because the parser has yet to
read a <code>&gt;</code>, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and
it (along with the string value) is added to the element.
Eventually the parser reaches a <code>/</code>, which marks the existence of an end tag. Every end tag requires an
element name, and therefore a symbol. Inter-element validation is performed whenever an empty tag is detected to
ensure that the appropriate scope-nesting rules have been applied.
668</para>
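<para>
The traversal just described can be sketched in Python as a loop that alternates between a null-terminated
string and a tag byte. This is a simplified reading of the description rather than the icXML implementation:
the event tuples, the function name and every symbol other than <code>document</code> are hypothetical, no
validation is performed, and the string end is found by searching where icXML would use precomputed offsets.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def walk(content: str, symbols):
    """Turn a content stream plus a symbol stream into a list of events."""
    events = []
    names = iter(symbols)
    open_tag = None      # (name, attrs) while a start tag is being assembled
    attr_name = None     # attribute whose value string is expected next
    pos = 0
    while pos < len(content):
        end = content.index("\0", pos)     # end of the current string
        text = content[pos:end]
        pos = end + 1
        if attr_name is not None:
            open_tag[1][attr_name] = text  # string after '=' is the value
            attr_name = None
        elif text:
            events.append(("text", text))
        if pos == len(content):
            break
        tag = content[pos]
        pos += 1
        if tag == "=":
            if open_tag is None:           # first attribute: element name comes first
                open_tag = (next(names), {})
            attr_name = next(names)
        elif tag == ">":
            if open_tag is None:           # start tag without attributes
                open_tag = (next(names), {})
            events.append(("start", open_tag[0], open_tag[1]))
            open_tag = None
        elif tag == "/":
            events.append(("end", next(names)))
        else:
            raise ValueError("unexpected tag byte %r" % tag)
    return events
```
]]></programlisting>
<para>
Applied to the stream <code>0&gt;fee0=fie0=foe0&gt;0/fum0/</code> with a symbol stream that names the
outer element <code>document</code>, the sketch yields a start event, the text <code>fee</code>, an inner
element carrying two attributes, and the matching end events.
</para>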
669            </section>
670                        <section>
671                       <title>Namespace Handling</title>
672<!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
673<para>
In XML, namespaces prevent naming conflicts when multiple vocabularies are used together.
They are especially important when a vocabulary has an application-dependent meaning, such as when
XML or SVG documents are embedded within XHTML files.
Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify
specific names or resources.
On line 1 of Figure \ref{fig:namespace1}, the <code>xmlns</code> attributes instruct the XML
processor to bind the prefix <code>p</code> to the URI <code>pub.net</code> and the default (empty)
prefix to <code>book.org</code>. Thus, to the XML processor, the <code>title</code> on line 2 and
<code>price</code> on line 4 read as <code>"book.org":title</code> and <code>"book.org":price</code>
respectively, whereas on lines 3 and 5, <code>p:name</code> and <code>price</code> are seen as
<code>"pub.net":name</code> and <code>"pub.net":price</code>. Even though the actual element name
is <code>price</code> in both cases, namespace scoping rules cause them to be viewed as two uniquely named items
because the current vocabulary is determined by the namespaces that are in scope.
687</para>
688<para>
689<!-- FIGURE
690\begin{figure}[h]
691\begin{tabular}{l|l}
6921. & \verb|<book xmlns:p="pub.net" xmlns="book.org">| \\
6932. & \verb|  <title>BOOK NAME</title>| \\
6943. & \verb|  <p:name>PUBLISHER NAME</p:name>| \\
6954. & \verb|  <price>X</price>| \\
6965. & \verb|  <price xmlns="publisher.net">Y</price>| \\
6976. & \verb|</book>| \\
698\end{tabular}
699\caption{XML Namespace Example}
700\label {fig:namespace1}
701\end{figure}
702-->
703</para>
704<para>
In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID.
These mappings persist for the lifetime of the application through the use of a global URI pool.
Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
in the document. Because a namespace declaration affects the entire element, it must be processed prior to
grammar validation. This is a costly process considering that a typical namespaced XML document comes
in only one of two forms:
(1) those that declare a set of namespaces upfront and never change them, and
(2) those that repeatedly modify the namespaces in predictable patterns.
713</para>
714<para>
For that reason, icXML contains an independent namespace stack and utilizes bit vectors to cheaply perform
scope resolution with a single XOR operation, even if many alterations are performed.
<!-- performance advantage figure?? average cycles/byte cost? -->
When a prefix is declared (e.g., <code>xmlns:p="pub.net"</code>), a namespace binding is created that maps
the prefix (prefixes are assigned Prefix IDs in the symbol resolution process) to the URI.
Each unique namespace binding has a unique namespace ID (NSID), and every prefix has a bit vector marking every
NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
prefix binding sets of <code>p</code> and <code>xmlns</code> would be <code>001</code> and <code>110</code> respectively (the rightmost bit corresponds to NSID 0).
723To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
724maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
725NSID can be found using a bit-scan intrinsic.
726A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
727</para>
728<para>
729<!-- FIGURE
730\begin{table}[h]
731\begin{center}
732\begin{tabular}{|c||c|c|c|c|}\hline
733NSID & Prefix & URI & Prefix ID & URI ID \\ \hline\hline
7340 & {\tt p} & {\tt pub.net} & 0 & 0 \\ \hline
7351 & {\tt xmlns} & {\tt books.org} & 1 & 1 \\ \hline
7362 & {\tt xmlns} & {\tt pub.net} & 1 & 0 \\ \hline
737\end{tabular}
738\caption{Namespace Binding Table Example}
739\end{center}
740\label{tbl:namespace1}
741\end{table}
742-->
743</para>
744<para>
745<!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
746<!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
747<!-- NSid = bitscan(VisiblePrefixBinding); -->
748<!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
749</para>
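<para>
The lookup outlined in the comments above can be written out as a short Python sketch using the three bindings
of the example table. The container names and the <code>bitscan</code> helper are assumptions made for
illustration; a C++ implementation would use a hardware bit-scan intrinsic instead.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
# NSID -> (prefix, URI, prefix ID, URI ID), following the example table.
BINDINGS = [
    ("p", "pub.net", 0, 0),
    ("xmlns", "books.org", 1, 1),
    ("xmlns", "pub.net", 1, 0),
]

# Prefix ID -> bit vector of every NSID ever bound to it (bit i <-> NSID i).
PREFIX_BINDINGS = {0: 0b001, 1: 0b110}


def bitscan(v: int) -> int:
    """Index of the lowest set bit, as a bit-scan-forward intrinsic would return."""
    return (v & -v).bit_length() - 1


def resolve_uri_id(prefix_id: int, visible: int) -> int:
    """Map a prefix to the URI ID of its in-scope binding."""
    visible_binding = PREFIX_BINDINGS[prefix_id] & visible
    if visible_binding == 0:
        raise LookupError("prefix has no in-scope binding")
    return BINDINGS[bitscan(visible_binding)][3]
```
]]></programlisting>
<para>
With NSIDs 0 and 1 visible, the default prefix resolves to URI ID 1; once NSID 2 replaces NSID 1 in the visible
set, the same prefix resolves to URI ID 0.
</para>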
750<para>
To ensure that scoping rules are adhered to,
whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored
within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
bit vector of currently visible namespaces is XORed with the vector at the top of the stack.
This allows any number of changes to be performed at each scope level in constant time.
756<!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
757</para>
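<para>
A minimal Python sketch of such a scope stack is given below. The class and method names are hypothetical, and
the way a declaration is passed in (the prefix's binding vector plus the NSID being declared) is an assumption
chosen to keep the example self-contained.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
class NamespaceScopes:
    def __init__(self):
        self.visible = 0     # bit i set <-> NSID i is currently in scope
        self.stack = []      # one delta vector per open element

    def start_tag(self, declarations=()):
        """declarations: iterable of (prefix_binding_vector, nsid) pairs."""
        new_visible = self.visible
        for prefix_vector, nsid in declarations:
            new_visible &= ~prefix_vector    # hide other bindings of this prefix
            new_visible |= 1 << nsid         # expose the newly declared binding
        self.stack.append(self.visible ^ new_visible)   # locally modified bits
        self.visible = new_visible

    def end_tag(self):
        # One XOR undoes every change made by the matching start tag.
        self.visible ^= self.stack.pop()
```
]]></programlisting>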
758            </section>
759                        <section>
760                       <title>Error Handling</title>
761<para>
762<!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal;
if not, the current state determines the type and severity of the error.
icXML emits errors in a similar manner, but how it discovers them is substantially different.
Recall that in Figure \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and the Markup Processor,
each with its own system for detecting and producing error messages.
769</para>
770<para>
Within the Parabix Subsystem, all computations are performed in parallel, a block at a time.
Errors are derived as artifacts of bit stream calculations, with a 1-bit marking the byte position of an error within a block,
and the type of error is determined by the equation that discovered it.
The difficulty of error processing in this section is that in Xerces the line and column number must be given
with every error production. Two major issues exist because of this:
(1) line position adheres to XML white-space normalization rules; as such, some sequences of characters, e.g., a carriage return
followed by a line feed, are counted as a single new line character.
(2) column position is counted in characters, not bytes or code units;
thus multi-code-unit code points and surrogate character pairs are all counted as a single column position.
Note that typical XML documents are error-free, but the calculation of the
line/column position is a constant overhead in Xerces. <!-- that must be maintained in the case that one occurs. -->
To reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence of the error and
performs the minimal amount of book-keeping necessary to facilitate it.
icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
within the Line Column Tracker (LCT).
One of the CSA's major responsibilities is transcoding an input text. <!-- from some encoding format to near-output-ready UTF-16. -->
During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
and validated.
A <emphasis>line-feed bit stream</emphasis>, which marks the positions of the normalized new line characters, is a natural derivative of
this process.
Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
<!-- The optimization delays the counting process .... -->
Column position is more difficult to calculate.
It is possible to scan backwards through the bit stream of new line characters to determine the distance (in code units)
between the position at which an error was detected and the last line feed. However, this distance may exceed
the actual character position for the reasons discussed in (2).
To handle this, the CSA generates a <emphasis>skip mask</emphasis> bit stream by ORing together many relevant bit streams,
such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
normalization process.
When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
column number.
802</para>
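<para>
The deferred calculation can be sketched in Python as follows, again modelling each bit stream as an integer whose
bit <emphasis>i</emphasis> corresponds to code-unit position <emphasis>i</emphasis>. The function name is hypothetical,
and the sketch assumes 1-based line and column numbers and a single block of input.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def popcount(v: int) -> int:
    return bin(v).count("1")


def line_and_column(lf_stream: int, skip_mask: int, error_pos: int):
    """Line and column of error_pos, computed only when an error is reported."""
    below = (1 << error_pos) - 1                    # positions before the error
    line = 1 + popcount(lf_stream & below)          # population count of line feeds
    last_lf = (lf_stream & below).bit_length() - 1  # -1 when no line feed precedes
    distance = error_pos - last_lf                  # in code units
    span = below & ~((1 << (last_lf + 1)) - 1)      # after last LF, before error
    column = distance - popcount(skip_mask & span)  # drop skipped code units
    return line, column
```
]]></programlisting>
<para>
For the nine UTF-16 code units of <code>ab</code>, a line feed, <code>cd</code>, a surrogate pair, <code>e</code>
and <code>!</code>, with the trailing surrogate at position 6 in the skip mask, an error at position 8 is
reported at line 2, column 5.
</para>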
803<para>
The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces.
However, reporting the correct line/column is a much more difficult problem.
The Markup Processor parses the content stream, which is a series of tagged UTF-16 strings.
Each string is normalized in accordance with the XML specification.
All symbol data and unnecessary whitespace is eliminated from the stream;
thus it is impossible to derive the current location using only the content stream.
To calculate the location, the Markup Processor borrows three additional pieces of information from the Parabix Subsystem:
the line-feed bit stream, the skip mask, and a <emphasis>deletion mask stream</emphasis>, which is a bit stream denoting the (code-unit) position of every
datum that was suppressed from the source during the production of the content stream.
Armed with these, it is possible to calculate the actual line/column using
the same system as the Parabix Subsystem, advancing through the source until the sum of the negated deletion mask stream is equal to the current position.
815</para>
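<para>
One plausible reading of this mapping is sketched below in Python: whole blocks are skipped by counting the
zero bits of the deletion mask, and the remaining offset is then located inside the final block. The block size
of 64 and the function name are assumptions; the resulting source position could then be fed to a line/column
calculation.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
BLOCK = 64
BLOCK_MASK = (1 << BLOCK) - 1


def source_position(delmask: int, content_offset: int) -> int:
    """Source position of the code unit at content_offset in the content stream."""
    base = 0
    remaining = content_offset
    while True:
        kept = ~(delmask >> base) & BLOCK_MASK   # negated deletion mask, one block
        count = bin(kept).count("1")
        if remaining < count:
            break
        remaining -= count                       # skip the whole block
        base += BLOCK
    p = 0
    while True:                                  # select within the block
        if (kept >> p) & 1:
            if remaining == 0:
                return base + p
            remaining -= 1
        p += 1
```
]]></programlisting>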
816            </section>
817                </section>
818
819                <section>
820                        <title>Multithreading with Pipeline Parallelism</title>         
821<para>
As discussed in section \ref{background:xerces}, Xerces can be considered a finite-state machine (FSM) application.
Such applications are <quote>embarrassingly sequential</quote> \cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
However, icXML is designed to organize processing into logical layers.
In particular, layers within the Parabix Subsystem are designed to operate
over significant segments of input data before passing their outputs on for
subsequent processing.  This fits well into the general model of pipeline
parallelism, in which each thread is in charge of a single module or group
of modules.
830</para>
831<para>
The most straightforward division of work in icXML is to separate
the Parabix Subsystem and the Markup Processor into two distinct pipeline stages.
The resultant application, <emphasis>icXML-p</emphasis>, is a coarse-grained software-pipeline application.
In this case, the Parabix Subsystem thread T<subscript>1</subscript> reads 16k of XML input I at a time and produces the
content, symbol and URI streams, then stores them in a pre-allocated shared data structure S.
The Markup Processor thread T<subscript>2</subscript> consumes S, performs well-formedness and grammar-based validation,
and then provides parsed XML data to the application through the Xerces API.
The shared data structure is implemented using a ring buffer,
where every entry contains an independent set of data streams.
In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at a time.
In Figure \ref{threads_timeline1}, the processing time of T<subscript>1</subscript> is longer than that of T<subscript>2</subscript>;
thus T<subscript>2</subscript> always waits for T<subscript>1</subscript> to write to the shared memory.
Figure \ref{threads_timeline2} illustrates the scenario in which T<subscript>1</subscript> is faster
and must wait for T<subscript>2</subscript> to finish reading the shared data before it can reuse the memory space.
847</para>
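<para>
The index discipline of a two-thread ring buffer can be sketched in Python as below. This illustrates only the
idea that each index is advanced by exactly one thread; it is not the icXML mechanism, and a C++ version would
additionally need atomic loads and stores with suitable memory ordering. All names are hypothetical.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
import threading
import time


class RingBuffer:
    """Single-producer, single-consumer ring of stream sets."""

    def __init__(self, entries: int = 4):
        self.slots = [None] * entries
        self.head = 0    # next entry to read; advanced only by the consumer
        self.tail = 0    # next entry to write; advanced only by the producer

    def try_put(self, streams) -> bool:
        if self.tail - self.head == len(self.slots):
            return False                      # full: producer must wait
        self.slots[self.tail % len(self.slots)] = streams
        self.tail += 1                        # publish after the entry is written
        return True

    def try_get(self):
        if self.head == self.tail:
            return False, None                # empty: consumer must wait
        item = self.slots[self.head % len(self.slots)]
        self.head += 1                        # release the entry for reuse
        return True, item


def run_pipeline(segments):
    """Pass every segment from a producer thread to a consumer thread."""
    ring, out = RingBuffer(4), []

    def producer():
        for seg in segments:
            while not ring.try_put(seg):
                time.sleep(0)

    def consumer():
        for _ in segments:
            ok, item = ring.try_get()
            while not ok:
                time.sleep(0)
                ok, item = ring.try_get()
            out.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```
]]></programlisting>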
848<para>
849<!-- FIGURE
850\begin{figure}
851\subfigure[]{
852\includegraphics[width=0.48\textwidth]{plots/threads_timeline1.pdf}
853\label{threads_timeline1}
854}
855\hfill
856\subfigure[]{
857\includegraphics[width=0.48\textwidth]{plots/threads_timeline2.pdf}
858\label{threads_timeline2}
859}
860\caption{Thread Balance in Two-Stage Pipelines}
861
862\end{figure}
863-->
864</para>
865<para>
Overall, our design is intended to benefit a range of applications.
Conceptually, we consider two design points.
In the first, the parsing performed by the Parabix Subsystem dominates at 67% of the overall cost,
with the cost of application processing (including the driver logic within the Markup Processor) at 33%.
The second is almost the opposite scenario: the cost of application processing dominates at 60%,
while the cost of XML parsing represents an overhead of 40%.
872</para>
873<para>
Our design is predicated on a goal of using the Parabix
framework to achieve a 50% to 100% improvement in the parsing engine itself.
In the best-case scenario, we assume
a 100% improvement of the Parabix Subsystem for the design point in which
XML parsing dominates at 67% of the total application cost.
In this case, the single-threaded icXML should achieve a 1.5x speedup over Xerces,
so that the total application cost reduces to 67% of the original.
However, in icXML-p, our ideal scenario gives us two well-balanced threads,
each performing about 33% of the original work.
In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.
884</para>
885<para>
At the other extreme of our design range, we consider an application
in which core parsing cost is 40%.   Assuming the 2x speedup of
the Parabix Subsystem over the corresponding Xerces core, single-threaded
icXML delivers a 25% speedup.   However, the most significant
aspect of our two-stage multi-threaded design then becomes the
ability to hide the entire latency of parsing within the serial time
required by the application.   In this case, we achieve
an overall speedup in processing time of 1.67x.
894</para>
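<para>
The arithmetic behind both design points can be checked with a few lines of Python. The model is deliberately
simple and is our own restatement: single-threaded time is the sum of the accelerated parsing share and the
application share, while the idealized two-stage pipeline is limited by the slower of the two stages. The
function names are hypothetical.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def single_thread_speedup(parse_share: float, parse_speedup: float) -> float:
    """Speedup when parsing is accelerated but still runs in series with the application."""
    app_share = 1.0 - parse_share
    return 1.0 / (parse_share / parse_speedup + app_share)


def pipelined_speedup(parse_share: float, parse_speedup: float) -> float:
    """Idealized two-stage pipeline: total time is that of the slower stage."""
    app_share = 1.0 - parse_share
    return 1.0 / max(parse_share / parse_speedup, app_share)


for share in (0.67, 0.40):
    print(share,
          round(single_thread_speedup(share, 2.0), 2),
          round(pipelined_speedup(share, 2.0), 2))
```
]]></programlisting>
<para>
With a 2x parsing improvement, the 67% design point gives roughly 1.5x single-threaded and close to 3x pipelined,
and the 40% design point gives 1.25x and about 1.67x, matching the figures quoted above.
</para>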
895<para>
Although the structure of the Parabix Subsystem allows division of the work into
several pipeline stages and has been demonstrated to be effective
for four pipeline stages in a research prototype \cite{HPCA2012},
our analysis here suggests that the further pipelining of work within
the Parabix Subsystem is not worthwhile if the cost of application logic is as little as
33% of the end-to-end cost using Xerces.  To achieve the benefits of
further parallelization with multi-core technology, there would
need to be reductions in the cost of application logic that
could match reductions in core parsing cost.
905</para>
906                </section>
907
908                <section>
909                        <title>Performance</title>             
910<para>
We evaluate Xerces, icXML and icXML-p using two benchmark applications:
the Xerces C++ SAXCount sample application
and a real-world GML to SVG transformation application.
914We investigated XML parser performance using an Intel Core i7 quad-core
915(Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core),
91632+32 kB (per core) L1 cache,
917256 kB (per core) L2 cache,
9188 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
919</para>
920<para>
921We analyzed the execution profiles of each XML parser
922using the performance counters found in the processor.
923We chose several key hardware events that provide insight into the profile of each
924application and indicate if the processor is doing useful work. 
The events included in our study are:
926processor cycles, branch instructions, branch mispredictions,
927and cache misses. The Performance Application Programming Interface
928(PAPI) Version 5.5.0 \cite{papi} toolkit
929was installed on the test system to facilitate the
930collection of hardware performance monitoring
931statistics. In addition, we used the Linux perf \cite{perf} utility
932to collect per core hardware events.
933</para>
934                        <section>
935                       <title>Xerces C++ SAXCount</title>
936<para>
937Xerces comes with sample applications that demonstrate salient features of the parser.
938SAXCount is the simplest such application:
939it counts the elements, attributes and characters of a given XML file using the (event based) SAX API
940and prints out the totals.
941</para>
942<para>
943<!-- TABLE
944\begin{table}
945\begin{center}
946{
947\footnotesize
948\begin{tabular}{|l||l|l|l|l|l|}
949\hline
950File Name               & jaw.xml               & road.gml      & po.xml        & soap.xml \\ \hline   
951File Type               & document              & data          & data          & data   \\ \hline     
952File Size (kB)          & 7343                  & 11584         & 76450         & 2717 \\ \hline
953Markup Item Count       & 74882                 & 280724        & 4634110       & 18004 \\ \hline
954Markup Density          & 0.13                  & 0.57          & 0.76          & 0.87  \\ \hline
955\end{tabular}
956}
957\end{center}
958\caption{XML Document Characteristics}
959\label{XMLDocChars}
960\end{table}
961-->
962</para>
963<para>
Table \ref{XMLDocChars} shows the document characteristics of the XML input
files selected for the Xerces C++ SAXCount benchmark. The jaw.xml file
represents document-oriented XML input and contains the three-byte and four-byte UTF-8 sequences
required for the UTF-8 encoding of Japanese characters. The remaining files are data-oriented
XML documents and consist entirely of single-byte encoded ASCII characters.
969</para>
970<para>
A key predictor of the overall parsing performance of an XML file is markup density<footnote><para>Markup
density: the ratio of the markup bytes used to define the structure of the document to its file size.</para></footnote>.
This metric has substantial influence on the performance of traditional recursive descent XML parsers
because it directly corresponds to the number of state transitions that occur when parsing a document.
We use a mixture of document-oriented and
data-oriented XML files to analyze performance over a spectrum
of markup densities.
978</para>
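<para>
As a rough illustration of the metric, the Python sketch below approximates markup bytes as every byte from a
<code>&lt;</code> up to and including the next <code>&gt;</code>. This is an assumption for demonstration only:
it ignores comments, CDATA sections and references, and it is not necessarily how the densities reported in
this paper were measured.
</para>
<programlisting xml:space="preserve"><![CDATA[
```python
def markup_density(doc: bytes) -> float:
    """Approximate ratio of bytes inside tags to the total size of the document."""
    markup = 0
    in_tag = False
    for b in doc:
        if b == 0x3C:          # '<' opens a tag
            in_tag = True
        if in_tag:
            markup += 1
        if b == 0x3E:          # '>' closes it
            in_tag = False
    return markup / len(doc) if doc else 0.0
```
]]></programlisting>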
979<para>
Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML in terms of
CPU cycles per byte for the SAXCount application.
The speedup for icXML over Xerces is 1.3x to 1.8x.
With two threads on the multicore machine, icXML-p achieves a speedup of up to 2.7x.
Xerces is substantially slowed by dense markup,
but icXML is less affected, owing to a reduction in branches and the use of parallel-processing techniques.
icXML-p performs better as markup density increases because the work performed by each stage is
well balanced in this application.
988</para>
989<para>
990<!-- FIGURE
991\begin{figure}
992\includegraphics[width=0.5\textwidth]{plots/perf_SAX.pdf}
993\caption{SAXCount Performance Comparison}
994\label{perf_SAX}
995\end{figure}
996-->
997</para>
998            </section>
999                        <section>
1000                       <title>GML2SVG</title>
1001                       <para></para>
1002            </section>
1003                </section>
1004
1005                <section>
1006                        <title>Conclusion and Future Work</title>               
1007<para>
This paper is the first case study documenting the significant
performance benefits that may be realized through the integration
of parallel bit stream technology into existing widely-used software libraries.
In the case of the Xerces-C++ XML parser, the
combined integration of SIMD and multicore parallelism was
shown capable of producing dramatic increases in
throughput and reductions in branch mispredictions and cache misses.
The modified parser, going under the name icXML, is designed
to provide the full functionality of the original Xerces library
with complete compatibility of APIs.  Although substantial
re-engineering was required to realize the
performance potential of parallel technologies, this
is an important case study demonstrating the general
feasibility of these techniques.
1022</para>
1023<para>
The further development of icXML to move beyond 2-stage
pipeline parallelism is ongoing, with realistic prospects for
four reasonably balanced stages within the library.  For
applications such as GML2SVG, which are dominated by time
spent on XML parsing, such a multistage pipelined parsing
library should offer substantial benefits.
1030</para>
1031<para>
The example of XML parsing may be considered prototypical
of finite-state machine applications, which have sometimes
been considered <quote>embarrassingly sequential</quote> and so
difficult to parallelize that <quote>nothing works.</quote>  The
case study presented here should therefore be considered an important
data point in making the case that parallelization can
indeed be helpful across a broad array of application types.
1039</para>
1040<para>
To overcome the software engineering challenges in applying
parallel bit stream technology to existing software systems,
it is clear that better library and tool support is needed.
The techniques used in the implementation of icXML and
documented in this paper could well be generalized for
applications in other contexts and automated through
the creation of compiler technology specifically supporting
parallel bit stream programming.
1049</para>
1050                </section>
1051
1052<!--     
1053   <section>
1054      <title>Acknowledgments</title>
1055      <para></para>
1056   </section>
1057-->
1058   <bibliography>
1059      <title>Bibliography</title>
1060      <bibliomixed xml:id="XMLChip09" xreflabel="Leventhal and Lemoine 2009">Leventhal, Michael and
1061         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
1062         Processing XML Efficiently 2009, Montréal.</bibliomixed>
1063      <bibliomixed xml:id="Datapower09" xreflabel="Salz, Achilles and Maze 2009">Salz, Richard,
1064         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
1065         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
1066         Efficiently 2009, Montréal.</bibliomixed>
1067      <bibliomixed xml:id="PPoPP08" xreflabel="Cameron 2007">Cameron, Robert D. 2007. A Case Study
1068         in SIMD Text Processing with Parallel Bit Streams UTF-8 to UTF-16 Transcoding. Proceedings
1069         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
1070         Lake City, Utah. On the Web at <link>http://research.ihost.com/ppopp08/</link>.</bibliomixed>
1071      <bibliomixed xml:id="CASCON08" xreflabel="Cameron, Herdy and Lin 2008">Cameron, Robert D.,
1072         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
         Technology. Proceedings of CASCON 2008, Toronto.</bibliomixed>
1075      <bibliomixed xml:id="SVGOpen08" xreflabel="Herdy, Burggraf and Cameron 2008">Herdy, Kenneth
1076         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
1077         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
1078         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
         Nuremberg. On the Web at
1080            <link>http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/</link>.</bibliomixed>
1081      <bibliomixed xml:id="Ross06" xreflabel="Ross 2006">Ross, Kenneth A. 2006. Efficient hash
1082         probes on modern processors. Proceedings of ICDE, 2006. ICDE 2006, Atlanta. On the Web at
1083            <link>www.cs.columbia.edu/~kar/pubsk/icde2007.pdf</link>.</bibliomixed>
1084      <bibliomixed xml:id="ASPLOS09" xreflabel="Cameron and Lin 2009">Cameron, Robert D. and Dan
1085         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
1086         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</bibliomixed>
1087      <bibliomixed xml:id="Wu08" xreflabel="Wu et al. 2008">Wu, Yu, Qi Zhang, Zhiqiang Yu and
1088         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
1089         Proceedings of Balisage 2008, Montréal. On the Web at
1090            <link>http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html</link>.</bibliomixed>
1091      <bibliomixed xml:id="u8u16" xreflabel="Cameron 2008">u8u16 - A High-Speed UTF-8 to UTF-16
1092         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
1093         Science Simon Fraser University, June 21 2007.</bibliomixed>
1094      <bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth
1095         Edition) W3C Recommendation 26 November 2008. On the Web at
1096            <link>http://www.w3.org/TR/REC-xml/</link>.</bibliomixed>
1097      <bibliomixed xml:id="Unicode" xreflabel="Unicode">The Unicode Consortium. 2009. On the Web at
1098            <link>http://unicode.org/</link>.</bibliomixed>
1099      <bibliomixed xml:id="Pex06" xreflabel="Hilewitz and Lee 2006"> Hilewitz, Y. and Ruby B. Lee.
1100         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
1101         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
1102         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</bibliomixed>
1103      <bibliomixed xml:id="InfoSet" xreflabel="XML Infoset">XML Information Set (Second Edition) W3C
1104         Recommendation 4 February 2004. On the Web at
1105         <link>http://www.w3.org/TR/xml-infoset/</link>.</bibliomixed>
1106      <bibliomixed xml:id="Saxon" xreflabel="Saxon">SAXON The XSLT and XQuery Processor. On the Web
1107         at <link>http://saxon.sourceforge.net/</link>.</bibliomixed>
1108      <bibliomixed xml:id="Kay08" xreflabel="Kay 2008"> Kay, Michael Y. 2008. Ten Reasons Why Saxon
1109         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</bibliomixed>
1110      <bibliomixed xml:id="AElfred" xreflabel="Ælfred"> The Ælfred XML Parser. On the Web at
1111            <link>http://saxon.sourceforge.net/aelfred.html</link>.</bibliomixed>
1112      <bibliomixed xml:id="JNI" xreflabel="Hitchens 2002">Hitchens, Ron. Java NIO. O'Reilly, 2002.</bibliomixed>
1113      <bibliomixed xml:id="Expat" xreflabel="Expat">The Expat XML Parser.
1114            <link>http://expat.sourceforge.net/</link>.</bibliomixed>
1115   </bibliography>
1116
1117</article>