<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
<article xmlns="" version="5.0-subset Balisage-1.3"
   xml:id="HR-23632987-8973">
   <title/>
   <info>
      <!--
      <confgroup>
         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
            Space, Time, or Bandwidth</conftitle>
         <confdates>August 10 2009</confdates>
      </confgroup>
      -->
      <abstract>
         <para>Prior research on the acceleration of XML processing using SIMD and multi-core
            parallelism has led to a number of interesting research prototypes. This work
            investigates the extent to which the techniques underlying these prototypes could result
            in systematic performance benefits when fully integrated into a commercial XML parser.
            The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the
            foundation for the study. A systematic restructuring of the parser was undertaken, while
            maintaining the existing API for application programmers. Using SIMD techniques alone,
            an increase in parsing speed of at least 50% was observed in a range of applications.
            When coupled with pipeline parallelism on dual-core processors, improvements of 2x and
            beyond were realized.</para>
      </abstract>
      <author>
         <personname>
            <firstname>Nigel</firstname>
            <surname>Medforth</surname>
         </personname>
         <personblurb>
            <para>Nigel Medforth is an M.Sc. student at Simon Fraser University and the lead
               developer of icXML. He earned a Bachelor of Technology in Information Technology at
               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
               Outstanding Achievement.</para>
            <para>Nigel is currently researching ways to leverage both the Parabix framework and
               stream-processing models to further accelerate XML parsing within icXML.</para>
         </personblurb>
         <affiliation>
            <jobtitle>Developer</jobtitle>
            <orgname>International Characters Inc.</orgname>
         </affiliation>
         <affiliation>
            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
            <orgname>Simon Fraser University</orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Dan</firstname>
            <surname>Lin</surname>
         </personname>
         <personblurb>
           <para>Dan Lin is a Ph.D. student at Simon Fraser University. She earned a Master of Science
             in Computing Science at Simon Fraser University in 2010. Her research focuses on
             high-performance algorithms that exploit parallelization strategies on various multicore
             platforms.
           </para>
         </personblurb>
         <affiliation>
            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
            <orgname>Simon Fraser University</orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Kenneth</firstname>
            <surname>Herdy</surname>
         </personname>
         <personblurb>
            <para>Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
               of Science in Computing Science with a Certificate in Spatial Information Systems at
               Simon Fraser University in 2005.</para>
            <para>Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
               University with industrial scholarship support from the Natural Sciences and
               Engineering Research Council of Canada, the Mathematics of Information Technology and
               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
               of the principal techniques that may be used to improve XML processing performance in
               the context of the Geography Markup Language (GML).</para>
         </personblurb>
         <affiliation>
            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
            <orgname>Simon Fraser University</orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Rob</firstname>
            <surname>Cameron</surname>
         </personname>
         <personblurb>
            <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
               Sciences at Simon Fraser University. His research interests include programming
               language and software system technology, with a specific focus on high performance
               text processing using SIMD and multicore parallelism. He is the developer of the REX
               XML shallow parser as well as the parallel bit stream (Parabix) framework for SIMD
               text processing.</para>
         </personblurb>
         <affiliation>
            <jobtitle>Professor of Computing Science</jobtitle>
            <orgname>Simon Fraser University</orgname>
         </affiliation>
         <affiliation>
            <jobtitle>Chief Technology Officer</jobtitle>
            <orgname>International Characters, Inc.</orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Arrvindh</firstname>
            <surname>Shriraman</surname>
         </personname>
         <personblurb>
            <para/>
         </personblurb>
         <affiliation>
            <jobtitle/>
            <orgname/>
         </affiliation>
         <email/>
      </author>
      <!--
      <legalnotice>
         <para>Copyright &#x000A9; 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
            Works 2.5 Canada License.</para>
      </legalnotice>
      -->
      <keywordset role="author">
         <keyword/>
      </keywordset>
   </info>
   <section>
      <title>Introduction</title>
      <para/>
      <para/>
      <para/>
      <para/>
   </section>
   <section>
      <title>Background</title>
      <section>
         <title>Xerces C++ Structure</title>
         <para> The Xerces C++ parser is a widely used, standards-conformant XML parser produced
            as open-source software by the Apache Software Foundation. It features comprehensive
            support for a variety of character encodings both commonplace (e.g., UTF-8, UTF-16) and
            rarely used (e.g., EBCDIC), support for multiple XML vocabularies through the XML
            namespace mechanism, as well as complete implementations of structure and data
            validation through multiple grammars declared using either legacy DTDs (document type
            definitions) or modern XML Schema facilities. Xerces also supports several APIs for
            accessing parser services, including event-based parsing using either pull parsing or
            SAX/SAX2 push-style parsing as well as a DOM tree-based parsing interface. </para>
         <para>
            <!--What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?-->
            <!--Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars-->
            <!--Should we show a val-grind summary of a few files in a linechart form?-->
            Xerces, like all traditional parsers, processes XML documents sequentially, a byte at a
            time, from the first to the last byte of input data. Each byte passes through several
            processing layers and is classified and eventually validated within the context of the
            document state. This introduces implicit dependencies between the various tasks within
            the application that make it difficult to optimize for performance. As a complex
            software system, no one feature dominates the overall parsing performance. The table
            below shows the execution-time profile of the top ten functions in a typical run. Even
            if parallelizing any one of these functions in isolation were possible, Amdahl's Law
            dictates that doing so would produce only a minute improvement in overall performance.
            Unfortunately, early investigation found that speculation-free thread-level
            parallelization of these functions was impossible, and that they were already performing
            their given tasks efficiently; thus only trivial enhancements were attainable. To obtain
            a systematic acceleration of Xerces, a comprehensive restructuring involving all aspects
            of the parser should be expected. </para>
         <table>
            <caption>
               <para>Execution Time of Top 10 Xerces Functions</para>
            </caption>
            <colgroup>
               <col align="left" valign="top"/>
               <col align="left" valign="top"/>
            </colgroup>
            <thead><tr><th>Time (%)</th><th>Function Name</th></tr></thead>
            <tbody>
               <tr valign="top"><td>13.29</td><td>XMLUTF8Transcoder::transcodeFrom</td></tr>
               <tr valign="top"><td>7.45</td><td>IGXMLScanner::scanCharData</td></tr>
               <tr valign="top"><td>6.83</td><td>memcpy</td></tr>
               <tr valign="top"><td>5.83</td><td>XMLReader::getNCName</td></tr>
               <tr valign="top"><td>4.67</td><td>IGXMLScanner::buildAttList</td></tr>
               <tr valign="top"><td>4.54</td><td>RefHashTableOf&lt;&gt;::findBucketElem</td></tr>
               <tr valign="top"><td>4.20</td><td>IGXMLScanner::scanStartTagNS</td></tr>
               <tr valign="top"><td>3.75</td><td>ElemStack::mapPrefixToURI</td></tr>
               <tr valign="top"><td>3.58</td><td>ReaderMgr::getNextChar</td></tr>
               <tr valign="top"><td>3.20</td><td>IGXMLScanner::basicAttrValueScan</td></tr>
            </tbody>
         </table>
      </section>
      <section>
         <title>The Parabix Framework</title>
         <para> The Parabix (parallel bit stream) framework is a transformative approach to XML
            parsing (and other forms of text processing). The key idea is to exploit the
            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
            long blocks of input data using one register bit per input byte. To facilitate this, the
            input data is first transposed into a set of basis bit streams. For example, the ASCII
            string <code>b7&lt;A</code> is represented as 8 basis bit streams,
            b<subscript>0</subscript> through b<subscript>7</subscript>, where the i-th stream holds
            the i-th bit of each input byte.
            Boolean-logic operations<footnote><para>&#8743;, &#8744; and &#172; denote the
            boolean AND, OR and NOT operators.</para></footnote> are used to classify the input bits into a set of
               <emphasis role="ital">character-class bit streams</emphasis>, which identify key
            characters (or groups of characters) with a <code>1</code>. For example, one of the
            fundamental characters in XML is a left-angle bracket. A character is an
               <code>&apos;&lt;&apos; if and only if
               &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
               &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>)
               &#8743; &#172; (b<subscript>6</subscript> &#8744;
               b<subscript>7</subscript>) = 1</code>. Similarly, a character is numeric, <code>[0-9]
               if and only if &#172;(b<subscript>0</subscript> &#8744;
               b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743;
                  b<subscript>3</subscript>) &#8743; &#172;(b<subscript>4</subscript>
               &#8743; (b<subscript>5</subscript> &#8744;
            b<subscript>6</subscript>))</code>. An important observation here is that ranges of
            characters may require fewer operations than individual characters and that multiple
            classes can share the classification cost. </para>
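The transposition and the two character-class formulas above can be made concrete with a small sketch. The following Python fragment is our illustration only (not icXML or Parabix code): arbitrary-precision integers stand in for SIMD registers, with bit i of stream b<subscript>j</subscript> holding bit j (most significant first) of input byte i.

```python
# Illustrative Parabix-style basis bit streams (Python ints as registers).

def basis_bit_streams(data: bytes):
    """Transpose input bytes into 8 basis bit streams b0..b7 (b0 = high bit)."""
    streams = [0] * 8
    for i, byte in enumerate(data):
        for j in range(8):
            if byte & (0x80 >> j):        # bit j of the byte, MSB first
                streams[j] |= 1 << i      # becomes bit i of stream b_j
    return streams

def left_angle_stream(b):
    """'<' class: not(b0|b1) and (b2&b3) and (b4&b5) and not(b6|b7)."""
    return ~(b[0] | b[1]) & (b[2] & b[3]) & (b[4] & b[5]) & ~(b[6] | b[7])

def digit_stream(b):
    """[0-9] class: not(b0|b1) and (b2&b3) and not(b4 & (b5|b6))."""
    return ~(b[0] | b[1]) & (b[2] & b[3]) & ~(b[4] & (b[5] | b[6]))

def marks(stream, n):
    """Render the low n bits of a stream, position 0 first, 0 bits as '_'."""
    return ''.join('1' if (stream >> i) & 1 else '_' for i in range(n))

bits = basis_bit_streams(b"b7<A")
print(marks(left_angle_stream(bits), 4))   # '<' is at position 2
print(marks(digit_stream(bits), 4))        # '7' is at position 1
```

A SIMD implementation computes the same streams a full register width of bytes at a time; the per-byte loop here merely substitutes for the transposition network.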
         <para> The basis bit streams for the ASCII string <code>b7&lt;A</code> are shown below;
            reading down column b<subscript>j</subscript> gives the bits that the four characters
            contribute to basis bit stream j. </para>
         <table>
            <caption>
               <para>8-bit ASCII Basis Bit Streams</para>
            </caption>
            <thead><tr><th>String</th><th>b<subscript>0</subscript></th><th>b<subscript>1</subscript></th><th>b<subscript>2</subscript></th><th>b<subscript>3</subscript></th><th>b<subscript>4</subscript></th><th>b<subscript>5</subscript></th><th>b<subscript>6</subscript></th><th>b<subscript>7</subscript></th></tr></thead>
            <tbody>
               <tr valign="top"><td><code>b</code></td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
               <tr valign="top"><td><code>7</code></td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
               <tr valign="top"><td><code>&lt;</code></td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>
               <tr valign="top"><td><code>A</code></td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
            </tbody>
         </table>
         <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
         <!-- bit streams can be transformed into lexical bit streams, where the presence of -->
         <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
         <!-- process, intra-element well-formedness validation is performed on each block -->
         <!-- of text. -->
         <para> Consider, for example, an XML source data stream together with the
            <!-- FIGURE REF Figure \ref{fig:parabix1} -->
            parallel bit streams computed from it in Parabix-style parsing, with each bit of each
            stream in one-to-one correspondence with the source character code units of the input
            stream. The first such bit stream is that for the opening angle brackets that represent
            tag openers in XML. Two further streams partition the tag openers into start tag marks
            and end tag marks, depending on whether the character immediately following the opener
            is a <code>&quot;/&quot;</code> or not. Three more streams can be computed in
            subsequent parsing (using the technique of bitstream addition
            \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names
            and attribute values of tags. </para>
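The tag-opener partition just described can be sketched in a few lines. This is our illustration, not icXML code; the convention assumed is that bit i of a stream corresponds to input position i, so "the character following the opener" is examined by shifting the <code>/</code> stream down by one.

```python
# Illustrative lexical bit streams (Python ints, bit i = input position i).

def char_stream(data: bytes, ch: str) -> int:
    """Bit stream with a 1 at every position holding character ch."""
    s = 0
    for i, byte in enumerate(data):
        if byte == ord(ch):
            s |= 1 << i
    return s

def show(stream: int, n: int) -> str:
    """Render n stream bits, 0 bits as underscores (as in the text)."""
    return ''.join('1' if (stream >> i) & 1 else '_' for i in range(n))

src = b"<doc><item/></doc>"
openers = char_stream(src, '<')            # all tag openers
slash   = char_stream(src, '/')
end_tag_marks   = openers & (slash >> 1)   # opener immediately followed by '/'
start_tag_marks = openers & ~(slash >> 1)  # all other openers

print(src.decode())
print(show(openers, len(src)))
print(show(start_tag_marks, len(src)))
print(show(end_tag_marks, len(src)))
```

Note that under this simple rule the opener of the empty-element tag <code>&lt;item/&gt;</code> is classed with the start tag marks, which is the desired behaviour.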
         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
            parsing performance. The first is that the use of the full register width offers a
            considerable information advantage over sequential byte-at-a-time parsing. That is,
            sequential processing of bytes uses just 8 bits of each register, greatly limiting the
            processor resources that are effectively in use at any one time. The second is that
            byte-at-a-time scanning loops are often computing just a single bit of information per
            iteration: is the scan complete yet? Rather than computing these individual decision
            bits one at a time, an approach that computes many of them in parallel (e.g., 128 bytes
            at a time using 128-bit registers) should provide substantial benefit. </para>
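The second intuition can be sketched as follows. This is an illustration of the idea only (not Parabix code): the sequential loop produces one "stop here?" decision per iteration, while the bitwise form produces every terminator position for a block up front, after which each scan is a single bit-scan operation.

```python
# One decision bit per iteration vs. all decision bits at once (sketch).

DELIMS = frozenset(b"<&")   # scan terminators, as in XML character data

def sequential_scan(data: bytes, pos: int) -> int:
    """Classic scanning loop: one 'is the scan complete?' bit per iteration."""
    while pos < len(data) and data[pos] not in DELIMS:
        pos += 1
    return pos

def delimiter_stream(data: bytes) -> int:
    """All scan-terminator positions for the block, computed up front.
    (The per-byte loop stands in for a few SIMD operations per block.)"""
    s = 0
    for i, byte in enumerate(data):
        if byte in DELIMS:
            s |= 1 << i
    return s

def parallel_scan(stream: int, pos: int) -> int:
    """Next terminator at or after pos, found with a single bit scan."""
    remaining = stream >> pos
    if remaining == 0:
        return -1                                   # no terminator in block
    return pos + (remaining & -remaining).bit_length() - 1

data = b"some text & more <tag>"
stream = delimiter_stream(data)
assert sequential_scan(data, 0) == parallel_scan(stream, 0)
```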
         <para> Previous studies have shown that the Parabix approach improves many aspects of XML
            processing, including transcoding \cite{Cameron2008}, character classification and
            validation, tag parsing and well-formedness checking. The first Parabix parser used
            processor bit scan instructions to considerably accelerate sequential scanning loops for
            individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method
            of parallel scanning using bitstream addition \cite{cameron-EuroPar2011}, as well as
            combining SIMD methods with 4-stage pipeline parallelism to further improve throughput
            \cite{HPCA2012}. Although these research prototypes handled the full syntax of
            schema-less XML documents, they lacked the functionality required by full XML parsers. </para>
         <para> Commercial XML processors support transcoding of multiple character sets and can
            parse and validate against multiple document vocabularies. Additionally, they provide
            API facilities beyond those found in research prototypes, including the widely used SAX,
            SAX2 and DOM interfaces. </para>
      </section>
      <section>
         <title>Sequential vs. Parallel Paradigm</title>
         <para> Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents
            sequentially. Each character is examined to distinguish between the XML-specific markup,
            such as a left angle bracket <code>&lt;</code>, and the content held within the
            document. As the parser progresses through the document, it alternates between markup
            scanning, validation and content processing modes. </para>
         <para> In other words, Xerces belongs to a class of applications termed FSM
            applications<footnote><para>Herein FSM applications are considered software systems
            whose behaviour is defined by the inputs, current state and the events associated with
            transitions of states.</para></footnote>. Each state transition indicates the processing
            context of subsequent characters. Unfortunately, textual data tends to be unpredictable
            and any character could induce a state transition. </para>
         <para> Parabix-style XML parsers utilize a concept of layered processing. A block of source
            text is transformed into a set of lexical bitstreams, which undergo a series of
            operations that can be grouped into logical layers, e.g., transposition, character
            classification, and lexical analysis. Each layer is pipeline parallel and requires
            neither speculation nor pre-parsing stages \cite{HPCA2012}. To meet the API requirements
            of the document-ordered Xerces output, the results of the Parabix processing layers must
            be interleaved to produce the equivalent behaviour. </para>
      </section>
   </section>
   <section>
      <title>Architecture</title>
      <section>
         <title>Overview</title>
         <!--\def \CSG{Content Stream Generator}-->
         <para> icXML is more than an optimized version of Xerces. Many components were grouped,
            restructured and rearchitected with pipeline parallelism in mind. In this section, we
            highlight the core differences between the two systems. As shown in Figure
            <xref linkend="xerces-arch"/>, Xerces comprises five main modules: the transcoder,
            reader, scanner, namespace binder, and validator. The <emphasis role="ital"
            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
            the majority of the character set encoding validation is performed as a byproduct of
            this process. The <emphasis role="ital">Reader</emphasis> is responsible for the
            streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current
            line/column position,
            <!--(which is reported in the unlikely event that the input contains an error), -->
            performs line-break normalization and validates context-specific character set issues,
            such as tokenization of qualified names. The <emphasis role="ital">Scanner</emphasis>
            pulls data through the reader and constructs the intermediate representation (IR) of the
            document; it deals with all issues related to entity expansion, validates the XML
            well-formedness constraints, and handles any character set encoding issues that cannot
            be completely handled by the reader or transcoder (e.g., surrogate characters,
            validation and normalization of character references, etc.). The <emphasis role="ital"
            >Namespace Binder</emphasis> is a core piece of the element stack. It handles namespace
            scoping issues between different XML vocabularies, allowing the scanner to select the
            correct schema grammar structures. The <emphasis role="ital">Validator</emphasis>
            takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder)
            and assesses whether the final output matches the user-defined DTD and schema grammar(s)
            before passing it to the end user. </para>
        <figure xml:id="xerces-arch">
          <title>Xerces Architecture</title>
          <mediaobject>
            <imageobject>
              <imagedata format="png" fileref="xerces.png" width="150cm"/>
            </imageobject>
          </mediaobject>
          <caption>
          </caption>
        </figure>
         <para> In icXML, functions are grouped into logical components. As shown in Figure
            <xref linkend="icxml-arch"/>, two major categories exist: (1) the Parabix Subsystem and
            (2) the Markup Processor. All tasks in (1) use the Parabix Framework \cite{HPCA2012},
            which represents data as a set of parallel bitstreams. The <emphasis role="ital"
            >Character Set Adapter</emphasis>, discussed in Section
            \ref{arch:character-set-adapter}, mirrors Xerces's Transcoder duties; however, instead
            of producing UTF-16 it produces a set of lexical bitstreams, similar to those described
            in the previous section. These lexical bitstreams are later transformed into UTF-16 in
            the Content Stream Generator, after additional processing is performed. The first
            precursor to producing UTF-16 is the <emphasis role="ital">Parallel Markup
            Parser</emphasis> phase. It takes the lexical streams and produces a set of marker
            bitstreams in which a 1 bit identifies a significant position within the input data. One
            bitstream is created for each critical piece of information, such as the beginning and
            ending of start tags, end tags, element names, attribute names, attribute values and
            content. Intra-element well-formedness validation is performed as an artifact of this
            process. Like Xerces, icXML must provide the line and column position of each error. The
            <emphasis role="ital">Line-Column Tracker</emphasis> uses the lexical information to
            keep track of the document position(s) through the use of an optimized population count
            algorithm, described in Section \ref{section:arch:errorhandling}. From here, two
            data-independent branches exist: the Symbol Resolver and Content Preparation Unit. </para>
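The population-count idea behind line/column tracking can be sketched as follows. This is our illustration only (the function names are ours, and icXML's optimized algorithm is not reproduced here): given a bit stream marking newline positions, the line number at any byte offset is one plus the popcount of the newline bits before it, and the column follows from the offset of the most recent newline.

```python
# Line/column recovery from a newline bit stream via population count (sketch).

def newline_stream(data: bytes) -> int:
    """Bit stream marking newline positions (after line-break normalization,
    only LF remains)."""
    s = 0
    for i, byte in enumerate(data):
        if byte == 0x0A:
            s |= 1 << i
    return s

def line_column(nl: int, pos: int):
    """1-based (line, column) of byte offset pos, given the newline stream."""
    before = nl & ((1 << pos) - 1)         # newlines strictly before pos
    line = bin(before).count('1') + 1      # popcount; int.bit_count() in 3.10+
    last_nl = before.bit_length() - 1      # offset of most recent newline, or -1
    column = pos - last_nl
    return line, column

text = b"<doc>\n  <bad&>\n</doc>"
nl = newline_stream(text)
print(line_column(nl, text.index(b"&")))   # the offending '&' -> (2, 7)
```

A SIMD implementation performs the popcount a register at a time over the newline stream, rather than bit by bit.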
         <para> A typical XML file contains few unique element and attribute names&#8212;but
            each of them will occur frequently. icXML stores these as distinct data structures,
            called symbols, each with their own global identifier (GID). Using the symbol marker
            streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol
               Resolver</emphasis> scans through the raw data to produce a sequence of GIDs, called
            the <emphasis role="ital">symbol stream</emphasis>. </para>
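The essence of the symbol stream can be shown in a few lines. The class and names below are ours for illustration, not icXML's: each distinct name is assigned the next free GID on first sight, and the document's names reduce to a sequence of small integers.

```python
# Illustrative symbol resolution: names -> global identifiers (GIDs).

class SymbolTable:
    def __init__(self):
        self._gids = {}

    def resolve(self, name: bytes) -> int:
        """Return the GID for name, assigning the next free GID on first use."""
        return self._gids.setdefault(name, len(self._gids))

symbols = SymbolTable()
# Names as the Symbol Resolver might encounter them in document order:
names = [b"doc", b"item", b"id", b"item", b"id", b"doc"]
symbol_stream = [symbols.resolve(n) for n in names]
print(symbol_stream)   # -> [0, 1, 2, 1, 2, 0]
```

Downstream phases (validation, namespace processing) can then work with GIDs instead of repeatedly comparing and transcoding the raw names.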
         <para> The final components of the Parabix Subsystem are the <emphasis role="ital">Content
               Preparation Unit</emphasis> and <emphasis role="ital">Content Stream
            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
            filters them, according to the information provided by the Parallel Markup Parser, and
            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital"
               >content stream</emphasis>, discussed in Section \ref{section:arch:contentstream}. </para>
         <para> Combined, the symbol and content streams form icXML's compressed IR of the XML
            document. The <emphasis role="ital">Markup Processor</emphasis> parses the IR to
            validate and produce the sequential output for the end user. The <emphasis role="ital"
               >Final WF checker</emphasis> performs inter-element well-formedness validation that
            would be too costly to perform in bit space, such as ensuring every start tag has a
            matching end tag. Xerces's namespace binding functionality is replaced by the <emphasis
               role="ital">Namespace Processor</emphasis>. Unlike Xerces, it is a discrete phase
            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
               stream</emphasis>, which are associated with each symbol occurrence. This is
            discussed in Section \ref{section:arch:namespacehandling}. Finally, the <emphasis
               role="ital">Validation</emphasis> layer implements Xerces's validator; however,
            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
        <figure xml:id="icxml-arch">
          <title>icXML Architecture</title>
          <mediaobject>
            <imageobject>
              <imagedata format="png" fileref="icxml.png" width="500cm"/>
            </imageobject>
          </mediaobject>
          <caption>
          </caption>
        </figure>
      </section>
      <section>
         <title>Character Set Adapters</title>
         <para> In Xerces, all input is transcoded into UTF-16 to simplify the parsing logic of
            Xerces itself and to provide the end consumer with a single encoding format. In the
            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
            because of the need to decode and classify each byte of input, mapping variable-length
            UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In
            other cases, transcoding may involve table look-up operations for each byte of input. In
            any case, transcoding imposes at least the cost of buffer copying. </para>
         <para> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
            that input code units represent valid characters, mapping the characters of the encoding
            into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical
            item streams), as well as supporting ultimate transcoding requirements. All of this work
            is performed using the parallel bitstream representation of the source input. </para>
         <para> An important observation is that many character sets are extensions of the legacy
            7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8,
            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
            serves to compute lexical item streams for all such ASCII-based character sets. </para>
         <para> A second observation is that&#8212;regardless of which character set is
            used&#8212;quite often all of the characters in a particular block of input will be
            within the ASCII range. This is a very simple test to perform using the bitstream
            representation: confirm that the bit 0 stream is zero for the entire block. For blocks
            satisfying this test, all logic dealing with non-ASCII characters can simply be skipped.
            Transcoding to UTF-16 becomes trivial, as the high eight bitstreams of the UTF-16 form
            are each set to zero in this case. </para>
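The all-ASCII block test can be sketched directly. This is an illustration under the conventions of this paper (not icXML code): a block is pure ASCII exactly when its bit 0 (high-bit) basis stream is zero, in which case the high byte of every UTF-16 code unit is zero as well.

```python
# All-ASCII fast-path test via the bit 0 (high-bit) basis stream (sketch).

def bit0_stream(block: bytes) -> int:
    """Basis bit stream 0: a 1 wherever the byte's high bit is set."""
    s = 0
    for i, byte in enumerate(block):
        if byte & 0x80:
            s |= 1 << i
    return s

def transcode_block_to_utf16(block: bytes) -> bytes:
    if bit0_stream(block) == 0:
        # All-ASCII fast path: the high eight UTF-16 bitstreams are zero,
        # so each code unit is just the ASCII byte with a zero high byte.
        return b"".join(bytes([b, 0]) for b in block)   # UTF-16LE
    raise NotImplementedError("full UTF-8 decoding path not sketched here")

assert transcode_block_to_utf16(b"<doc/>") == "<doc/>".encode("utf-16-le")
```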
         <para> A third observation is that repeated transcoding of the names of XML elements,
            attributes and so on can be avoided by using a look-up mechanism. That is, the first
            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
            numeric symbol ID, and the symbol is transcoded at that time. Subsequent look-up
            operations can avoid transcoding by simply retrieving the stored representation. As
            symbol look-up is required in any case to apply various XML validation rules, this
            achieves the effect of transcoding each occurrence without additional cost. </para>
447         <para> The cost of individual character transcoding is avoided whenever a block of input is
448            confined to the ASCII subset and for all but the first occurrence of any XML element or
449            attribute name. Furthermore, when transcoding is required, the parallel bitstream
450            representation supports efficient transcoding operations. In the important case of UTF-8
451            to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit
452            parallel fashion based on UTF-8 streams \cite{Cameron2008}, and all but the final bytes
453            of multi-byte sequences can be marked for deletion as discussed in the following
            subsection. In other cases, transcoding within a block need only be applied for
455            non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream
456            using bit scan operations. </para>
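The bit-scan iteration just described can be sketched as follows (a minimal model of one 64-bit word of the bit 0 stream; `__builtin_ctzll` is the GCC/Clang bit-scan intrinsic):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of bit-scan iteration over one 64-bit word of the bit 0 stream:
// each set bit marks a non-ASCII byte position that needs individual handling.
std::vector<unsigned> non_ascii_positions(uint64_t bit0) {
    std::vector<unsigned> positions;
    while (bit0 != 0) {
        positions.push_back(__builtin_ctzll(bit0));  // lowest set bit = next position
        bit0 &= bit0 - 1;                            // clear that bit
    }
    return positions;
}
```

The loop executes once per non-ASCII byte, so blocks that are mostly ASCII pay almost nothing.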
457      </section>
458      <section>
459         <title>Combined Parallel Filtering</title>
460         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
461            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
462            Chinese characters <code>&#x4F60;&#x597D;</code> are represented as two
463            three-byte UTF-8 sequences <code>E4 BD A0</code> and <code>E5 A5 BD</code> while the
464            UTF-16 representation must be compressed down to the two code units <code>4F60</code>
465            and <code>597D</code>. In the bit parallel representation, this corresponds to a
466            reduction from six bit positions representing UTF-8 code units (bytes) down to just two
467            bit positions representing UTF-16 code units (double bytes). This compression may be
468            achieved by arranging to calculate the correct UTF-16 bits at the final position of each
469            sequence and creating a deletion mask to mark the first two bytes of each 3-byte
470            sequence for deletion. In this case, the portion of the mask corresponding to these
471            input bytes is the bit sequence <code>110110</code>. Using this approach, transcoding
472            may then be completed by applying parallel deletion and inverse transposition of the
            UTF-16 bitstreams \cite{Cameron2008}. </para>
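The deletion-mask rule can be illustrated with a byte-at-a-time model (the real computation operates on transposed bitstreams, and the helper name here is hypothetical): a byte is marked for deletion exactly when the byte that follows it is a UTF-8 continuation byte (`0b10xxxxxx`), leaving only the final byte of each sequence.

```cpp
#include <cassert>
#include <string>

// Byte-at-a-time model of the bit-parallel deletion rule: delete every byte
// that is followed by a UTF-8 continuation byte, keeping only the final byte
// of each multi-byte sequence (ASCII bytes are never marked).
std::string deletion_mask(const std::string& utf8) {
    std::string mask(utf8.size(), '0');
    for (size_t i = 0; i + 1 < utf8.size(); ++i) {
        unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
        if ((next & 0xC0) == 0x80) mask[i] = '1';   // next byte is a continuation
    }
    return mask;
}
```

For the six bytes `E4 BD A0 E5 A5 BD` this reproduces the mask `110110` from the text.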
         <para>
            <figure xml:id="parabix2">
               <title>XML Source Data and Derived Parallel Bit Streams</title>
               <programlisting>Source Data          &lt;document&gt;fee&lt;element a1='fie' a2 = 'foe'&gt;&lt;/element&gt;fum&lt;/document&gt;
Tag Openers          1____________1____________________________1____________1__________
Start Tag Marks      _1____________1___________________________________________________
End Tag Marks        ___________________________________________1____________1_________
Empty Tag Marks      __________________________________________________________________
Element Names        _11111111_____1111111_____________________________________________
Attribute Names      ______________________11_______11_________________________________
Attribute Values     __________________________111________111__________________________
String Ends          1____________1_______________1__________1_1____________1__________
Markup Identifiers   _________1______________1_________1______1_1____________1_________
Deletion Mask        _11111111_____1111111111_1____1111_11_______11111111_____111111111
Undeleted Data       0________&gt;fee0__________=_fie0____=__foe0&gt;0/________fum0/_________</programlisting>
            </figure>
         </para>
500         <para> Rather than immediately paying the costs of deletion and transposition just for
501            transcoding, however, icXML defers these steps so that the deletion masks for several
502            stages of processing may be combined. In particular, this includes core XML requirements
            to normalize line breaks and to replace character references and entity references by
504            their corresponding text. In the case of line break normalization, all forms of line
505            breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations
506            must be normalized to a single LF character in each case. In icXML, this is achieved by
507            first marking CR positions, performing two bit parallel operations to transform the
508            marked CRs into LFs, and then marking for deletion any LF that is found immediately
509            after the marked CR as shown by the Pablo source code in Figure
510            \ref{fig:LBnormalization}.
            <figure xml:id="LBnormalization">
               <title>Line Break Normalization Logic</title>
               <programlisting># XML 1.0 line-break normalization rules.
if lex.CR:
# Modify CR (#x0D) to LF (#x0A)
  u16lo.bit_5 ^= lex.CR
  u16lo.bit_6 ^= lex.CR
  u16lo.bit_7 ^= lex.CR
  CRLF = pablo.Advance(lex.CR) &amp; lex.LF
  callouts.delmask |= CRLF
# Adjust LF streams for line/column tracker
  lex.LF |= lex.CR
  lex.LF ^= CRLF</programlisting>
            </figure>
         </para>
530         <para> In essence, the deletion masks for transcoding and for line break normalization each
531            represent a bitwise filter; these filters can be combined using bitwise-or so that the
532            parallel deletion algorithm need only be applied once. </para>
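The mask combination can be sketched on a single 64-bit block (a simplified model: one bit per byte position, least-significant bit first; the UTF-16 bit flips that rewrite CR to LF are omitted here, and names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Single-block model of line-break normalization feeding a combined
// deletion filter (bit i = byte position i, LSB = first byte).
struct LineBreaks { uint64_t CR, LF; };

inline uint64_t advance(uint64_t s) { return s << 1; }  // move marks one position forward

inline void normalize_linebreaks(LineBreaks& lex, uint64_t& delmask) {
    uint64_t CRLF = advance(lex.CR) & lex.LF;  // LF immediately following a CR
    delmask |= CRLF;        // combined by OR with the transcoding deletion filter
    lex.LF |= lex.CR;       // every CR now counts as a line end
    lex.LF ^= CRLF;         // except the deleted LF of each CR-LF pair
}
```

Because the filters are plain bit vectors, any number of them can be ORed into `delmask` before the single deletion pass.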
         <para> A further application of combined filtering is the processing of XML character and
            entity references. Consider, for example, the references <code>&amp;amp;</code> or
            <code>&amp;#x3C;</code>, which must be replaced in XML processing with the single
            <code>&amp;</code> and <code>&lt;</code> characters, respectively. The
            approach in icXML is to mark all but the first character position of each reference for
            deletion, leaving a single character position unmodified. Thus, for the references
            <code>&amp;amp;</code> and <code>&amp;#x3C;</code> the masks <code>01111</code> and
            <code>011111</code> are formed and combined into the overall deletion mask. After the
            deletion and inverse transposition operations are finally applied, a post-processing
            step inserts the proper character at these positions. One note about this process is
            that it is speculative; references are assumed to generally be replaced by a single
            UTF-16 code unit. In the case that this is not true, it is addressed in
            post-processing. </para>
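Forming such a mask can be sketched as follows (an illustrative helper, not icXML's code; one bit per byte position with the least-significant bit first, so the text's left-to-right mask `01111` appears here as bits 1 through 4; assumes `start + len <= 64`):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: for a reference occupying `len` positions starting at `start`
// within a 64-bit block, mark every position except the first for deletion,
// e.g. the 5-character reference at position 0 yields mask 01111.
inline uint64_t reference_delmask(unsigned start, unsigned len) {
    uint64_t span = (len >= 64) ? ~0ULL : ((1ULL << len) - 1);
    span <<= start;                    // bits covering the whole reference
    return span & ~(1ULL << start);    // keep the first position unmarked
}
```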
         <para> The final step of combined filtering occurs during the process of reducing markup
            data to tag bytes preceding each significant XML transition as described in
            section~\ref{section:arch:contentstream}. Overall, icXML avoids separate buffer copying
            operations for each of these filtering steps, paying the cost of parallel deletion
            and inverse transposition only once. Currently, icXML employs the parallel-prefix
            compress algorithm of Steele \cite{HackersDelight}, whose performance is independent of the
            number of positions deleted. Future versions of icXML are expected to take advantage of
            the parallel extract operation~\cite{HilewitzLee2006} that Intel is now providing in its
            Haswell architecture. </para>
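The compress step can be sketched with the classic 64-bit form of the algorithm from Hacker's Delight; this is a generic illustration rather than icXML's exact code, formulated here with a keep mask (the complement of the deletion mask). It runs in a fixed six rounds regardless of how many bits are deleted, which is why performance is independent of the deletion count.

```cpp
#include <cassert>
#include <cstdint>

// Parallel-prefix compress (Hacker's Delight): move the bits of x selected
// by the keep mask m to the low end, preserving their order.  Always six
// rounds; on Haswell this entire routine is the single PEXT instruction.
uint64_t compress64(uint64_t x, uint64_t m) {
    x &= m;                    // clear the deleted positions
    uint64_t mk = ~m << 1;     // count 0s to the right of each kept bit
    for (unsigned i = 0; i < 6; ++i) {
        uint64_t mp = mk ^ (mk << 1);      // parallel prefix (XOR scan) of mk
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        mp ^= mp << 32;
        uint64_t mv = mp & m;              // bits to move this round
        m = (m ^ mv) | (mv >> (1u << i));  // compress the mask itself
        uint64_t t = x & mv;
        x = (x ^ t) | (t >> (1u << i));    // move selected bits right by 2^i
        mk &= ~mp;
    }
    return x;
}
```

The Haswell `_pext_u64` intrinsic computes the same result in one instruction.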
555      </section>
556      <section>
557         <title>Content Stream</title>
         <para> A relatively unique concept for icXML is the use of a filtered content stream.
            Rather than parsing an XML document in its original format, the input is transformed
            into one that is easier for the parser to iterate through and produce the sequential
            output: the source data
            <code>&lt;document&gt;fee&lt;element a1='fie' a2 = 'foe'&gt;&lt;/element&gt;fum&lt;/document&gt;</code>
            is transformed into
            <code><emphasis role="ital">0</emphasis>&gt;fee<emphasis role="ital">0</emphasis>=fie<emphasis
               role="ital">0</emphasis>=foe<emphasis role="ital">0</emphasis>&gt;<emphasis
               role="ital">0</emphasis>/fum<emphasis role="ital">0</emphasis>/</code>
            through the parallel filtering algorithm described in section \ref{sec:parfilter}. </para>
         <para> Combined with the symbol stream, the parser traverses the content stream to
            effectively reconstruct the input document in its output form. The initial <emphasis
568               role="ital">0</emphasis> indicates an empty content string. The following
569               <code>&gt;</code> indicates that a start tag without any attributes is the first
570            element in this text and the first unused symbol, <code>document</code>, is the element
571            name. Succeeding that is the content string <code>fee</code>, which is null-terminated
            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
            operations are required to produce these strings; as
            Figure~\ref{fig:xerces-profile} shows, such copying accounts for 6.83% of Xerces's execution time.
575            Additionally, it is cheap to locate the terminal character of each string: using the
576            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
577            null character in the content stream in parallel, which in turn means the parser can
578            directly jump to the end of every string without scanning for it. </para>
         <para> Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the
            existence of an attribute. Because all of the intra-element validation was performed in the Parabix
            Subsystem, this must be a legal attribute. Since attributes can only occur within start
            tags and must be accompanied by a textual value, the next symbol in the symbol stream
            must be the element name of a start tag, the following one must be the name of the
            attribute, and the string that follows the <code>=</code> must be its value. However, the
            subsequent <code>=</code> is not treated as an independent attribute because the parser
            has yet to read a <code>&gt;</code>, which marks the end of a start tag. Thus only
            one symbol is taken from the symbol stream and it (along with the string value) is added
            to the element. Eventually the parser reaches a <code>/</code>, which marks the
            existence of an end tag. Every end tag requires an element name, which means it
            requires a symbol. Inter-element validation is performed whenever an end tag is detected to ensure
            that the appropriate scope-nesting rules have been applied. </para>
592      </section>
593      <section>
594         <title>Namespace Handling</title>
595         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
         <para> In XML, namespaces prevent naming conflicts when multiple vocabularies are used
            together. They are especially important when a vocabulary has an application-dependent meaning,
598            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
599            to uniform resource identifiers (URIs), which are strings used to identify specific
600            names or resources. On line 1 of Figure \ref{fig:namespace1}, the <code>xmlns</code>
601            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
602               &apos;<code></code>&apos; and the default (empty) prefix to
603               <code></code>. Thus to the XML processor, the <code>title</code> on line 2
604            and <code>price</code> on line 4 both read as
605            <code>&quot;;:title</code> and
606               <code>&quot;;:price</code> respectively, whereas on line 3 and
607            5, <code>p:name</code> and <code>price</code> are seen as
608               <code>&quot;;:name</code> and
               <code>&quot;;:price</code>. Even though the actual element name
               <code>price</code> is identical in both cases, due to namespace scoping rules they are viewed as two
            uniquely-named items because the current vocabulary is determined by the namespace(s)
            that are in scope. </para>
         <para>
            <figure xml:id="namespace1">
               <title>XML Namespace Example</title>
               <programlisting>1. &lt;book xmlns:p="" xmlns=""&gt;
2.   &lt;title&gt;BOOK NAME&lt;/title&gt;
3.   &lt;p:name&gt;PUBLISHER NAME&lt;/p:name&gt;
4.   &lt;price&gt;X&lt;/price&gt;
5.   &lt;price xmlns=""&gt;Y&lt;/price&gt;
6. &lt;/book&gt;</programlisting>
            </figure>
         </para>
629         <para> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
630            persist for the lifetime of the application through the use of a global URI pool. Xerces
631            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
632            (end tag) occurs in the document. Because a namespace declaration affects the entire
633            element, it must be processed prior to grammar validation. This is a costly process
634            considering that a typical namespaced XML document only comes in one of two forms: (1)
635            those that declare a set of namespaces upfront and never change them, and (2) those that
636            repeatedly modify the namespaces in predictable patterns. </para>
         <para> For that reason, icXML contains an independent namespace stack and utilizes bit
            vectors to cheaply perform speculation and scope resolution with a single XOR
            operation, even when many alterations are performed. When a prefix is
640            declared (e.g., <code>xmlns:p=&quot;;</code>), a namespace binding
            is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution
            process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and
643            every prefix contains a bit vector marking every NSID that has ever been associated with
644            it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding
645            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
646            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
647            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
648            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
649            found using a bit-scan intrinsic. A namespace binding table, similar to Table
650            \ref{tbl:namespace1}, provides the actual URI ID. </para>
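The resolution step can be sketched as follows (names are hypothetical; the sketch assumes, as the scoping mechanism ensures, that at most one binding per prefix is visible at a time, so the ANDed vector has at most one set bit):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of in-scope namespace resolution: AND the prefix's bit vector of
// every NSID ever bound to it with the currently visible namespace vector,
// then recover the surviving NSID with a bit-scan intrinsic (GCC/Clang).
// Returns -1 if the prefix has no visible binding.
inline int in_scope_nsid(uint64_t prefix_bindings, uint64_t visible) {
    uint64_t v = prefix_bindings & visible;
    return v ? __builtin_ctzll(v) : -1;
}
```

A namespace binding table indexed by the returned NSID then yields the URI ID.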
         <para>
            <table>
               <caption>
                  <para>Namespace Binding Table Example</para>
               </caption>
               <colgroup>
                  <col align="left" valign="top"/>
                  <col align="left" valign="top"/>
                  <col align="left" valign="top"/>
                  <col align="left" valign="top"/>
                  <col align="left" valign="top"/>
               </colgroup>
               <tbody>
                  <tr><td>NSID</td><td>Prefix</td><td>URI</td><td>Prefix ID</td><td>URI ID</td></tr>
                  <tr><td>0</td><td>p</td><td/><td>0</td><td>0</td></tr>
                  <tr><td>1</td><td>xmlns</td><td/><td>1</td><td>1</td></tr>
                  <tr><td>2</td><td>xmlns</td><td/><td>1</td><td>0</td></tr>
               </tbody>
            </table>
         </para>
         <para> The look-up itself reduces to a few operations:
            <programlisting>PrefixBindings = PrefixBindingTable[prefixID];
VisiblePrefixBinding = PrefixBindings &amp; CurrentlyVisibleNamespaces;
NSid = bitscan(VisiblePrefixBinding);
URIid = NameSpaceBindingTable[NSid].URIid;</programlisting>
         </para>
673         <para> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
674            any modification to the currently visible namespaces is calculated and stored within a
675            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
676            is found, the currently visible namespaces is XORed with the vector at the top of the
            stack. This allows any number of changes to be performed at each scope level in
            constant time.
679            <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
680         </para>
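The constant-time scope restoration can be sketched as follows (a minimal model with illustrative names):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of constant-time scope restoration: each start tag pushes the XOR
// delta it applied to the visible-namespace vector; the matching end tag
// re-applies that delta, undoing every change of the scope at once.
struct NamespaceScopes {
    uint64_t visible = 0;                 // currently visible NSIDs
    std::vector<uint64_t> deltas;         // per-scope modification vectors

    void start_tag(uint64_t delta) { deltas.push_back(delta); visible ^= delta; }
    void end_tag()                 { visible ^= deltas.back(); deltas.pop_back(); }
};
```

Rebinding a prefix in a nested scope simply sets the new NSID bit and clears the shadowed one in the same delta, so unwinding remains a single XOR.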
681      </section>
682      <section>
683         <title>Error Handling</title>
684         <para>
685            <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
686            Xerces outputs error messages in two ways: through the programmer API and as thrown
            objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to
            assess whether the next character is legal; if not, the current state determines the
            type and severity of the error. icXML emits errors in a similar manner&#8212;but
690            how it discovers them is substantially different. Recall that in Figure
691            \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and
692            Markup Processor, each with its own system for detecting and producing error messages. </para>
693         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
694            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
695            the byte-position of an error within a block, and the type of error is determined by the
696            equation that discovered it. The difficulty of error processing in this section is that
697            in Xerces the line and column number must be given with every error production. Two
            major issues exist because of this: (1) line position adheres to XML whitespace-normalization
699            rules; as such, some sequences of characters, e.g., a carriage return followed by a line
700            feed, are counted as a single new line character. (2) column position is counted in
701            characters, not bytes or code units; thus multi-code-unit code-points and surrogate
702            character pairs are all counted as a single column position. Note that typical XML
703            documents are error-free but the calculation of the line/column position is a constant
704            overhead in Xerces. <!-- that must be maintained in the case that one occurs. --> To
705            reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence
706            of the error and performs the minimal amount of book-keeping necessary to facilitate it.
707            icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates
708            the information within the Line Column Tracker (LCT). One of the CSA's major
709            responsibilities is transcoding an input text.
710            <!-- from some encoding format to near-output-ready UTF-16. --> During this process,
711            white-space normalization rules are applied and multi-code-unit and surrogate characters
712            are detected and validated. A <emphasis role="ital">line-feed bitstream</emphasis>,
            which marks the positions of the normalized newline characters, is a natural
714            derivative of this process. Using an optimized population count algorithm, the line
715            count can be summarized cheaply for each valid block of text.
716            <!-- The optimization delays the counting process .... --> Column position is more
717            difficult to calculate. It is possible to scan backwards through the bitstream of new
            line characters to determine the distance (in code units) between the position at
            which an error was detected and the last line feed. However, this distance may exceed
            the actual character position for the reasons discussed in (2). To handle this, the
721            CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together
722            many relevant bitstreams, such as all trailing multi-code-unit and surrogate characters,
723            and any characters that were removed during the normalization process. When an error is
724            detected, the sum of those skipped positions is subtracted from the distance to
725            determine the actual column number. </para>
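The column computation can be sketched on a single 64-bit block (a simplified model with illustrative names; the real tracker works across blocks and defers this work until an error actually occurs). It yields a 0-based code-unit offset from the start of the line, minus the positions flagged in the skip mask:

```cpp
#include <cassert>
#include <cstdint>

// Single-block sketch of column recovery (bit i = code-unit position i,
// LSB = first position; err_pos < 64 assumed).  Distance back to the last
// line feed, minus positions flagged in the skip mask (trailing code units,
// surrogate halves, normalized-away characters).
unsigned column_of(uint64_t lf_stream, uint64_t skip_mask, unsigned err_pos) {
    uint64_t before = (1ULL << err_pos) - 1;                  // positions < err_pos
    uint64_t prior_lf = lf_stream & before;
    unsigned line_start = prior_lf ? 64 - __builtin_clzll(prior_lf) : 0;
    uint64_t since_lf = before & ~((1ULL << line_start) - 1); // positions on this line
    return (err_pos - line_start)
         - static_cast<unsigned>(__builtin_popcountll(skip_mask & since_lf));
}
```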
726         <para> The Markup Processor is a state-driven machine. As such, error detection within it
727            is very similar to Xerces. However, reporting the correct line/column is a much more
728            difficult problem. The Markup Processor parses the content stream, which is a series of
729            tagged UTF-16 strings. Each string is normalized in accordance with the XML
730            specification. All symbol data and unnecessary whitespace is eliminated from the stream;
            thus it is impossible to derive the current location using only the content stream. To
732            calculate the location, the Markup Processor borrows three additional pieces of
733            information from the Parabix Subsystem: the line-feed, skip mask, and a <emphasis
734               role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the
735            (code-unit) position of every datum that was suppressed from the source during the
736            production of the content stream. Armed with these, it is possible to calculate the
737            actual line/column using the same system as the Parabix Subsystem until the sum of the
738            negated deletion mask stream is equal to the current position. </para>
739      </section>
740   </section>
742   <section>
743      <title>Multithreading with Pipeline Parallelism</title>
      <para> As discussed in section \ref{background:xerces}, Xerces can be considered an FSM
         application. These are &quot;embarrassingly
         sequential&quot; \cite{Asanovic:EECS-2006-183} and notoriously difficult to
747         parallelize. However, icXML is designed to organize processing into logical layers. In
748         particular, layers within the Parabix Subsystem are designed to operate over significant
749         segments of input data before passing their outputs on for subsequent processing. This fits
750         well into the general model of pipeline parallelism, in which each thread is in charge of a
751         single module or group of modules. </para>
752      <para> The most straightforward division of work in icXML is to separate the Parabix Subsystem
         and the Markup Processor into two separate pipeline stages. The
         resultant application, <emphasis role="ital">icXML-p</emphasis>, is a coarse-grained
755         software-pipeline application. In this case, the Parabix Subsystem thread
756               <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a
757         time and produces the content, symbol and URI streams, then stores them in a pre-allocated
758         shared data structure <code>S</code>. The Markup Processor thread
759            <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness
         and grammar-based validation, and then provides parsed XML data to the application through
761         the Xerces API. The shared data structure is implemented using a ring buffer, where every
762         entry contains an independent set of data streams. In the examples of Figure
763         \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A
         lock-free mechanism is applied to ensure that each entry can be read or written by only one
         thread at a time. In Figure \ref{threads_timeline1} the processing time of
766               <code>T<subscript>1</subscript></code> is longer than
767         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
768         waits for <code>T<subscript>1</subscript></code> to write to the shared memory. Figure
769         \ref{threads_timeline2} illustrates the scenario in which
770         <code>T<subscript>1</subscript></code> is faster and must wait for
771            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
772         reuse the memory space. </para>
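A minimal single-producer/single-consumer ring of the kind described can be sketched as follows; names and sizes are illustrative, and the real structure holds a per-entry set of content, symbol and URI streams rather than plain values. The one-reader/one-writer discipline is what allows the lock-free head/tail protocol.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Illustrative single-producer/single-consumer ring buffer: the Parabix
// Subsystem thread pushes filled entries, the Markup Processor thread pops
// them; monotonically increasing head/tail indices make this lock-free.
template <typename T, size_t N>
class SPSCRing {
    std::array<T, N> slots;
    std::atomic<size_t> head{0};   // next slot to write
    std::atomic<size_t> tail{0};   // next slot to read
public:
    bool push(const T& v) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false;  // full
        slots[h % N] = v;
        head.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return false;      // empty
        v = slots[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
```

A full ring stalls the producer (the Figure \ref{threads_timeline2} scenario) and an empty ring stalls the consumer (the Figure \ref{threads_timeline1} scenario).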
773      <para>
774        <figure xml:id="threads_timeline1">
775          <title>Thread Balance in Two-Stage Pipelines</title>
776          <mediaobject>
777            <imageobject>
778              <imagedata format="png" fileref="threads_timeline1.png" width="500cm"/>
779            </imageobject>
780          </mediaobject>
781          <caption>
782          </caption>
783        </figure>
784        <figure xml:id="threads_timeline2">
785          <title>Thread Balance in Two-Stage Pipelines</title>
786          <mediaobject>
787            <imageobject>
788              <imagedata format="png" fileref="threads_timeline2.png" width="500cm"/>
789            </imageobject>
790          </mediaobject>
791          <caption>
792          </caption>
793        </figure>
794      </para>
      <para> Overall, our design is intended to benefit a range of applications. Conceptually, we
         consider two design points. In the first, the parsing performed by the Parabix Subsystem
         dominates at 67% of the overall cost, with the cost of application processing (including
         the driver logic within the Markup Processor) at 33%. The second is almost the opposite
         scenario: the cost of application processing dominates at 60%, while the cost of XML
         parsing represents an overhead of 40%. </para>
801      <para> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
         100% improvement in the parsing engine itself. Consider a best-case scenario: a 100% improvement
         of the Parabix Subsystem for the design point in which XML parsing dominates at 67% of the
804         total application cost. In this case, the single-threaded icXML should achieve a 1.5x
805         speedup over Xerces so that the total application cost reduces to 67% of the original.
806         However, in icXML-p, our ideal scenario gives us two well-balanced threads each performing
807         about 33% of the original work. In this case, Amdahl's law predicts that we could expect up
808         to a 3x speedup at best. </para>
809      <para> At the other extreme of our design range, we consider an application in which core
810         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
811         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
812         significant aspect of our two-stage multi-threaded design then becomes the ability to hide
813         the entire latency of parsing within the serial time required by the application. In this
         case, we achieve an overall speedup in processing time of 1.67x. </para>
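The arithmetic for this second design point can be checked directly, expressing everything as fractions of the original Xerces execution time:

```cpp
#include <cassert>
#include <cmath>

// Second design point: application logic 60%, parsing 40%, with a parsing
// engine assumed to be 2x faster.
double single_threaded = 0.60 + 0.40 / 2.0;  // parsing halved but still serial: 0.80
double pipelined       = 0.60;               // parsing fully hidden behind the app stage
// 1 / 0.80 = 1.25x; 1 / 0.60 = 1.67x.
```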
815      <para> Although the structure of the Parabix Subsystem allows division of the work into
816         several pipeline stages and has been demonstrated to be effective for four pipeline stages
817         in a research prototype \cite{HPCA2012}, our analysis here suggests that the further
818         pipelining of work within the Parabix Subsystem is not worthwhile if the cost of
         application logic is as little as 33% of the end-to-end cost using Xerces. To achieve benefits
820         of further parallelization with multi-core technology, there would need to be reductions in
821         the cost of application logic that could match reductions in core parsing cost. </para>
822   </section>
824   <section>
825      <title>Performance</title>
      <para> We evaluate Xerces-C++ 3.1.1, icXML, and icXML-p against two benchmarking applications: the
827         Xerces C++ SAXCount sample application, and a real world GML to SVG transformation
828         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
829         Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core)
830         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
831         12.04 (Linux). </para>
832      <para> We analyzed the execution profiles of each XML parser using the performance counters
833         found in the processor. We chose several key hardware events that provide insight into the
            profile of each application and indicate if the processor is doing useful work. The
            events included in our study are: processor cycles, branch instructions, branch
836         mispredictions, and cache misses. The Performance Application Programming Interface (PAPI)
837         Version 5.5.0 \cite{papi} toolkit was installed on the test system to facilitate the
838         collection of hardware performance monitoring statistics. In addition, we used the Linux
839         perf \cite{perf} utility to collect per core hardware events. </para>
840      <section>
841         <title>Xerces C++ SAXCount</title>
842         <para> Xerces comes with sample applications that demonstrate salient features of the
843            parser. SAXCount is the simplest such application: it counts the elements, attributes
844            and characters of a given XML file using the (event based) SAX API and prints out the
845            totals. </para>
847         <para> Table \ref{XMLDocChars} shows the document characteristics of the XML input files
            selected for the Xerces C++ SAXCount benchmark. The jaw.xml file represents document-oriented
            XML input and contains the three-byte and four-byte UTF-8 sequences required for the
850            UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML
851            documents and consist entirely of single byte encoded ASCII characters.
852  <table>
853                  <caption>
854                     <para>XML Document Characteristics</para>
855                  </caption>
856                  <colgroup>
857                     <col align="left" valign="top"/>
858                     <col align="left" valign="top"/>
859                     <col align="left" valign="top"/>
860                     <col align="left" valign="top"/>
861                     <col align="left" valign="top"/>
862                  </colgroup>
863                  <tbody>
                     <tr><td>File Name</td><td>jaw.xml</td><td>road.gml</td><td>po.xml</td><td>soap.xml</td></tr>
                     <tr><td>File Type</td><td>document</td><td>data</td><td>data</td><td>data</td></tr>
                     <tr><td>File Size (kB)</td><td>7343</td><td>11584</td><td>76450</td><td>2717</td></tr>
                     <tr><td>Markup Item Count</td><td>74882</td><td>280724</td><td>4634110</td><td>18004</td></tr>
                     <tr><td>Markup Density</td><td>0.13</td><td>0.57</td><td>0.76</td><td>0.87</td></tr>
                  </tbody>
               </table>
         </para>
         <para> A key predictor of the overall parsing performance of an XML file is markup
            density<footnote><para>Markup density: the ratio of markup bytes used to define the structure
            of the document vs. its file size.</para></footnote>. This metric has substantial influence on the
875            performance of traditional recursive descent XML parsers because it directly corresponds
876            to the number of state transitions that occur when parsing a document. We use a mixture
877            of document-oriented and data-oriented XML files to analyze performance over a spectrum
878            of markup densities. </para>
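         <para>As a rough, hypothetical illustration (not code from icXML or Xerces), markup
            density can be estimated by counting the bytes that lie between each opening
            &lt; and its matching &gt;. The sketch below deliberately ignores comments, CDATA
            sections and attribute values containing &gt;.</para>

```cpp
#include <cstddef>
#include <string>

// Estimate markup density: the fraction of bytes that form markup,
// counting every byte from a '<' through the next '>' inclusive.
// Simplified sketch: comments, CDATA sections and '>' characters
// inside attribute values are not handled.
double markup_density(const std::string &doc) {
    std::size_t markup = 0;
    bool in_tag = false;
    for (char c : doc) {
        if (c == '<') in_tag = true;
        if (in_tag)   ++markup;
        if (c == '>') in_tag = false;
    }
    return doc.empty() ? 0.0 : static_cast<double>(markup) / doc.size();
}
```

         <para>For example, the nine-byte document &lt;a&gt;xy&lt;/a&gt; contains seven markup
            bytes, giving a markup density of roughly 0.78.</para>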
         <para> Figure <xref linkend="perf_SAX"/> compares the performance of Xerces, icXML and
            pipelined icXML in terms of CPU cycles per byte for the SAXCount application. The
            speedup for icXML over Xerces ranges from 1.3x to 1.8x. With two threads on the
            multicore machine, icXML-p achieves speedups of up to 2.7x. Xerces is substantially
            slowed by dense markup, but icXML is less affected because of its reduced branch count
            and its use of parallel-processing techniques. icXML-p performs better as markup
            density increases because the work performed by each stage is well balanced in this
            application. </para>
         <para>
            <figure xml:id="perf_SAX">
               <title>SAXCount Performance Comparison</title>
               <mediaobject>
                  <imageobject>
                     <imagedata format="png" fileref="perf_SAX.png" width="500cm"/>
                  </imageobject>
               </mediaobject>
               <caption>
               </caption>
            </figure>
         </para>
      </section>
      <section>
         <title>GML2SVG</title>
         <para> As a more substantial application of XML processing, the GML-to-SVG (GML2SVG)
            application was chosen. This application transforms geospatially encoded data
            represented in Geography Markup Language (GML) \cite{lake2004geography} into a
            different XML format suitable for displayable maps: Scalable Vector Graphics (SVG)
            \cite{lu2007advances}. In the GML2SVG benchmark, GML feature element and GML geometry
            element tags are matched. GML coordinate data are then extracted and transformed to
            the corresponding SVG path data encodings. Equivalent SVG path elements are generated
            and output to the destination SVG document. The GML2SVG application is thus considered
            typical of a broad class of XML applications that parse and extract information from a
            known XML format for the purpose of analysis and restructuring to meet the
            requirements of an alternative format.</para>
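         <para>To make the coordinate transformation concrete, the following hypothetical helper
            (a sketch, not code from the GML2SVG application) shows how a GML coordinates string
            of the form "x1,y1 x2,y2 ..." might be rewritten as SVG path data.</para>

```cpp
#include <sstream>
#include <string>

// Convert a GML <gml:coordinates> payload such as "0,0 10,5 20,5"
// into SVG path data of the form "M 0,0 L 10,5 L 20,5": a moveto
// command for the first point, lineto commands for the rest.
std::string gml_coords_to_svg_path(const std::string &coords) {
    std::istringstream in(coords);
    std::string pair;
    std::string path;
    bool first = true;
    while (in >> pair) {            // whitespace-separated "x,y" pairs
        path += (first ? "M " : " L ") + pair;
        first = false;
    }
    return path;
}
```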
         <para>Our GML-to-SVG data translations are executed on GML source data modelling the city
            of Vancouver, British Columbia, Canada. The GML source document set consists of 46
            distinct GML feature layers ranging in size from approximately 9 KB to 125.2 MB, with
            an average document size of 18.6 MB. Markup density ranges from approximately 0.045 to
            0.719, with an average of 0.519. In this performance study, 213.4 MB of source GML
            data generates 91.9 MB of target SVG data.</para>
         <figure xml:id="perf_GML2SVG">
            <title>Performance Comparison for GML2SVG</title>
            <mediaobject>
               <imageobject>
                  <imagedata format="png" fileref="perf_GML2SVG.png" width="500cm"/>
               </imageobject>
            </mediaobject>
            <caption>
            </caption>
         </figure>
         <para>Figure <xref linkend="perf_GML2SVG"/> compares the performance of the GML2SVG
            application linked against Xerces, icXML and icXML-p. On the GML workload with this
            application, single-threaded icXML achieved about a 50% acceleration over Xerces,
            increasing throughput on our test machine from 58.3 MB/sec to 87.9 MB/sec. Using
            icXML-p, a further throughput increase to 111 MB/sec was recorded, approximately a 2x
            speedup.</para>
         <para>An important aspect of icXML is the replacement of much branch-laden sequential
            code inside Xerces with straight-line SIMD code using far fewer branches. Figure
            <xref linkend="branchmiss_GML2SVG"/> shows the corresponding improvement in branching
            behaviour, with a dramatic reduction in branch misses per kB. It is also interesting
            to note that icXML-p goes even further. In essence, by using pipeline parallelism to
            split the instruction stream onto separate cores, the branch target buffers on each
            core are less overloaded and able to achieve a higher successful branch prediction
            rate.</para>
         <figure xml:id="branchmiss_GML2SVG">
            <title>Comparative Branch Misprediction Rate</title>
            <mediaobject>
               <imageobject>
                  <imagedata format="png" fileref="BM.png" width="500cm"/>
               </imageobject>
            </mediaobject>
            <caption>
            </caption>
         </figure>
         <para>The behaviour of the three versions with respect to L1 cache misses per kB is shown
            in Figure <xref linkend="cachemiss_GML2SVG"/>. Improvements are shown in both
            instruction- and data-cache performance, with the improvements in instruction-cache
            behaviour the most dramatic. Single-threaded icXML shows substantially improved
            performance over Xerces on both measures. Although icXML-p is slightly worse with
            respect to data-cache performance, this is more than offset by a further dramatic
            reduction in its instruction-cache miss rate. Again, partitioning the instruction
            stream through the pipeline parallelism model has significant benefit.</para>
         <figure xml:id="cachemiss_GML2SVG">
            <title>Comparative Cache Miss Rate</title>
            <mediaobject>
               <imageobject>
                  <imagedata format="png" fileref="CM.png" width="500cm"/>
               </imageobject>
            </mediaobject>
            <caption>
            </caption>
         </figure>
         <para>One caveat with this study is that the GML2SVG application did not exhibit a
            balance of processing between application code and Xerces library code reaching the
            33% figure. This suggests that for this application, and possibly others, further
            separating the logical layers of the icXML engine into different pipeline stages could
            well offer significant benefit. This remains an area of ongoing work.</para>
      </section>
   </section>
   <section>
      <title>Conclusion and Future Work</title>
      <para> This paper is the first case study documenting the significant performance benefits
         that may be realized through the integration of parallel bitstream technology into
         existing widely-used software libraries. In the case of the Xerces-C++ XML parser, the
         combined integration of SIMD and multicore parallelism was shown to be capable of
         producing dramatic increases in throughput together with reductions in branch
         mispredictions and cache misses. The modified parser, named icXML, is designed to provide
         the full functionality of the original Xerces library with complete API compatibility.
         Although substantial re-engineering was required to realize the performance potential of
         parallel technologies, this is an important case study demonstrating the general
         feasibility of these techniques. </para>
      <para> The further development of icXML to move beyond 2-stage pipeline parallelism is
         ongoing, with realistic prospects for four reasonably balanced stages within the library.
         For applications such as GML2SVG which are dominated by time spent on XML parsing, such a
         multistage pipelined parsing library should offer substantial benefits. </para>
      <para> The example of XML parsing may be considered prototypical of finite-state machine
         applications, which have sometimes been considered &quot;embarrassingly
         sequential&quot; and so difficult to parallelize that &quot;nothing
         works.&quot; The case study presented here should therefore be considered an important
         data point in making the case that parallelization can indeed be helpful across a broad
         array of application types. </para>
      <para> To overcome the software engineering challenges in applying parallel bitstream
         technology to existing software systems, it is clear that better library and tool support
         is needed. The techniques used in the implementation of icXML and documented in this paper
         could well be generalized for applications in other contexts and automated through the
         creation of compiler technology specifically supporting parallel bitstream programming.
      </para>
   </section>
   <!-- 
   <section>
      <title>Acknowledgments</title>
      <para></para>
   </section>
   -->
   <bibliography>
      <title>Bibliography</title>
      <bibliomixed xml:id="XMLChip09" xreflabel="Leventhal and Lemoine 2009">Leventhal, Michael and
         Eric Lemoine. 2009. The XML chip at 6 years. Proceedings of International Symposium on
         Processing XML Efficiently 2009, Montréal.</bibliomixed>
      <bibliomixed xml:id="Datapower09" xreflabel="Salz, Achilles and Maze 2009">Salz, Richard,
         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
         Efficiently 2009, Montréal.</bibliomixed>
      <bibliomixed xml:id="PPoPP08" xreflabel="Cameron 2007">Cameron, Robert D. 2007. A Case Study
         in SIMD Text Processing with Parallel Bit Streams: UTF-8 to UTF-16 Transcoding.
         Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel
         Programming 2008, Salt Lake City, Utah. On the Web at <link></link>.</bibliomixed>
      <bibliomixed xml:id="CASCON08" xreflabel="Cameron, Herdy and Lin 2008">Cameron, Robert D.,
         Kenneth S. Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit
         Stream Technology. Proceedings of CASCON 2008, Toronto.</bibliomixed>
      <bibliomixed xml:id="SVGOpen08" xreflabel="Herdy, Burggraf and Cameron 2008">Herdy, Kenneth
         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping
         Systems. Proceedings of SVG Open, 6th International Conference on Scalable Vector
         Graphics, Nuremberg. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="Ross06" xreflabel="Ross 2006">Ross, Kenneth A. 2006. Efficient hash
         probes on modern processors. Proceedings of ICDE 2006, Atlanta. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="ASPLOS09" xreflabel="Cameron and Lin 2009">Cameron, Robert D. and Dan
         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</bibliomixed>
      <bibliomixed xml:id="Wu08" xreflabel="Wu et al. 2008">Wu, Yu, Qi Zhang, Zhiqiang Yu and
         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
         Proceedings of Balisage 2008, Montréal. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="u8u16" xreflabel="Cameron 2008">Cameron, Robert D. 2007. u8u16 - A
         High-Speed UTF-8 to UTF-16 Transcoder Using Parallel Bit Streams. Technical Report
         2007-18, School of Computing Science, Simon Fraser University, June 21 2007.</bibliomixed>
      <bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth
         Edition). W3C Recommendation 26 November 2008. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="Unicode" xreflabel="Unicode">The Unicode Consortium. 2009. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="Pex06" xreflabel="Hilewitz and Lee 2006">Hilewitz, Y. and Ruby B. Lee.
         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
         Instructions. Proceedings of the IEEE 17th International Conference on
         Application-Specific Systems, Architectures and Processors (ASAP), pp. 65-72, September
         11-13, 2006.</bibliomixed>
      <bibliomixed xml:id="InfoSet" xreflabel="XML Infoset">XML Information Set (Second Edition).
         W3C Recommendation 4 February 2004. On the Web at
         <link></link>.</bibliomixed>
      <bibliomixed xml:id="Saxon" xreflabel="Saxon">SAXON: The XSLT and XQuery Processor. On the
         Web at <link></link>.</bibliomixed>
      <bibliomixed xml:id="Kay08" xreflabel="Kay 2008">Kay, Michael Y. 2008. Ten Reasons Why Saxon
         XQuery is Fast. IEEE Data Engineering Bulletin, December 2008.</bibliomixed>
      <bibliomixed xml:id="AElfred" xreflabel="Ælfred">The Ælfred XML Parser. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="JNI" xreflabel="Hitchens 2002">Hitchens, Ron. 2002. Java NIO.
         O'Reilly.</bibliomixed>
      <bibliomixed xml:id="Expat" xreflabel="Expat">The Expat XML Parser. On the Web at
            <link></link>.</bibliomixed>
   </bibliography>