<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
<article xmlns="" version="5.0-subset Balisage-1.3"
   xml:id="HR-23632987-8973">
   <title>icXML:  Accelerating a Commercial XML
     Parser Using SIMD and Multicore Technologies</title>
   <info>
      <abstract>
         <para>Prior research on the acceleration of XML processing using single-instruction
            multiple-data (SIMD) and multicore
            parallelism has led to a number of interesting research prototypes. This work is
            the first to investigate the extent to which the techniques underlying these prototypes
            could result
            in systematic performance benefits when fully integrated into a commercial XML parser.
            The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the
            foundation for the study. A systematic restructuring of the parser was undertaken, while
            maintaining the existing API for application programmers. Using SIMD techniques alone,
            an increase in parsing speed of at least 50% was observed in a range of applications.
            When coupled with pipeline parallelism on dual-core processors, improvements of 2x and
            beyond were realized.
            icXML is intended as an important industrial contribution in its own right as well
            as an important case study for the underlying Parabix parallel processing framework.
            Based on the success of the icXML development, there is a strong case for continued
            development of that framework as well as for the application of that framework
            to other important XML technology stacks.   An important area for further work is
            the extension of Parabix technology to accelerate Java-based implementations as
            well as ones based on C/C++.
            </para>
      </abstract>
      <author>
         <personname>
            <firstname>Nigel</firstname>
            <surname>Medforth</surname>
         </personname>
         <personblurb>
            <para>Nigel Medforth is an M.Sc. student at Simon Fraser University and the lead
               developer of icXML. He earned a Bachelor of Technology in Information Technology at
               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
               Outstanding Achievement.</para>
            <para>Nigel is currently researching ways to leverage both the Parabix framework and
               stream-processing models to further accelerate XML parsing within icXML.</para>
         </personblurb>
         <affiliation>
            <jobtitle>Developer</jobtitle>
            <orgname>International Characters Inc.</orgname>
         </affiliation>
         <affiliation>
            <jobtitle>Graduate Student</jobtitle>
            <orgname>School of Computing Science, Simon Fraser University </orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Dan</firstname>
            <surname>Lin</surname>
         </personname>
         <personblurb>
           <para>Dan Lin is a Ph.D. student at Simon Fraser University. She earned a Master of Science
             in Computing Science at Simon Fraser University in 2010. Her research focuses on
             high-performance algorithms that exploit parallelization strategies on various multicore platforms.
           </para>
         </personblurb>
         <affiliation>
            <jobtitle>Graduate Student</jobtitle>
            <orgname>School of Computing Science, Simon Fraser University </orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Kenneth</firstname>
            <surname>Herdy</surname>
         </personname>
         <personblurb>
            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
               of Science in Computing Science with a Certificate in Spatial Information Systems at
               Simon Fraser University in 2005. </para>
            <para> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
               University with industrial scholarship support from the Natural Sciences and
               Engineering Research Council of Canada, the Mathematics of Information Technology and
               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
               of the principal techniques that may be used to improve XML processing performance in
               the context of the Geography Markup Language (GML). </para>
         </personblurb>
         <affiliation>
            <jobtitle>Graduate Student</jobtitle>
            <orgname>School of Computing Science, Simon Fraser University </orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Rob</firstname>
            <surname>Cameron</surname>
         </personname>
         <personblurb>
            <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
               Sciences at Simon Fraser University. His research interests include programming
               language and software system technology, with a specific focus on high performance
               text processing using SIMD and multicore parallelism. He is the developer of the REX
               XML shallow parser as well as the parallel bit stream (Parabix) framework for SIMD
               text processing. </para>
         </personblurb>
         <affiliation>
            <jobtitle>Professor of Computing Science</jobtitle>
            <orgname>Simon Fraser University</orgname>
         </affiliation>
         <affiliation>
            <jobtitle>Chief Technology Officer</jobtitle>
            <orgname>International Characters, Inc.</orgname>
         </affiliation>
         <email></email>
      </author>
      <author>
         <personname>
            <firstname>Arrvindh</firstname>
            <surname>Shriraman</surname>
         </personname>
         <personblurb>
            <para/>
         </personblurb>
         <affiliation>
            <jobtitle>Assistant Professor</jobtitle>
            <orgname>School of Computing Science, Simon Fraser University</orgname>
         </affiliation>
         <email></email>
      </author>
      <legalnotice>
         <para>Copyright &#x000A9; 2013 Nigel Medforth, Dan Lin, Kenneth S. Herdy, Robert D. Cameron and Arrvindh Shriraman.
            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
            Works 2.5 Canada License.</para>
      </legalnotice>
      <keywordset role="author">
         <keyword/>
      </keywordset>
   </info>
   <section>
      <title>Introduction</title>
      <para>
        Parallelization and acceleration of XML parsing is a widely
        studied problem that has seen the development of a number
        of interesting research prototypes using both single-instruction
        multiple-data (SIMD) and
        multicore parallelism.   Most work has investigated
        data-parallel solutions on multicore
        architectures using various strategies to break input
        documents into segments that can be allocated to different cores.
        For example, one possibility for data
        parallelization is to add a pre-parsing step to compute
        a skeleton tree structure of an XML document <citation linkend="GRID2006"/>.
        The parallelization of the pre-parsing stage itself can be tackled with
        state machines <citation linkend="E-SCIENCE2007"/>, <citation linkend="IPDPS2008"/>.
        Methods without pre-parsing have used speculation <citation linkend="HPCC2011"/> or post-processing that
        combines the partial results <citation linkend="ParaDOM2009"/>.
        A hybrid technique that combines data and pipeline parallelism was proposed to
        hide the latency of a "job" that has to be done sequentially <citation linkend="ICWS2008"/>.
      </para>
      <para>
        Fewer efforts have investigated SIMD parallelism, although this approach
        has the potential advantage of improving single-core performance as well
        as offering savings in energy consumption <citation linkend="HPCA2012"/>.
        Intel introduced specialized SIMD string processing instructions in the SSE 4.2 instruction set extension
        and showed how they can be used to improve the performance of XML parsing <citation linkend="XMLSSE42"/>.
        The Parabix framework uses generic SIMD extensions and bit-parallel methods to
        process hundreds of XML input characters simultaneously <citation linkend="Cameron2009"/> <citation linkend="cameron-EuroPar2011"/>.
        Parabix prototypes have also combined SIMD methods with thread-level parallelism to
        achieve further acceleration on multicore systems <citation linkend="HPCA2012"/>.
      </para>
      <para>
        In this paper, we move beyond research prototypes to consider
        the detailed integration of both SIMD and multicore parallelism into the
        Xerces-C++ parser of the Apache Software Foundation, an existing
        standards-compliant open-source parser that is widely used
        in commercial practice.    The challenge of this work is
        to parallelize the Xerces parser in such a way as to
        preserve the existing APIs while still offering worthwhile
        end-to-end acceleration of XML processing.
        To achieve the best results possible, we undertook
        a comprehensive nine-month restructuring of the Xerces-C++ parser,
        seeking to expose as many critical aspects of XML parsing
        as possible for parallelization; we named the result icXML.
        Overall, we employed Parabix-style methods of transcoding, tokenization
        and tag parsing, parallel string comparison methods in symbol
        resolution, bit-parallel methods in namespace processing,
        as well as staged processing using pipeline parallelism to take advantage of
        multiple cores.
      </para>
      <para>
        The remainder of this paper is organized as follows.
        <xref linkend="background" endterm=""/> discusses the structure of the Xerces and Parabix XML parsers and the fundamental
        differences between the two parsing models.
        <xref linkend="architecture"/> then presents the icXML design based on a restructured Xerces architecture to
        incorporate SIMD parallelism using Parabix methods.
        <xref linkend="multithread"/> moves on to consider the multithreading of the icXML architecture
        using the pipeline parallelism model.
        <xref linkend="performance"/> analyzes the performance of both the single-threaded and
        multi-threaded versions of icXML in comparison to original Xerces,
        demonstrating substantial end-to-end acceleration of
        a GML-to-SVG translation application written against the Xerces API.
        <xref linkend="conclusion"/> concludes the paper with a discussion of future work and the potential for
        applying the techniques discussed herein in other application domains.
      </para>
   </section>
   <section xml:id="background">
      <title>Background</title>
      <section xml:id="background-xerces">
         <title>Xerces C++ Structure</title>
         <para> The Xerces C++ parser is a widely used, standards-conformant
            XML parser produced as open-source software
            by the Apache Software Foundation.
            It features comprehensive support for a variety of character encodings both
            commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple
            XML vocabularies through the XML namespace mechanism, as well as complete
            implementations of structure and data validation through multiple grammars declared
            using either legacy DTDs (document type definitions) or modern XML Schema facilities.
            Xerces also supports several APIs for accessing parser services, including event-based
            parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM
            tree-based parsing interface. </para>
         <para>
            Xerces,
            like all traditional parsers, processes XML documents sequentially, one byte at a time, from
            the first to the last byte of input data. Each byte passes through several processing
            layers and is classified and eventually validated within the context of the document
            state. This introduces implicit dependencies between the various tasks within the
            application that make it difficult to optimize for performance. Because Xerces is a complex
            software system, no single feature dominates the overall parsing performance. <xref linkend="xerces-profile"/>
            shows the execution time profile of the top ten functions in a
            typical run. Even if it were possible to parallelize any one of these functions
            in isolation, Amdahl's Law dictates that doing so would produce only a small improvement
            in performance. Unfortunately, early investigation into these functions found that
            incorporating speculation-free thread-level parallelization was impossible and that they were
            already performing well in their given tasks; thus only trivial enhancements were
            attainable. To obtain a systematic acceleration of Xerces, it should be
            expected that a comprehensive restructuring is required, involving all aspects of the
            parser. </para>
             <table xml:id="xerces-profile">
                  <caption>
                     <para>Execution Time of Top 10 Xerces Functions</para>
                  </caption>
                  <colgroup>
                     <col align="left" valign="top"/>
                     <col align="left" valign="top"/>
                  </colgroup>
                  <thead><tr><th>Time (%) </th><th> Function Name </th></tr></thead>
                  <tbody>
<tr valign="top"><td>13.29      </td>   <td>XMLUTF8Transcoder::transcodeFrom </td></tr>
<tr valign="top"><td>7.45       </td>   <td>IGXMLScanner::scanCharData </td></tr>
<tr valign="top"><td>6.83       </td>   <td>memcpy </td></tr>
<tr valign="top"><td>5.83       </td>   <td>XMLReader::getNCName </td></tr>
<tr valign="top"><td>4.67       </td>   <td>IGXMLScanner::buildAttList </td></tr>
<tr valign="top"><td>4.54       </td>   <td>RefHashTableOf&lt;&gt;::findBucketElem </td></tr>
<tr valign="top"><td>4.20       </td>   <td>IGXMLScanner::scanStartTagNS </td></tr>
<tr valign="top"><td>3.75       </td>   <td>ElemStack::mapPrefixToURI </td></tr>
<tr valign="top"><td>3.58       </td>   <td>ReaderMgr::getNextChar </td></tr>
<tr valign="top"><td>3.20       </td>   <td>IGXMLScanner::basicAttrValueScan </td></tr>
                  </tbody>
               </table>
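         <para> The Amdahl's Law argument can be made concrete with a small calculation. The
            following Python sketch is an illustration only and is not part of Xerces or icXML:
            the helper name <code>amdahl_speedup</code> is hypothetical, and the time shares are
            copied from <xref linkend="xerces-profile"/> under the assumption that the profiled run
            is representative. It bounds the overall speedup available from accelerating one
            function in isolation. </para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is accelerated by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Share of total execution time per function, copied from the profile table.
profile = {
    "XMLUTF8Transcoder::transcodeFrom": 0.1329,
    "IGXMLScanner::scanCharData": 0.0745,
    "memcpy": 0.0683,
}

# Upper bound per function: its cost is removed entirely (infinite speedup).
bounds = {
    name: amdahl_speedup(share, float("inf")) for name, share in profile.items()
}

for name, bound in bounds.items():
    print(f"{name}: at most {bound:.2f}x overall")
```
]]></programlisting>
         <para> Under these assumptions, even eliminating the most expensive function entirely
            would yield an overall speedup of only about 1.15x, which is consistent with the
            case for restructuring the parser as a whole. </para>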
      </section>
      <section>
         <title>The Parabix Framework</title>
         <para> The Parabix (parallel bit stream) framework is a transformative approach to XML
            parsing (and other forms of text processing). The key idea is to exploit the
            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
            data from long blocks of input data by using one register bit per input byte. To
            facilitate this, the input data is first transposed into a set of basis bit streams.
            For example, <xref linkend="xml-bytes"/> shows the ASCII bytes for the string "<code>b7&lt;A</code>", and
            <xref linkend="xml-bits"/> shows the corresponding 8 basis bit streams, b<subscript>0</subscript> through b<subscript>7</subscript>.
            The bits used to construct b<subscript>7</subscript> have been highlighted in this example.
            Boolean-logic operations (&#8743;, &#8744; and &#172; denote the
            boolean AND, OR and NOT operators) are used to classify the input bits into a set of
               <emphasis role="ital">character-class bit streams</emphasis>, which identify key
            characters (or groups of characters) with a <code>1</code>. For example, one of the
            fundamental characters in XML is a left-angle bracket. A character is an
               <code>&apos;&lt;&apos;</code> if and only if
               &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
               &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>)
               &#8743; &#172;(b<subscript>6</subscript> &#8744;
               b<subscript>7</subscript>) = 1. Similarly, a character is numeric, <code>[0-9]</code>,
               if and only if &#172;(b<subscript>0</subscript> &#8744;
               b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743;
                  b<subscript>3</subscript>) &#8743; &#172;(b<subscript>4</subscript>
               &#8743; (b<subscript>5</subscript> &#8744;
            b<subscript>6</subscript>)) = 1. An important observation here is that ranges of
            characters may require fewer operations than individual characters and
            <!-- the classification cost could be amortized over many character classes.--> multiple
            classes can share the classification cost. </para>
         <table xml:id="xml-bytes">
                  <caption>
                     <para>XML Source Data</para>
                  </caption>
                  <colgroup>
                     <col align="right" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                  </colgroup>
                  <tbody>
  <tr><td>String </td><td> <code>b</code> </td><td> <code>7</code> </td><td> <code>&lt;</code> </td><td> <code>A</code> </td></tr>
  <tr><td>ASCII </td><td> <code>0110001<emphasis role="bold">0</emphasis></code> </td><td> <code>0011011<emphasis role="bold">1</emphasis></code> </td><td> <code>0011110<emphasis role="bold">0</emphasis></code> </td><td> <code>0100000<emphasis role="bold">1</emphasis></code> </td></tr>
                  </tbody>
         </table>
         <table xml:id="xml-bits">
                  <caption>
                     <para>8-bit ASCII Basis Bit Streams</para>
                  </caption>
                  <colgroup>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                  </colgroup>
                  <tbody>
<tr><td> b<subscript>0</subscript> </td><td> b<subscript>1</subscript> </td><td> b<subscript>2</subscript> </td><td> b<subscript>3</subscript></td><td> b<subscript>4</subscript> </td><td> b<subscript>5</subscript> </td><td> b<subscript>6</subscript> </td><td> b<subscript>7</subscript> </td></tr>
 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
                  </tbody>
         </table>
         <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
         <!-- bit streams can be transformed into lexical bit streams, where the presence of -->
         <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
         <!-- process, intra-element well-formedness validation is performed on each block -->
         <!-- of text. -->
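         <para> The transposition and classification steps can be sketched in a few lines of
            Python. This is an illustrative model rather than the Parabix implementation: Python
            integers stand in for SIMD registers, bit <emphasis role="ital">i</emphasis> of each
            stream corresponds to input byte <emphasis role="ital">i</emphasis>, and the function
            names <code>transpose</code>, <code>left_angle</code> and <code>digit</code> are
            hypothetical. The two classifiers are direct transcriptions of the boolean equations
            given above. </para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def transpose(data):
    """Return basis bit streams b0..b7, where b0 holds the most significant bits."""
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << pos
    return basis


def left_angle(b, mask):
    """Character-class stream for '<' (0x3C), one bit per input byte."""
    return ~(b[0] | b[1]) & (b[2] & b[3]) & (b[4] & b[5]) & ~(b[6] | b[7]) & mask


def digit(b, mask):
    """Character-class stream for [0-9] (0x30 to 0x39)."""
    return ~(b[0] | b[1]) & (b[2] & b[3]) & ~(b[4] & (b[5] | b[6])) & mask


def show(stream, length):
    """Render a stream with input position 0 on the left."""
    return "".join("1" if (stream >> i) & 1 else "0" for i in range(length))


data = b"b7<A"
mask = (1 << len(data)) - 1  # limits the NOT operations to the block length
basis = transpose(data)

print("b7     ", show(basis[7], len(data)))
print("'<'    ", show(left_angle(basis, mask), len(data)))
print("[0-9]  ", show(digit(basis, mask), len(data)))
```
]]></programlisting>
         <para> In this model every boolean operation classifies all bytes of the block at once,
            which is the property that a real SIMD implementation exploits with 128-bit
            registers. </para>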
         <para> Consider, for example, the XML source data stream shown in the first line of <xref linkend="derived"/>.
            The remaining lines of this table show
            several parallel bit streams that are computed in Parabix-style parsing, with each bit
            of each stream in one-to-one correspondence to the source character code units of the
            input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are
            represented as underscores. The first bit stream shown is that for the opening angle
            brackets that represent tag openers in XML. The second and third streams classify each
            tag opener as beginning a start tag or an end tag, depending on whether the
            character immediately following the opener is &quot;<code>/</code>&quot;; the mark is
            placed at that following position. The fourth stream marks empty-element tags and is all
            zeros here because the example contains none. The remaining three lines show streams that
            can be computed in subsequent parsing
            (using the technique of bitstream addition <citation linkend="cameron-EuroPar2011"/>), namely streams
            marking the element names, attribute names and attribute values of tags. </para>
            <table xml:id="derived">
                  <caption>
                     <para>XML Source Data and Derived Parallel Bit Streams</para>
                  </caption>
                  <colgroup>
                     <col align="center" valign="top"/>
                     <col align="left" valign="top"/>
                  </colgroup>
                  <tbody>
          <tr><td> Source Data </td><td> <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
          <tr><td> Tag Openers </td><td> <code>1____________1____________________________1____________1__________</code></td></tr>
           <tr><td> Start Tag Marks </td><td> <code>_1____________1___________________________________________________</code></td></tr>
           <tr><td> End Tag Marks </td><td> <code>___________________________________________1____________1_________</code></td></tr>
           <tr><td> Empty Tag Marks </td><td> <code>__________________________________________________________________</code></td></tr>
           <tr><td> Element Names </td><td> <code>_11111111_____1111111_____________________________________________</code></td></tr>
           <tr><td> Attribute Names </td><td> <code>______________________11_______11_________________________________</code></td></tr>
           <tr><td> Attribute Values </td><td> <code>__________________________111________111__________________________</code></td></tr>
                  </tbody>
               </table>
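         <para> Several of the streams in <xref linkend="derived"/> can be reproduced with a short
            Python sketch. It is a simplified model under stated assumptions, not the icXML code:
            Python integers stand in for SIMD registers, the name character class is approximated
            by alphanumeric characters, comments and processing instructions are ignored, and the
            helper names (<code>char_stream</code>, <code>scan_thru</code>, <code>positions</code>,
            <code>show</code>) are hypothetical. The <code>scan_thru</code> helper illustrates the
            idea behind bitstream addition: adding a marker to a run of 1 bits lets the carry move
            the marker to the first position after the run. </para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def char_stream(text, predicate):
    """Bit i of the result is 1 exactly when predicate(text[i]) is true."""
    stream = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            stream |= 1 << i
    return stream


def scan_thru(markers, char_class):
    """Move each marker past the run of char_class bits that starts at it."""
    return (markers + char_class) & ~char_class


def positions(stream):
    """List the positions of the 1 bits, lowest first."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


def show(stream, length):
    """Render a stream in the table's notation: 1 for set bits, _ otherwise."""
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(length))


source = "<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>"

openers = char_stream(source, lambda c: c == "<")
slashes = char_stream(source, lambda c: c == "/")
name_chars = char_stream(source, str.isalnum)  # simplified name class (assumption)

after_opener = openers << 1              # position following each tag opener
end_marks = after_opener & slashes       # opener followed by '/'
start_marks = after_opener & ~slashes    # opener followed by anything else

name_ends = scan_thru(start_marks, name_chars)  # first position after each name
element_names = name_ends - start_marks         # spans from name start to name end

rows = [
    ("Tag Openers", openers),
    ("Start Tag Marks", start_marks),
    ("End Tag Marks", end_marks),
    ("Element Names", element_names),
]
for label, stream in rows:
    print(f"{label:16}{show(stream, len(source))}")
```
]]></programlisting>
         <para> The subtraction in the last step works because each name end lies after its own
            start mark and before the next one, so the borrow fills exactly the positions
            between them. </para>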
         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
            parsing performance. The first is that the use of the full register width offers a
            considerable information advantage over sequential byte-at-a-time parsing. That is,
            sequential processing of bytes uses just 8 bits of each register, greatly limiting the
            processor resources that are effectively being used at any one time. The second is that
            byte-at-a-time scanning loops are often computing just a single bit of
            information per iteration: is the scan complete yet? Rather than computing these
            individual decision bits one by one, an approach that computes many of them in parallel (e.g., 128
            bytes at a time using 128-bit registers) should provide substantial benefit. </para>
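         <para> The second intuition can be illustrated by contrasting two ways of finding the next
            tag opener. The sketch below is a hedged Python model, not code from Xerces or Parabix;
            the function names are hypothetical, and the marker stream is built with an ordinary
            loop only for the sake of a self-contained example (a SIMD implementation would obtain
            it from the character-class computation). The byte-wise scan makes one decision per
            character, whereas the bit-wise scan isolates the lowest remaining 1 bit in a single
            step, in the manner of a processor bit scan instruction. </para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def marker_stream(text, target):
    """Bit i is 1 exactly when text[i] == target."""
    stream = 0
    for i, ch in enumerate(text):
        if ch == target:
            stream |= 1 << i
    return stream


def scan_bytewise(text, start, target):
    """Sequential scan: each iteration answers only 'is the scan complete yet?'."""
    pos = start
    while pos < len(text) and text[pos] != target:
        pos += 1
    return pos if pos < len(text) else -1


def scan_bitwise(stream, start):
    """Bit scan: find the lowest 1 bit at or after `start` without a per-byte loop."""
    remaining = stream >> start
    if remaining == 0:
        return -1
    lowest = remaining & -remaining  # isolates the lowest set bit
    return start + lowest.bit_length() - 1


text = "fee<element>fum</element>"
openers = marker_stream(text, "<")

print(scan_bytewise(text, 0, "<"), scan_bitwise(openers, 0))
print(scan_bytewise(text, 4, "<"), scan_bitwise(openers, 4))
```
]]></programlisting>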
         <para> Previous studies have shown that the Parabix approach improves many aspects of XML
            processing, including transcoding <citation linkend="Cameron2008"/>, character classification and
            validation, tag parsing and well-formedness checking. The first Parabix parser used
            processor bit scan instructions to considerably accelerate sequential scanning loops for
            individual characters <citation linkend="CameronHerdyLin2008"/>. Recent work has incorporated a method
            of parallel scanning using bitstream addition <citation linkend="cameron-EuroPar2011"/>, as well as
            combining SIMD methods with 4-stage pipeline parallelism to further improve throughput
            <citation linkend="HPCA2012"/>. Although these research prototypes handled the full syntax of
            schema-less XML documents, they lacked the functionality required by full XML parsers. </para>
         <para> Commercial XML processors support transcoding of multiple character sets and can
            parse and validate against multiple document vocabularies. Additionally, they provide
            API facilities beyond those found in research prototypes, including the widely used SAX,
            SAX2 and DOM interfaces. </para>
      </section>
      <section>
         <title>Sequential vs. Parallel Paradigm</title>
         <para> Xerces, like all traditional XML parsers, processes XML documents
            sequentially. Each character is examined to distinguish between the XML-specific markup,
            such as a left angle bracket <code>&quot;&lt;&quot;</code>, and the content held within the
            document. As the parser progresses through the document, it alternates between markup
            scanning, validation and content processing modes. </para>
         <para> In other words, Xerces belongs to an equivalence class of applications termed FSM
           applications.<footnote xml:id="FSM">
             <para>Herein FSM applications are considered software systems whose
            behaviour is defined by the inputs, current state and the events associated with
              transitions of states.</para></footnote> Each state transition indicates the processing context of
            subsequent characters. Unfortunately, textual data tends to be unpredictable and any
            character could induce a state transition. </para>
         <para> Parabix-style XML parsers utilize a concept of layered processing. A block of source
            text is transformed into a set of lexical bitstreams, which undergo a series of
            operations that can be grouped into logical layers, e.g., transposition, character
            classification, and lexical analysis. Each layer is pipeline parallel and requires
            neither speculation nor pre-parsing stages <citation linkend="HPCA2012"/>. To meet the API requirements
            of the document-ordered Xerces output, the results of the Parabix processing layers must
            be interleaved to produce the equivalent behaviour. </para>
      </section>
   </section>
   <section xml:id="architecture">
      <title>Architecture</title>
      <section>
         <title>Overview</title>
         <!--\def \CSG{Content Stream Generator}-->
         <para> icXML is more than an optimized version of Xerces. Many components were grouped,
            restructured and rearchitected with pipeline parallelism in mind. In this section, we
            highlight the core differences between the two systems. As shown in
              <xref linkend="xerces-arch"/>, Xerces comprises five main modules: the transcoder, reader,
            scanner, namespace binder, and validator. The <emphasis role="ital"
            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
            the majority of the character set encoding validation is performed as a byproduct of
            this process. The <emphasis role="ital">Reader</emphasis> is responsible for the
            streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current
            line/column position,
            <!--(which is reported in the unlikely event that the input contains an error), -->
            performs line-break normalization and validates context-specific character set issues,
            such as tokenization of qualified names. The <emphasis role="ital">Scanner</emphasis>
            pulls data through the reader and constructs the intermediate representation (IR) of the
            document; it deals with all issues related to entity expansion, validates the XML
            well-formedness constraints and handles any character set encoding issues that cannot be
            completely handled by the reader or transcoder (e.g., surrogate characters, validation
            and normalization of character references). The <emphasis role="ital">Namespace
               Binder</emphasis> is a core piece of the element stack. It handles namespace scoping
            issues between different XML vocabularies. This allows the scanner to
            select the correct schema grammar structures. The <emphasis role="ital">Validator</emphasis>
            takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder)
            and assesses whether the final output matches the user-defined DTD and schema grammar(s)
            before passing it to the end-user. </para>
        <figure xml:id="xerces-arch">
          <title>Xerces Architecture</title>
          <mediaobject>
            <imageobject>
              <imagedata format="png" fileref="xerces.png" width="155cm"/>
            </imageobject>
          </mediaobject>
          <caption>
          </caption>
        </figure>
460         <para> In icXML, functions are grouped into logical components. As shown in
461             <xref linkend="icxml-arch"/>, two major categories exist: (1) the Parabix Subsystem and (2) the
462               Markup Processor. All tasks in (1) use the Parabix Framework <citation linkend="HPCA2012"/>, which
463            represents data as a set of parallel bitstreams. The <emphasis role="ital">Character Set
464              Adapter</emphasis>, discussed in <xref linkend="character-set-adapter"/>, mirrors
465            Xerces's Transcoder duties; however, instead of producing UTF-16, it produces a set of
466              lexical bitstreams, similar to those shown in <xref linkend="CameronHerdyLin2008"/>. These lexical
467            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
468            additional processing is performed. The first precursor to producing UTF-16 is the
469               <emphasis role="ital">Parallel Markup Parser</emphasis> phase. It takes the lexical
470            streams and produces a set of marker bitstreams in which a 1-bit identifies significant
471            positions within the input data. One bitstream is created for each critical piece of
472            information, such as the beginning and ending of start tags, end tags,
473            element names, attribute names, attribute values and content. Intra-element
474            well-formedness validation is performed as an artifact of this process. Like Xerces,
475            icXML must provide the Line and Column position of each error. The <emphasis role="ital"
476               >Line-Column Tracker</emphasis> uses the lexical information to keep track of the
477            document position(s) through the use of an optimized population count algorithm,
478              described in <xref linkend="errorhandling"/>. From here, two data-independent
479            branches exist: the Symbol Resolver and Content Preparation Unit. </para>
480         <para> A typical XML file contains few unique element and attribute names&#8212;but
481            each of them will occur frequently. icXML stores these as distinct data structures,
482            called symbols, each with its own global identifier (GID). Using the symbol marker
483            streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol
484               Resolver</emphasis> scans through the raw data to produce a sequence of GIDs, called
485            the <emphasis role="ital">symbol stream</emphasis>. </para>
486         <para> The final components of the Parabix Subsystem are the <emphasis role="ital">Content
487               Preparation Unit</emphasis> and <emphasis role="ital">Content Stream
488            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
489            filters them, according to the information provided by the Parallel Markup Parser, and
490            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital">content stream</emphasis>, discussed in <xref linkend="contentstream"/>. </para>
491         <para> Combined, the symbol and content streams form icXML's compressed IR of the XML
492            document. The <emphasis role="ital">Markup Processor</emphasis>
493            parses the IR to
494            validate and produce the sequential output for the end user. The <emphasis role="ital"
495               >Final WF checker</emphasis> performs inter-element well-formedness validation that
496            would be too costly to perform in bit space, such as ensuring every start tag has a
497            matching end tag. Xerces's namespace binding functionality is replaced by the <emphasis
498               role="ital">Namespace Processor</emphasis>. Unlike Xerces, it is a discrete phase
499            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
500               stream</emphasis>, which are associated with each symbol occurrence. This is
501                 discussed in <xref linkend="namespace-handling"/>. Finally, the <emphasis
502               role="ital">Validation</emphasis> layer implements Xerces's validator. However,
503            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
504        <figure xml:id="icxml-arch">
505          <title>icXML Architecture</title>
506          <mediaobject>
507            <imageobject>
508              <imagedata format="png" fileref="icxml.png" width="500cm"/>
509            </imageobject>
510          </mediaobject>
511          <caption>
512          </caption>
513        </figure>
514      </section>
515      <section xml:id="character-set-adapter">
516         <title>Character Set Adapters</title>
517         <para> In Xerces, all input is transcoded into UTF-16 to simplify parsing within
518            Xerces itself and provide the end-consumer with a single encoding format. In the
519            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
520            because of the need to decode and classify each byte of input, mapping variable-length
521            UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In
522            other cases, transcoding may involve table look-up operations for each byte of input. In
523            any case, transcoding imposes at least a cost of buffer copying. </para>
524         <para> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
525            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
526            that input code units represent valid characters, mapping the characters of the encoding
527            into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical
528            item streams), as well as supporting ultimate transcoding requirements. All of this work
529            is performed using the parallel bitstream representation of the source input. </para>
530         <para> An important observation is that many character sets are an extension to the legacy
531            7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8,
532            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
533            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
534            serves to compute lexical item streams for all such ASCII-based character sets. </para>
535         <para> A second observation is that&#8212;regardless of which character set is
536            used&#8212;quite often all of the characters in a particular block of input will be
537            within the ASCII range. This is a very simple test to perform using the bitstream
538            representation, simply confirming that the bit 0 stream is zero for the entire block.
539            For blocks satisfying this test, all logic dealing with non-ASCII characters can simply
540            be skipped. Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
541            UTF-16 form are each set to zero in this case. </para>
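         <para> As a rough illustration of this test (a sketch, not icXML code), the following
            Python models each of the eight basis bitstreams as an arbitrary-precision integer,
            taking bit 0 to be the most significant bit of each byte. The function names are
            hypothetical and exist only for this example. </para>

```python
def transpose(block: bytes) -> list[int]:
    """Return 8 basis bitstreams; stream k holds bit k (0 = MSB) of every byte.

    Input position i is stored in bit i of each Python integer.
    """
    streams = [0] * 8
    for pos, byte in enumerate(block):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


def is_ascii_block(streams: list[int]) -> bool:
    """A block is pure ASCII exactly when the bit 0 stream is all zero."""
    return streams[0] == 0


def ascii_block_to_utf16(streams: list[int]) -> list[int]:
    """For an ASCII block, the 16 UTF-16 bitstreams are 8 zero streams
    (the high byte) followed by the 8 unchanged basis streams (the low byte)."""
    if not is_ascii_block(streams):
        raise ValueError("block contains non-ASCII bytes")
    return [0] * 8 + list(streams)
```

         <para> In a real SIMD implementation the same check would be a single comparison of one
            register against zero per block, rather than a test on a Python integer. </para>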
542         <para> A third observation is that repeated transcoding of the names of XML elements,
543            attributes and so on can be avoided by using a look-up mechanism. That is, the first
544            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
545            numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up
546            operations can avoid transcoding by simply retrieving the stored representation. As
547            symbol look-up is required to apply various XML validation rules, this achieves the
548            effect of transcoding each occurrence without additional cost. </para>
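         <para> The look-up idea can be sketched as follows. This is a minimal model under stated
            assumptions: the class and attribute names are hypothetical, a Python string stands in
            for the stored UTF-16 form, and a dictionary stands in for whatever hashing scheme
            icXML actually uses. </para>

```python
class SymbolTable:
    """Maps raw (untranscoded) names to numeric symbol IDs, transcoding once."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding
        self.ids: dict[bytes, int] = {}   # raw name bytes -> symbol ID
        self.names: list[str] = []        # symbol ID -> transcoded name
        self.transcode_count = 0          # how many names were actually transcoded

    def resolve(self, raw_name: bytes) -> int:
        symbol_id = self.ids.get(raw_name)
        if symbol_id is None:
            # First occurrence: transcode and remember the result.
            symbol_id = len(self.names)
            self.ids[raw_name] = symbol_id
            self.names.append(raw_name.decode(self.encoding))
            self.transcode_count += 1
        return symbol_id
```

         <para> Resolving the names <code>book</code>, <code>title</code>, <code>title</code>,
            <code>book</code> in that order yields the ID sequence 0, 1, 1, 0 while transcoding
            only two names. </para>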
549         <para> The cost of individual character transcoding is avoided whenever a block of input is
550            confined to the ASCII subset and for all but the first occurrence of any XML element or
551            attribute name. Furthermore, when transcoding is required, the parallel bitstream
552            representation supports efficient transcoding operations. In the important case of UTF-8
553            to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit
554              parallel fashion based on UTF-8 streams <citation linkend="Cameron2008"/>, and all but the final bytes
555            of multi-byte sequences can be marked for deletion as discussed in the following
556            subsection. In other cases, transcoding within a block only need be applied for
557            non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream
558            using bit scan operations. </para>
559      </section>
560      <section xml:id="par-filter">
561         <title>Combined Parallel Filtering</title>
562         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
563            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
564            Chinese characters <code>&#x4F60;&#x597D;</code> are represented as two
565            three-byte UTF-8 sequences <code>E4 BD A0</code> and <code>E5 A5 BD</code> while the
566            UTF-16 representation must be compressed down to the two code units <code>4F60</code>
567            and <code>597D</code>. In the bit parallel representation, this corresponds to a
568            reduction from six bit positions representing UTF-8 code units (bytes) down to just two
569            bit positions representing UTF-16 code units (double bytes). This compression may be
570            achieved by arranging to calculate the correct UTF-16 bits at the final position of each
571            sequence and creating a deletion mask to mark the first two bytes of each 3-byte
572            sequence for deletion. In this case, the portion of the mask corresponding to these
573            input bytes is the bit sequence <code>110110</code>. Using this approach, transcoding
574            may then be completed by applying parallel deletion and inverse transposition of the
575            UTF-16 bitstreams <citation linkend="Cameron2008"/>. </para>
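         <para> The deletion mask for this example can be reproduced with a short sequential
            sketch. It assumes valid UTF-8 input and covers only sequences of up to three bytes;
            four-byte sequences, which need two surviving positions for a surrogate pair, are
            deliberately left out. The function name is hypothetical. </para>

```python
def utf8_deletion_mask(data: bytes) -> str:
    """Mark with '1' every byte that is not the last byte of its UTF-8 sequence.

    A byte is the last of its sequence when the following byte is not a
    continuation byte (10xxxxxx) or when it is the final byte of the input.
    """
    mask = []
    for i in range(len(data)):
        next_is_continuation = (
            i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80
        )
        mask.append("1" if next_is_continuation else "0")
    return "".join(mask)


# The two characters U+4F60 U+597D from the text: E4 BD A0 E5 A5 BD.
example = "\u4f60\u597d".encode("utf-8")
example_mask = utf8_deletion_mask(example)
```

         <para> For the six example bytes the mask has two unmarked positions, one for each
            UTF-16 code unit that survives deletion. </para>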
576         <para> Rather than immediately paying the costs of deletion and transposition just for
577            transcoding, however, icXML defers these steps so that the deletion masks for several
578            stages of processing may be combined. In particular, this includes core XML requirements
579            to normalize line breaks and to replace character references and entity references with
580            their corresponding text. In the case of line break normalization, all forms of line
581            breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations
582            must be normalized to a single LF character in each case. In icXML, this is achieved by
583            first marking CR positions, performing two bit parallel operations to transform the
584            marked CRs into LFs, and then marking for deletion any LF that is found immediately
585            after the marked CR as shown by the Pablo source code in
586              <xref  linkend="fig-LBnormalization"/>.
587              <figure xml:id="fig-LBnormalization">
588                <caption>Line Break Normalization Logic</caption>
589  <programlisting>
590# XML 1.0 line-break normalization rules.
591if lex.CR:
592  # Modify CR (#x0D) to LF (#x0A)
593  u16lo.bit_5 ^= lex.CR
594  u16lo.bit_6 ^= lex.CR
595  u16lo.bit_7 ^= lex.CR
596  CRLF = pablo.Advance(lex.CR) &amp; lex.LF
597  callouts.delmask |= CRLF
598  # Adjust LF streams for line/column tracker
599  lex.LF |= lex.CR
600  lex.LF ^= CRLF
601</programlisting>
602              </figure>
603         </para>
604         <para> In essence, the deletion masks for transcoding and for line break normalization each
605            represent a bitwise filter; these filters can be combined using bitwise-or so that the
606            parallel deletion algorithm need only be applied once. </para>
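         <para> The line-break logic and the combination of filters can be simulated in Python,
            using integers as bitstreams with input position i held in bit i, so that the Pablo
            <code>Advance</code> operation becomes a left shift by one. This is a sketch over raw
            bytes rather than the UTF-16 low-byte streams used in the listing, and the function
            names are hypothetical. </para>

```python
def bitstream(data: bytes, byte: int) -> int:
    """Bitstream with a 1 at every position holding the given byte value."""
    return sum(1 << i for i, b in enumerate(data) if b == byte)


def normalize_line_breaks(data: bytes, delmask: int = 0) -> tuple[bytes, int]:
    """Rewrite CRs to LFs and OR the CR-LF deletions into an existing mask."""
    cr = bitstream(data, 0x0D)
    lf = bitstream(data, 0x0A)
    crlf = (cr << 1) & lf      # Advance(CR) & LF: an LF directly after a CR
    delmask |= crlf            # combine with filters from earlier stages
    # 0x0D ^ 0x07 == 0x0A, matching the three single-bit XORs in the listing.
    rewritten = bytes(b ^ 0x07 if (cr >> i) & 1 else b for i, b in enumerate(data))
    return rewritten, delmask


def apply_deletion(data: bytes, delmask: int) -> bytes:
    """Sequential stand-in for parallel deletion: drop every marked position."""
    return bytes(b for i, b in enumerate(data) if not (delmask >> i) & 1)
```

         <para> Because the mask is only accumulated here, a mask produced by an earlier stage
            (for example, the transcoding mask) can be passed in as <code>delmask</code> and the
            deletion applied once at the end. </para>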
607         <para> A further application of combined filtering is the processing of XML character and
608           entity references. Consider, for example, the references <code><![CDATA[&amp;]]></code> or
609             <code><![CDATA[&#x3C;]]></code> which must be replaced in XML processing with the single
610               <code>&amp;</code> and <code>&lt;</code> characters, respectively. The
611            approach in icXML is to mark all but the first character positions of each reference for
612            deletion, leaving a single character position unmodified. Thus, for the references
613               <code><![CDATA[&amp;]]></code> or <code><![CDATA[&#x3C;]]></code> the masks <code>01111</code> and
614               <code>011111</code> are formed and combined into the overall deletion mask. After the
615            deletion and inverse transposition operations are finally applied, a post-processing
616            step inserts the proper character at these positions. One note about this process is
617            that it is speculative; references are assumed to generally be replaced by a single
618            UTF-16 code unit. When this is not the case, the difference is addressed in
619            post-processing. </para>
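         <para> The masks quoted above can be reproduced with the following sketch. It is an
            illustration only: the regular expression is a simplification of the XML reference
            grammar, only the five predefined entities are known, Python code points stand in for
            UTF-16 code units, and all names are hypothetical. </para>

```python
import re

PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}
REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def reference_deletion_mask(text: str) -> str:
    """Mark all but the first character of every reference for deletion."""
    mask = ["0"] * len(text)
    for match in REFERENCE.finditer(text):
        for i in range(match.start() + 1, match.end()):
            mask[i] = "1"
    return "".join(mask)


def replacement(body: str) -> str:
    """Character denoted by a reference body such as 'amp' or '#x3C'."""
    if body.startswith("#x"):
        return chr(int(body[2:], 16))
    if body.startswith("#"):
        return chr(int(body[1:]))
    return PREDEFINED[body]  # KeyError for entities this sketch does not know


def expand_references(text: str) -> str:
    """Delete masked positions, then patch each surviving placeholder."""
    mask = reference_deletion_mask(text)
    patches = {m.start(): replacement(m.group(1)) for m in REFERENCE.finditer(text)}
    return "".join(
        patches.get(i, ch) for i, ch in enumerate(text) if mask[i] == "0"
    )
```

         <para> The surviving first position of each reference acts as a placeholder that the
            post-processing step overwrites with the replacement character. </para>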
620         <para> The final step of combined filtering occurs during the process of reducing markup
621            data to tag bytes preceding each significant XML transition as described in
622              <xref linkend="contentstream"/>. Overall, icXML avoids separate buffer copying
623            operations for each of these filtering steps, paying the cost of parallel deletion
624            and inverse transposition only once. Currently, icXML employs the parallel-prefix
625            compress algorithm of Steele <citation linkend="HackersDelight"/>. Performance is independent of the
626            number of positions deleted. Future versions of icXML are expected to take advantage of
627            the parallel extract operation <citation linkend="HilewitzLee2006"/> that Intel is now providing in its
628            Haswell architecture. </para>
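         <para> For readers unfamiliar with the technique, the following is a Python
            transliteration of the 32-bit parallel-prefix compress routine as it is commonly
            published. It is a sketch for a single 32-bit word, not icXML's SIMD implementation.
            Note that the routine takes a mask of positions to keep, so a deletion mask must be
            inverted first; the <code>delete</code> helper and the bit-by-bit reference version
            are additions for illustration. </para>

```python
MASK32 = 0xFFFFFFFF


def compress(x: int, m: int) -> int:
    """Gather the bits of x selected by mask m into the low-order end."""
    x &= m
    mk = (~m << 1) & MASK32          # count zeros to the right of each bit
    for i in range(5):               # log2(32) rounds
        mp = mk ^ ((mk << 1) & MASK32)   # parallel prefix (running parity)
        mp ^= (mp << 2) & MASK32
        mp ^= (mp << 4) & MASK32
        mp ^= (mp << 8) & MASK32
        mp ^= (mp << 16) & MASK32
        mv = mp & m                      # bits that move in this round
        m = (m ^ mv) | (mv >> (1 << i))  # compress the mask
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))    # compress the data
        mk &= ~mp
    return x


def delete(x: int, delmask: int) -> int:
    """Remove the positions marked in delmask, closing the gaps."""
    return compress(x, ~delmask & MASK32)


def compress_reference(x: int, m: int) -> int:
    """Bit-at-a-time version used only to check the routine above."""
    out, k = 0, 0
    for i in range(32):
        if (m >> i) & 1:
            out |= ((x >> i) & 1) << k
            k += 1
    return out
```

         <para> The loop always runs the same five rounds, which is why the cost does not depend
            on how many positions are deleted. </para>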
629      </section>
630      <section xml:id="contentstream">
631         <title>Content Stream</title>
632         <para> A relatively unique concept in icXML is the use of a filtered content stream.
633            Rather than parsing an XML document in its original format, the input is transformed
634            into one that is easier for the parser to iterate through and produce the sequential
635            output. In <xref  linkend="fig-parabix2"/>, the source data
636             <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]></code>
637             is transformed into <code><emphasis role="ital">0</emphasis><![CDATA[>fee]]><emphasis role="ital">0</emphasis><![CDATA[=fie]]><emphasis role="ital">0</emphasis><![CDATA[=foe]]><emphasis role="ital">0</emphasis><![CDATA[>]]><emphasis role="ital">0</emphasis><![CDATA[/fum]]><emphasis role="ital">0</emphasis><![CDATA[/]]></code>
638            through the parallel filtering algorithm, described in <xref linkend="par-filter"/>. </para>
639              <table xml:id="fig-parabix2">
640                        <caption>XML Source Data and Derived Parallel Bit Streams</caption>
641                  <colgroup>
642                     <col align="centre" valign="top"/>
643                     <col align="left" valign="top"/>
644                  </colgroup>
645                  <tbody>
646          <tr><td> Source Data </td><td>
647                                    <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
648               <tr><td> String Ends </td><td> <code>1____________1_______________1__________1_1____________1__________</code></td></tr>
649<tr><td> Markup Identifiers </td><td>         <code>_________1______________1_________1______1_1____________1_________</code></td></tr>
650<tr><td> Deletion Mask </td><td>              <code>_11111111_____1111111111_1____1111_11_______11111111_____111111111</code></td></tr>
651<tr><td> Undeleted Data </td><td> <code><emphasis role="ital">0</emphasis>________&gt;fee<emphasis role="ital">0</emphasis>__________=_fie<emphasis role="ital">0</emphasis>____=__foe<emphasis role="ital">0</emphasis>><emphasis role="ital">0</emphasis>/________fum<emphasis role="ital">0</emphasis>/_________</code></td></tr>
652                  </tbody>
653               </table>
655         <para> Combined with the symbol stream, the parser traverses the content stream to
656            effectively reconstruct the input document in its output form. The initial <emphasis
657               role="ital">0</emphasis> indicates an empty content string. The following
658               <code>&gt;</code> indicates that a start tag without any attributes is the first
659            element in this text and the first unused symbol, <code>document</code>, is the element
660            name. Succeeding that is the content string <code>fee</code>, which is null-terminated
661            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
662            operations are required to produce these strings, which, as
663              <xref linkend="xerces-profile"/> shows, account for 6.83% of Xerces's execution time.
664            Additionally, it is cheap to locate the terminal character of each string: using the
665            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
666            null character in the content stream in parallel, which in turn means the parser can
667            directly jump to the end of every string without scanning for it. </para>
668         <para> Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the
669            existence of an attribute. Because all of the intra-element validation was performed in the Parabix
670            Subsystem, this must be a legal attribute. Since attributes can only occur within start
671            tags and must be accompanied by a textual value, the next symbol in the symbol stream
672            must be the element name of a start tag, and the following one must be the name of the
673            attribute and the string that follows the <code>=</code> must be its value. However, the
674            subsequent <code>=</code> is not treated as an independent attribute because the parser
675            has yet to read a <code>&gt;</code>, which marks the end of a start tag. Thus only
676            one symbol is taken from the symbol stream and it (along with the string value) is added
677            to the element. Eventually the parser reaches a <code>/</code>, which marks the
678            existence of an end tag. Every end tag requires an element name, which means it
679            requires a symbol. Inter-element validation is performed whenever an empty tag is detected to ensure
680            that the appropriate scope-nesting rules have been applied. </para>
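         <para> The traversal just described can be sketched as a small loop over the content and
            symbol streams. This is a simplified model, not the icXML Markup Processor: it uses a
            Python string with NUL terminators for the content stream, a list of names for the
            symbol stream, scans for each terminator rather than jumping to a precomputed offset,
            and performs no validation. The event tuples and function name are hypothetical. </para>

```python
def walk(content: str, symbols: list[str]) -> list[tuple]:
    """Replay a content stream and symbol stream as a list of parse events."""
    events = []
    next_symbol = iter(symbols)
    pos = 0

    def read_string() -> str:
        nonlocal pos
        end = content.index("\0", pos)   # strings are NUL-terminated
        text = content[pos:end]
        pos = end + 1
        return text

    def read_content() -> None:
        if pos < len(content):
            text = read_string()
            if text:                     # an empty string produces no event
                events.append(("text", text))

    read_content()
    while pos < len(content):
        tag = content[pos]
        if tag == ">":                   # start tag without attributes
            pos += 1
            events.append(("start", next(next_symbol), []))
        elif tag == "=":                 # start tag with one or more attributes
            name = next(next_symbol)
            attrs = []
            while content[pos] == "=":
                pos += 1
                attr_name = next(next_symbol)
                attrs.append((attr_name, read_string()))
            pos += 1                     # skip the '>' that closes the start tag
            events.append(("start", name, attrs))
        elif tag == "/":                 # end tag
            pos += 1
            events.append(("end", next(next_symbol)))
        else:
            raise ValueError(f"unexpected tag byte {tag!r} at offset {pos}")
        read_content()
    return events
```

         <para> Applied to the content stream of the example above together with the symbol
            stream <code>document</code>, <code>element</code>, <code>a1</code>, <code>a2</code>,
            <code>element</code>, <code>document</code>, the loop yields a start event for
            <code>document</code>, the text <code>fee</code>, a start event for
            <code>element</code> with its two attributes, the matching end event, the text
            <code>fum</code>, and the final end event. </para>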
681      </section>
682      <section xml:id="namespace-handling">
683         <title>Namespace Handling</title>
684         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
685         <para> In XML, namespaces prevent naming conflicts when multiple vocabularies are used
686            together. They are especially important when a vocabulary has an application-dependent meaning,
687            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
688            to uniform resource identifiers (URIs), which are strings used to identify specific
689            names or resources. On line 1 in <xref linkend="namespace-ex"/>, the <code>xmlns</code>
690            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
691               &apos;<code></code>&apos; and the default (empty) prefix to
692               <code></code>. Thus to the XML processor, the <code>title</code> on line 2
693            and <code>price</code> on line 4 both read as
694            <code>&quot;;:title</code> and
695               <code>&quot;;:price</code> respectively, whereas on line 3 and
696            5, <code>p:name</code> and <code>price</code> are seen as
697               <code>&quot;;:name</code> and
698               <code>&quot;;:price</code>. Even though the actual element name is
699               <code>price</code> in both cases, due to namespace scoping rules they are viewed as two
700            uniquely-named items because the current vocabulary is determined by the namespace(s)
701            that are in-scope. </para>
702<table xml:id="namespace-ex">
703                  <caption>
704                     <para>XML Namespace Example</para>
705                  </caption>
706                  <colgroup>
707                     <col align="centre" valign="top"/>
708                     <col align="left" valign="top"/>
709                  </colgroup>
710                  <tbody>
711 <tr><td>1. </td><td><![CDATA[<book xmlns:p="" xmlns="">]]> </td></tr>
712 <tr><td>2. </td><td><![CDATA[  <title>BOOK NAME</title>]]> </td></tr>
713 <tr><td>3. </td><td><![CDATA[  <p:name>PUBLISHER NAME</p:name>]]> </td></tr>
714 <tr><td>4. </td><td><![CDATA[  <price>X</price>]]> </td></tr>
715 <tr><td>5. </td><td><![CDATA[  <price xmlns="">Y</price>]]> </td></tr>
716 <tr><td>6. </td><td><![CDATA[</book>]]> </td></tr>
717                  </tbody>
718               </table>         
720         <para> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
721            persist for the lifetime of the application through the use of a global URI pool. Xerces
722            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
723            (end tag) occurs in the document. Because a namespace declaration affects the entire
724            element, it must be processed prior to grammar validation. This is a costly process
725            considering that a typical namespaced XML document only comes in one of two forms: (1)
726            those that declare a set of namespaces upfront and never change them, and (2) those that
727            repeatedly modify the namespaces in predictable patterns. </para>
728         <para> For that reason, icXML contains an independent namespace stack and utilizes bit
729            vectors to cheaply perform scope resolution. <!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
730            <!-- performance advantage figure?? average cycles/byte cost? --> When a prefix is
731            declared (e.g., <code>xmlns:p=&quot;;</code>), a namespace binding
732            is created that maps the prefix (which is assigned a Prefix ID in the symbol resolution
733            process) to the URI. Each unique namespace binding has a unique namespace id (NSID) and
734            every prefix contains a bit vector marking every NSID that has ever been associated with
735              it within the document. For example, in <xref linkend="namespace-ex"/>, the prefix binding
736            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
737            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
738            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
739            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
740            found using a bit-scan intrinsic. A namespace binding table, similar to
741            <xref linkend="namespace-binding"/>, provides the actual URI ID. </para>
742<table xml:id="namespace-binding">
743                  <caption>
744                     <para>Namespace Binding Table Example</para>
745                  </caption>
746                  <colgroup>
747                     <col align="centre" valign="top"/>
748                     <col align="centre" valign="top"/>
749                     <col align="centre" valign="top"/>
750                     <col align="centre" valign="top"/>
751                     <col align="centre" valign="top"/>
752                   </colgroup>
753                   <thead>
754                     <tr><th>NSID </th><th> Prefix </th><th> URI </th><th> Prefix ID </th><th> URI ID </th>
755                     </tr>
756                   </thead>
757                  <tbody>
758<tr><td>0 </td><td> <code> p</code> </td><td> <code></code> </td><td> 0 </td><td> 0 </td></tr> 
759 <tr><td>1 </td><td> <code> xmlns</code> </td><td> <code></code> </td><td> 1 </td><td> 1 </td></tr> 
760 <tr><td>2 </td><td> <code> xmlns</code> </td><td> <code></code> </td><td> 1 </td><td> 0 </td></tr> 
761                  </tbody>
762               </table>         
763         <para>
764            <!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
765            <!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
766            <!-- NSid = bitscan(VisiblePrefixBinding); -->
767            <!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
768         </para>
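         <para> The look-up described above can be sketched as follows, using the NSID, Prefix ID
            and URI ID values from the binding table. The bit layout is an assumption made for
            this example (NSID n is held in bit n of each vector), as are the names; the sketch
            also assumes that at most one binding per prefix is visible at any time. </para>

```python
# URI ID for each NSID, taken from the namespace binding table (index = NSID).
URI_ID_BY_NSID = [0, 1, 0]

# For each Prefix ID, the set of NSIDs ever bound to it, as a bit vector.
# Prefix ID 0 (p) -> NSID 0; Prefix ID 1 (xmlns) -> NSIDs 1 and 2.
PREFIX_BINDINGS = {0: 0b001, 1: 0b110}


def resolve(prefix_id: int, visible: int):
    """Return the URI ID of the in-scope binding for a prefix, or None."""
    candidates = PREFIX_BINDINGS[prefix_id] & visible
    if candidates == 0:
        return None                      # prefix is not bound in this scope
    # Bit scan: index of the lowest set bit.
    nsid = (candidates & -candidates).bit_length() - 1
    return URI_ID_BY_NSID[nsid]
```

         <para> With NSIDs 0 and 1 visible, the default prefix resolves to URI ID 1; once NSID 2
            replaces NSID 1 in the visible set, the same prefix resolves to URI ID 0. </para>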
769         <para> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
770            any modification to the currently visible namespaces is calculated and stored within a
771            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
772            is found, the vector of currently visible namespaces is XORed with the vector at the top of the
773            stack. This allows any number of changes to be performed at each scope-level in
774            constant time.
775            <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
776         </para>
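         <para> A minimal sketch of this scope stack is given below. It assumes that each start
            tag supplies a change vector with a 1-bit for every NSID that it makes visible or
            hides, and that the vector is popped when the end tag is processed; the class and
            method names are hypothetical. </para>

```python
class NamespaceScopes:
    """Tracks the currently visible NSIDs with one XOR per tag."""

    def __init__(self, visible: int = 0):
        self.visible = visible           # bit n set <=> NSID n is in scope
        self.changes: list[int] = []     # one change vector per open element

    def start_tag(self, change: int = 0) -> None:
        # Toggle every NSID that this element shows or hides, and remember it.
        self.visible ^= change
        self.changes.append(change)

    def end_tag(self) -> None:
        # XOR with the same vector restores the enclosing scope exactly.
        self.visible ^= self.changes.pop()
```

         <para> Since XOR is its own inverse, undoing a scope costs the same whether the element
            changed one binding or many. </para>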
777      </section>
778      <section xml:id="errorhandling">
779         <title>Error Handling</title>
780         <para>
781            <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
782            Xerces outputs error messages in two ways: through the programmer API and as thrown
783            objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to
784            assess whether the next character is legal; if not, the current state determines the
785            type and severity of the error. icXML emits errors in a similar manner, but
786            how it discovers them is substantially different. Recall that in Figure
787            <xref linkend="icxml-arch"/>, icXML is divided into two sections: the Parabix Subsystem and
788            Markup Processor, each with its own system for detecting and producing error messages. </para>
789         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
790            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
791            the byte-position of an error within a block, and the type of error is determined by the
792            equation that discovered it. The difficulty of error processing in this section is that
793            in Xerces the line and column number must be given with every error production. Two
794            major issues exist because of this: (1) line position adheres to XML white-space normalization
795            rules; as such, some sequences of characters, e.g., a carriage return followed by a line
796            feed, are counted as a single new line character. (2) column position is counted in
797            characters, not bytes or code units; thus multi-code-unit code-points and surrogate
798            character pairs are all counted as a single column position. Note that typical XML
799            documents are error-free but the calculation of the line/column position is a constant
800            overhead in Xerces. <!-- that must be maintained in the case that one occurs. --> To
801            reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence
802            of the error and performs the minimal amount of book-keeping necessary to facilitate it.
803            icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates
804            the information within the Line Column Tracker (LCT). One of the CSA's major
805            responsibilities is transcoding an input text.
806            <!-- from some encoding format to near-output-ready UTF-16. --> During this process,
807            white-space normalization rules are applied and multi-code-unit and surrogate characters
808            are detected and validated. A <emphasis role="ital">line-feed bitstream</emphasis>,
809            which marks the positions of the normalized new line characters, is a natural
810            derivative of this process. Using an optimized population count algorithm, the line
811            count can be summarized cheaply for each valid block of text.
812            <!-- The optimization delays the counting process .... --> Column position is more
813            difficult to calculate. It is possible to scan backwards through the bitstream of new
814            line characters to determine the distance (in code-units) between the position at
815            which an error was detected and the last line feed. However, this distance may exceed
816            the actual character position for the reasons discussed in (2). To handle this, the
817            CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together
818            many relevant bitstreams, such as all trailing multi-code-unit and surrogate characters,
819            and any characters that were removed during the normalization process. When an error is
820            detected, the sum of those skipped positions is subtracted from the distance to
821            determine the actual column number. </para>
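         <para> The calculation can be sketched with integers standing in for the line-feed and
            skip mask bitstreams. The conventions here are assumptions made for the example:
            code-unit position i is held in bit i, lines and columns are numbered from 1, and the
            error position itself is not a skipped code unit. The function names are
            hypothetical. </para>

```python
def popcount(x: int) -> int:
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")


def line_column(lf: int, skip: int, pos: int) -> tuple[int, int]:
    """Line and column of code-unit position pos, computed only on demand."""
    before = (1 << pos) - 1                  # all positions ahead of the error
    prior_lfs = lf & before
    line = 1 + popcount(prior_lfs)
    line_start = prior_lfs.bit_length()      # position just past the last LF
    on_line = before & ~((1 << line_start) - 1)
    distance = pos - line_start              # distance in code units
    column = 1 + distance - popcount(skip & on_line)
    return line, column
```

         <para> For the seven UTF-8 bytes of <code>ab</code>, a line feed, one three-byte
            character and <code>x</code>, the line-feed stream has a 1 at position 2 and the skip
            mask has 1s at positions 4 and 5; an error at position 6 is therefore reported at
            line 2, column 2. </para>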
822         <para> The Markup Processor is a state-driven machine. As such, error detection within it
823            is very similar to Xerces. However, reporting the correct line/column is a much more
824            difficult problem. The Markup Processor parses the content stream, which is a series of
825            tagged UTF-16 strings. Each string is normalized in accordance with the XML
826            specification. All symbol data and unnecessary whitespace is eliminated from the stream;
827            thus it is impossible to derive the current location using only the content stream. To
828            calculate the location, the Markup Processor borrows three additional pieces of
829            information from the Parabix Subsystem: the line-feed, skip mask, and a <emphasis
830               role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the
831            (code-unit) position of every datum that was suppressed from the source during the
832            production of the content stream. Armed with these, it is possible to calculate the
833            actual line/column using the same system as the Parabix Subsystem until the sum of the
834            negated deletion mask stream is equal to the current position. </para>
835      </section>
836   </section>
838   <section xml:id="multithread">
839      <title>Multithreading with Pipeline Parallelism</title>
840      <para> As discussed in section <xref linkend="background-xerces"/>, Xerces can be considered an FSM
841         application. Such applications are &quot;embarrassingly
842         sequential&quot; <citation linkend="Asanovic-EECS-2006-183"/> and notoriously difficult to
843         parallelize. However, icXML is designed to organize processing into logical layers. In
844         particular, layers within the Parabix Subsystem are designed to operate over significant
845         segments of input data before passing their outputs on for subsequent processing. This fits
846         well into the general model of pipeline parallelism, in which each thread is in charge of a
847         single module or group of modules. </para>
      <para> The most straightforward division of work in icXML is to place the Parabix Subsystem
         and the Markup Processor in two separate pipeline stages. The
         resultant application, <emphasis role="ital">icXML-p</emphasis>, is a coarse-grained
         software-pipeline application. In this case, the Parabix Subsystem thread
               <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a
         time and produces the content, symbol and URI streams, then stores them in a pre-allocated
         shared data structure <code>S</code>. The Markup Processor thread
            <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness
         and grammar-based validation, and then provides parsed XML data to the application through
         the Xerces API. The shared data structure is implemented using a ring buffer, where every
         entry contains an independent set of data streams. In the examples of
           <xref linkend="threads_timeline1"/>, the ring buffer has four entries. A
         lock-free mechanism is applied to ensure that each entry can only be read or written by one
         thread at a time. In <xref linkend="threads_timeline1"/> the processing time of
862               <code>T<subscript>1</subscript></code> is longer than
863         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
864         waits for <code>T<subscript>1</subscript></code> to write to the shared memory. 
865         <xref linkend="threads_timeline2"/> illustrates the scenario in which
866         <code>T<subscript>1</subscript></code> is faster and must wait for
867            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
868         reuse the memory space. </para>
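      <para> The hand-off between <code>T<subscript>1</subscript></code> and
         <code>T<subscript>2</subscript></code> can be sketched as a single-producer,
         single-consumer ring. The text above does not spell out the lock-free mechanism used in
         icXML-p, so the pair of atomic counters below is an assumption made for illustration, and
         <code>StreamSet</code>, <code>StreamRing</code>, <code>tryPublish</code> and
         <code>tryConsume</code> are hypothetical names rather than icXML identifiers. </para>
      <programlisting xml:space="preserve"><![CDATA[
```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for one ring entry: an independent set of data streams.
struct StreamSet {
    std::string content;
    std::string symbols;
    std::string uris;
};

// Single-producer/single-consumer ring with N entries.
// Exactly one thread may call tryPublish and exactly one may call tryConsume.
template <std::size_t N>
class StreamRing {
public:
    // Producer side (T1). Returns false while all N entries are still unread.
    bool tryPublish(const StreamSet& s) {
        const std::size_t w = written_.load(std::memory_order_relaxed);
        if (w - consumed_.load(std::memory_order_acquire) == N) {
            return false;  // ring is full; the consumer has not freed an entry yet
        }
        slots_[w % N] = s;
        written_.store(w + 1, std::memory_order_release);  // hand the entry to T2
        return true;
    }

    // Consumer side (T2). Returns false while no filled entry is available.
    bool tryConsume(StreamSet& out) {
        const std::size_t c = consumed_.load(std::memory_order_relaxed);
        if (c == written_.load(std::memory_order_acquire)) {
            return false;  // ring is empty; the producer has not published yet
        }
        out = slots_[c % N];
        consumed_.store(c + 1, std::memory_order_release);  // hand the entry back to T1
        return true;
    }

private:
    std::array<StreamSet, N> slots_{};
    std::atomic<std::size_t> written_{0};   // entries published so far
    std::atomic<std::size_t> consumed_{0};  // entries consumed so far
};
```
]]></programlisting>
      <para> In a pipeline built on such a ring, <code>T<subscript>1</subscript></code> would call
         <code>tryPublish</code> after processing each input segment and retry when it fails, which
         corresponds to the situation in <xref linkend="threads_timeline2"/>. Failed calls to
         <code>tryConsume</code> by <code>T<subscript>2</subscript></code> correspond to the waiting
         shown in <xref linkend="threads_timeline1"/>. </para>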
869      <para>
870        <figure xml:id="threads_timeline1">
871          <title>Thread Balance in Two-Stage Pipelines: Stage 1 Dominant</title>
872          <mediaobject>
873            <imageobject>
874              <imagedata format="png" fileref="threads_timeline1.png" width="500cm"/>
875            </imageobject>
876          </mediaobject>
877         </figure>
878        <figure xml:id="threads_timeline2">
879          <title>Thread Balance in Two-Stage Pipelines: Stage 2 Dominant</title>
880        <mediaobject>
881            <imageobject>
882              <imagedata format="png" fileref="threads_timeline2.png" width="500cm"/>
883            </imageobject>
884          </mediaobject>
885        </figure>
886      </para>
      <para> Overall, our design is intended to benefit a range of applications. Conceptually, we
         consider two design points. In the first, the parsing performed by the Parabix Subsystem
         dominates at 67% of the overall cost, with the cost of application processing (including
         the driver logic within the Markup Processor) at 33%. The second is almost the opposite
         scenario: the cost of application processing dominates at 60%, while the cost of XML
         parsing represents an overhead of 40%. </para>
      <para> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
         100% improvement in the parsing engine itself. As a best-case scenario, consider a 100%
         improvement of the Parabix Subsystem at the design point in which XML parsing dominates at 67% of the
         total application cost. In this case, the single-threaded icXML should achieve a 1.5x
         speedup over Xerces, so that the total application cost reduces to 67% of the original.
         However, in icXML-p, our ideal scenario gives us two well-balanced threads, each performing
         about 33% of the original work. In this case, Amdahl's law predicts that we could expect
         a speedup of up to 3x. </para>
901      <para> At the other extreme of our design range, we consider an application in which core
902         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
903         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
904         significant aspect of our two-stage multi-threaded design then becomes the ability to hide
         the entire latency of parsing within the serial time required by the application. In this
         case, we achieve an overall speedup in processing time of 1.67x. </para>
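      <para> The arithmetic behind the two design points can be checked with a small model. The
         sketch below is an idealization rather than a measurement: it assumes the end-to-end
         Xerces cost is normalized to 1, that only the parsing share is accelerated by a constant
         factor, and that the two pipeline stages overlap perfectly with no synchronization cost.
         The function names are hypothetical. </para>
      <programlisting xml:space="preserve"><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// parseShare: fraction of the original Xerces run spent in core parsing (0..1).
// parseGain:  speedup factor of the Parabix Subsystem over the Xerces core.

// Single-threaded icXML: accelerated parsing and application work run back to back.
double singleThreadSpeedup(double parseShare, double parseGain) {
    const double parseTime = parseShare / parseGain;
    const double appTime = 1.0 - parseShare;
    return 1.0 / (parseTime + appTime);
}

// Idealized two-stage pipeline: the slower stage alone determines the elapsed time.
double pipelinedSpeedup(double parseShare, double parseGain) {
    const double parseTime = parseShare / parseGain;
    const double appTime = 1.0 - parseShare;
    return 1.0 / std::max(parseTime, appTime);
}
```
]]></programlisting>
      <para> Under these assumptions, a parsing share of 0.67 with a gain of 2 yields roughly 1.5
         single-threaded and roughly 3.0 pipelined, while a parsing share of 0.40 yields 1.25 and
         roughly 1.67, in line with the figures quoted above. </para>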
907      <para> Although the structure of the Parabix Subsystem allows division of the work into
908         several pipeline stages and has been demonstrated to be effective for four pipeline stages
909         in a research prototype <citation linkend="HPCA2012"/>, our analysis here suggests that the further
         pipelining of work within the Parabix Subsystem is not worthwhile if the cost of
         application logic is as little as 33% of the end-to-end cost using Xerces. To achieve benefits
912         of further parallelization with multi-core technology, there would need to be reductions in
913         the cost of application logic that could match reductions in core parsing cost. </para>
914   </section>
916   <section xml:id="performance">
917      <title>Performance</title>
      <para> We evaluate Xerces-C++ 3.1.1, icXML, and icXML-p using two benchmark applications: the
         Xerces C++ SAXCount sample application, and a real-world GML-to-SVG transformation
         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
921         Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core)
922         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
923         12.04 (Linux). </para>
924      <para> We analyzed the execution profiles of each XML parser using the performance counters
925         found in the processor. We chose several key hardware events that provide insight into the
         profile of each application and indicate if the processor is doing useful work. The
         events included in our study are processor cycles, branch instructions, branch
         mispredictions, and cache misses. The Performance Application Programming Interface (PAPI)
929         Version 5.5.0 <citation linkend="papi"/> toolkit was installed on the test system to facilitate the
930         collection of hardware performance monitoring statistics. In addition, we used the Linux
931         perf <citation linkend="perf"/> utility to collect per core hardware events. </para>
932      <section>
933         <title>Xerces C++ SAXCount</title>
934         <para> Xerces comes with sample applications that demonstrate salient features of the
935            parser. SAXCount is the simplest such application: it counts the elements, attributes
936            and characters of a given XML file using the (event based) SAX API and prints out the
937            totals. </para>
         <para> <xref linkend="XMLdocs"/> shows the document characteristics of the XML input files
            selected for the Xerces C++ SAXCount benchmark. The jaw.xml file represents document-oriented
            XML inputs and contains the three-byte and four-byte UTF-8 sequences required for the
            encoding of Japanese characters. The remaining data files are data-oriented XML
            documents and consist entirely of single-byte encoded ASCII characters.
944  <table xml:id="XMLdocs">
945                  <caption>
946                     <para>XML Document Characteristics</para>
947                  </caption>
948                  <colgroup>
949                     <col align="left" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
954                  </colgroup>
955                  <tbody>
                     <tr><td>File Name</td><td>jaw.xml</td><td>road.gml</td><td>po.xml</td><td>soap.xml</td></tr>
                     <tr><td>File Type</td><td>document</td><td>data</td><td>data</td><td>data</td></tr>
                     <tr><td>File Size (kB)</td><td>7343</td><td>11584</td><td>76450</td><td>2717</td></tr>
                     <tr><td>Markup Item Count</td><td>74882</td><td>280724</td><td>4634110</td><td>18004</td></tr>
                     <tr><td>Markup Density</td><td>0.13</td><td>0.57</td><td>0.76</td><td>0.87</td></tr>
961                  </tbody>
               </table>
         </para>
964         <para> A key predictor of the overall parsing performance of an XML file is markup
965           density<footnote><para>Markup Density: the ratio of markup bytes used to define the structure
966             of the document vs. its file size.</para></footnote>. This metric has substantial influence on the
967            performance of traditional recursive descent XML parsers because it directly corresponds
968            to the number of state transitions that occur when parsing a document. We use a mixture
969            of document-oriented and data-oriented XML files to analyze performance over a spectrum
970            of markup densities. </para>
971         <para> <xref linkend="perf_SAX"/> compares the performance of Xerces, icXML and pipelined icXML
972            in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over
            Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p can achieve
            a speedup of up to 2.7x. Xerces is substantially slowed by dense markup, but icXML is less
            affected because of its reduction in branches and its use of parallel-processing techniques.
976            icXML-p performs better as markup-density increases because the work performed by each
977            stage is well balanced in this application. </para>
978         <para>
979        <figure xml:id="perf_SAX">
980          <title>SAXCount Performance Comparison</title>
981          <mediaobject>
982            <imageobject>
983              <imagedata format="png" fileref="perf_SAX.png" width="500cm"/>
984            </imageobject>
985          </mediaobject>
986          <caption>
987          </caption>
988        </figure>
989         </para>
990      </section>
991      <section>
992         <title>GML2SVG</title>
<para>As a more substantial application of XML processing, the GML-to-SVG (GML2SVG) application
was chosen. This application transforms geospatial data encoded in the XML-based
Geography Markup Language (GML) <citation linkend="lake2004geography"/>
into a different XML format suitable for displayable maps:
the Scalable Vector Graphics (SVG) format <citation linkend="lu2007advances"/>. In the GML2SVG benchmark, GML feature element
and GML geometry element tags are matched. GML coordinate data are then extracted
and transformed to the corresponding SVG path data encodings.
Equivalent SVG path elements are generated and output to the destination
SVG document. The GML2SVG application is thus considered typical of a broad
class of XML applications that parse and extract information from
a known XML format for the purpose of analysis and restructuring to meet
the requirements of an alternative format.</para>
1006<para>Our GML to SVG data translations are executed on GML source data
1007modelling the city of Vancouver, British Columbia, Canada.
1008The GML source document set
consists of 46 distinct GML feature layers ranging in size from approximately 9 KB to 125.2 MB,
with an average document size of 18.6 MB. Markup density ranges from approximately 0.045 to 0.719,
with an average markup density of 0.519. In this performance study,
1012213.4 MB of source GML data generates 91.9 MB of target SVG data.</para>
1015        <figure xml:id="perf_GML2SVG">
1016          <title>Performance Comparison for GML2SVG</title>
1017          <mediaobject>
1018            <imageobject>
1019              <imagedata format="png" fileref="Throughput.png" width="500cm"/>
1020            </imageobject>
1021          </mediaobject>
1022          <caption>
1023          </caption>
1024        </figure>
<para><xref linkend="perf_GML2SVG"/> compares the performance of the GML2SVG application linked against
Xerces, icXML and icXML-p.
On the GML workload with this application, single-threaded icXML
achieved about a 50% acceleration over Xerces,
increasing throughput on our test machine from 58.3 MB/sec to 87.9 MB/sec.
Using icXML-p, a further throughput increase to 111 MB/sec was recorded,
approximately a 1.9x speedup over Xerces.</para>
1034<para>An important aspect of icXML is the replacement of much branch-laden
1035sequential code inside Xerces with straight-line SIMD code using far
1036fewer branches.  <xref linkend="branchmiss_GML2SVG"/> shows the corresponding
1037improvement in branching behaviour, with a dramatic reduction in branch misses per kB.
1038It is also interesting to note that icXML-p goes even further.   
1039In essence, in using pipeline parallelism to split the instruction
1040stream onto separate cores, the branch target buffers on each core are
1041less overloaded and able to increase the successful branch prediction rate.</para>
1043        <figure xml:id="branchmiss_GML2SVG">
1044          <title>Comparative Branch Misprediction Rate</title>
1045          <mediaobject>
1046            <imageobject>
1047              <imagedata format="png" fileref="BM.png" width="500cm"/>
1048            </imageobject>
1049          </mediaobject>
1050          <caption>
1051          </caption>
1052        </figure>
1054<para>The behaviour of the three versions with respect to L1 cache misses per kB is shown
1055in <xref linkend="cachemiss_GML2SVG"/>.   Improvements are shown in both instruction-
1056and data-cache performance with the improvements in instruction-cache
1057behaviour the most dramatic.   Single-threaded icXML shows substantially improved
1058performance over Xerces on both measures.   
1059Although icXML-p is slightly worse with respect to data-cache performance,
1060this is more than offset by a further dramatic reduction in instruction-cache miss rate.
Again, partitioning the instruction stream through the pipeline parallelism model has
1062significant benefit.</para>
1064        <figure xml:id="cachemiss_GML2SVG">
1065          <title>Comparative Cache Miss Rate</title>
1066          <mediaobject>
1067            <imageobject>
1068              <imagedata format="png" fileref="CM.png" width="500cm"/>
1069            </imageobject>
1070          </mediaobject>
1071          <caption>
1072          </caption>
1073        </figure>
<para>One caveat of this study is that, in the GML2SVG application, the share of
processing spent in application code relative to Xerces library
code did not reach the 33% figure.  This suggests that for this application and
1078possibly others, further separating the logical layers of the
1079icXML engine into different pipeline stages could well offer significant benefit.
1080This remains an area of ongoing work.</para>
1081      </section>
1082   </section>
1084   <section xml:id="conclusion">
1085      <title>Conclusion and Future Work</title>
1086      <para> This paper is the first case study documenting the significant performance benefits
1087         that may be realized through the integration of parallel bitstream technology into existing
         widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined
         integration of SIMD and multicore parallelism was shown capable of producing
         dramatic increases in throughput and reductions in branch mispredictions and cache misses.
         The modified parser, going under the name icXML, is designed to provide the full
1092         functionality of the original Xerces library with complete compatibility of APIs. Although
1093         substantial re-engineering was required to realize the performance potential of parallel
1094         technologies, this is an important case study demonstrating the general feasibility of
1095         these techniques. </para>
1096      <para> The further development of icXML to move beyond 2-stage pipeline parallelism is
1097         ongoing, with realistic prospects for four reasonably balanced stages within the library.
1098         For applications such as GML2SVG which are dominated by time spent on XML parsing, such a
1099         multistage pipelined parsing library should offer substantial benefits. </para>
      <para> The example of XML parsing may be considered prototypical of finite-state machine
         applications, which have sometimes been considered &quot;embarrassingly
         sequential&quot; and so difficult to parallelize that &quot;nothing
         works.&quot; The case study presented here should therefore be considered an important data
         point in making the case that parallelization can indeed be helpful across a broad array of
         application types. </para>
1106      <para> To overcome the software engineering challenges in applying parallel bitstream
1107         technology to existing software systems, it is clear that better library and tool support
1108         is needed. The techniques used in the implementation of icXML and documented in this paper
1109         could well be generalized for applications in other contexts and automated through the
1110         creation of compiler technology specifically supporting parallel bitstream programming.
      </para>
      <para> Given the success of the icXML development, there is a strong case for continued
            development of the Parabix framework as well as for the application of Parabix
            to other important XML technology stacks.   In particular, an important area for further
            work is to extend the benefits of SIMD and multicore parallelism to the acceleration
            of Java-based XML processors.
      </para>
1118   </section>
1121  <title>Bibliography</title>
1122  <bibliomixed xml:id="CameronHerdyLin2008" xreflabel="Parabix1 2008">Cameron, Robert D., Herdy, Kenneth S. and Lin, Dan. High performance XML parsing using parallel bit stream technology. CASCON'08: Proc. 2008 conference of the center for advanced studies on collaborative research. Richmond Hill, Ontario, Canada. 2008.</bibliomixed>
  <bibliomixed xml:id="papi" xreflabel="PAPI">Innovative Computing Laboratory, University of Tennessee. Performance Application Programming Interface.<link></link></bibliomixed>
1124  <bibliomixed xml:id="perf" xreflabel="perf">Eranian, Stephane, Gouriou, Eric, Moseley, Tipp and Bruijn, Willem de. Linux kernel profiling with perf. <link></link></bibliomixed>
  <bibliomixed xml:id="Cameron2008" xreflabel="u8u16 2008">Cameron, Robert D. A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding. Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Salt Lake City, USA. 2008.</bibliomixed>
1126  <bibliomixed xml:id="ParaDOM2009" xreflabel="Shah and Rao 2009">Shah, Bhavik, Rao, Praveen, Moon, Bongki and Rajagopalan, Mohan. A Data Parallel Algorithm for XML DOM Parsing. Database and XML Technologies. 2009.</bibliomixed>
1127  <bibliomixed xml:id="XMLSSE42" xreflabel="Lei 2008">Lei, Zhai. XML Parsing Accelerator with Intel Streaming SIMD Extensions 4 (Intel SSE4). <link>Intel Software Network</link>.  2008.</bibliomixed>
  <bibliomixed xml:id="Cameron2009" xreflabel="Balisage 2009">Cameron, Rob, Herdy, Ken and Amiri, Ehsan. Parallel Bit Stream Technology as a Foundation for XML Parsing Performance. Int'l Symposium on Processing XML Efficiently: Overcoming Limits on Space, Time, or Bandwidth. Montreal, Quebec, Canada.  2009.</bibliomixed>
  <bibliomixed xml:id="HilewitzLee2006" xreflabel="Hilewitz and Lee 2006">Hilewitz, Yedidya and Lee, Ruby B. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions. ASAP '06: Proc. IEEE 17th Int'l Conference on Application-specific Systems, Architectures and Processors. Steamboat Springs, Colorado, USA.  2006.</bibliomixed>
1130  <bibliomixed xml:id="Asanovic-EECS-2006-183" xreflabel="Asanovic et al. 2006">Asanovic, Krste and others. The Landscape of Parallel Computing Research: A View from Berkeley. EECS Department, University of California, Berkeley.  2006.</bibliomixed>
1131  <bibliomixed xml:id="GRID2006" xreflabel="Lu and Chiu 2006">Lu, Wei, Chiu, Kenneth and Pan, Yinfei. A Parallel Approach to XML Parsing. Proceedings of the 7th IEEE/ACM International Conference on Grid Computing. Barcelona, Spain.  2006.</bibliomixed>
  <bibliomixed xml:id="cameron-EuroPar2011" xreflabel="Parabix2 2011">Cameron, Robert D., Amiri, Ehsan, Herdy, Kenneth S., Lin, Dan, Shermer, Thomas C. and Popowich, Fred P. Parallel Scanning with Bitstream Addition: An XML Case Study. Euro-Par 2011, LNCS 6853, Part II.  Bordeaux, France. 2011.</bibliomixed>
1133  <bibliomixed xml:id="HPCA2012" xreflabel="Lin and Medforth 2012">Lin, Dan, Medforth, Nigel, Herdy, Kenneth S., Shriraman, Arrvindh and Cameron, Rob. Parabix: Boosting the efficiency of text processing on commodity processors. International Symposium on High-Performance Computer Architecture. New Orleans, LA. 2012.</bibliomixed>
1134  <bibliomixed xml:id="HPCC2011" xreflabel="You and Wang 2011">You, Cheng-Han and Wang, Sheng-De. A Data Parallel Approach to XML Parsing and Query. 10th IEEE International Conference on High Performance Computing and Communications. Banff, Alberta, Canada. 2011.</bibliomixed>
1135  <bibliomixed xml:id="E-SCIENCE2007" xreflabel="Pan and Zhang 2007">Pan, Yinfei, Zhang, Ying, Chiu, Kenneth and Lu, Wei. Parallel XML Parsing Using Meta-DFAs. International Conference on e-Science and Grid Computing.   Bangalore, India.  2007.</bibliomixed>
1136  <bibliomixed xml:id="ICWS2008" xreflabel="Pan and Zhang 2008a">Pan, Yinfei, Zhang, Ying and Chiu, Kenneth. Hybrid Parallelism for XML SAX Parsing. IEEE International Conference on Web Services. Beijing, China.  2008.</bibliomixed>
1137  <bibliomixed xml:id="IPDPS2008" xreflabel="Pan and Zhang 2008b">Pan, Yinfei, Zhang, Ying and Chiu, Kenneth. Simultaneous transducers for data-parallel XML parsing. International Parallel and Distributed Processing Symposium. Miami, Florida, USA.  2008.</bibliomixed>
  <bibliomixed xml:id="HackersDelight" xreflabel="Warren 2002">Warren, Henry S. Hacker's Delight. Addison-Wesley Professional. 2003.</bibliomixed>
  <bibliomixed xml:id="lu2007advances" xreflabel="Lu and Dos Santos 2007">Lu, C.T., Dos Santos, R.F., Sripada, L.N. and Kou, Y. Advances in GML for geospatial applications. Geoinformatica 11:131-157.  2007.</bibliomixed>
  <bibliomixed xml:id="lake2004geography" xreflabel="Lake and Burggraf 2004">Lake, R., Burggraf, D.S., Trninic, M. and Rae, L. Geography mark-up language (GML) [foundation for the geo-web]. Wiley.  Chichester.  2004.</bibliomixed>