source: docs/Balisage13/Bal2013came0601/Bal2013came0601.xml @ 3394

1<?xml version="1.0" encoding="UTF-8"?>
2<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
3<article xmlns="" version="5.0-subset Balisage-1.3"
4   xml:id="HR-23632987-8973">
5   <title/>
6   <info>
7      <abstract>
         <para>Prior research on the acceleration of XML processing using SIMD and multi-core
            parallelism has led to a number of interesting research prototypes. This work
10            investigates the extent to which the techniques underlying these prototypes could result
11            in systematic performance benefits when fully integrated into a commercial XML parser.
12            The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the
13            foundation for the study. A systematic restructuring of the parser was undertaken, while
14            maintaining the existing API for application programmers. Using SIMD techniques alone,
15            an increase in parsing speed of at least 50% was observed in a range of applications.
16            When coupled with pipeline parallelism on dual core processors, improvements of 2x and
17            beyond were realized. </para>
18      </abstract>
19      <author>
20         <personname>
21            <firstname>Nigel</firstname>
22            <surname>Medforth</surname>
23         </personname>
24         <personblurb>
            <para>Nigel Medforth is an M.Sc. student at Simon Fraser University and the lead
26               developer of icXML. He earned a Bachelor of Technology in Information Technology at
27               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
28               Outstanding Achievement.</para>
29            <para>Nigel is currently researching ways to leverage both the Parabix framework and
30               stream-processing models to further accelerate XML parsing within icXML.</para>
31         </personblurb>
32         <affiliation>
33            <jobtitle>Developer</jobtitle>
34            <orgname>International Characters Inc.</orgname>
35         </affiliation>
36         <affiliation>
37            <jobtitle>Graduate Student</jobtitle>
38            <orgname>School of Computing Science, Simon Fraser University </orgname>
39         </affiliation>
40         <email></email>
41      </author>
42      <author>
43         <personname>
44            <firstname>Dan</firstname>
45            <surname>Lin</surname>
46         </personname>
47         <personblurb>
           <para>Dan Lin is a Ph.D. student at Simon Fraser University. She earned a Master of Science
             in Computing Science at Simon Fraser University in 2010. Her research focuses on high-performance
             algorithms that exploit parallelization strategies on various multicore platforms.
51           </para>
52         </personblurb>
53         <affiliation>
54            <jobtitle>Graduate Student</jobtitle>
55            <orgname>School of Computing Science, Simon Fraser University </orgname>
56         </affiliation>
57         <email></email>
58      </author>
59      <author>
60         <personname>
61            <firstname>Kenneth</firstname>
62            <surname>Herdy</surname>
63         </personname>
64         <personblurb>
65            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
66               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
67               of Science in Computing Science with a Certificate in Spatial Information Systems at
68               Simon Fraser University in 2005. </para>
69            <para> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
70               University with industrial scholarship support from the Natural Sciences and
71               Engineering Research Council of Canada, the Mathematics of Information Technology and
72               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
73               of the principal techniques that may be used to improve XML processing performance in
74               the context of the Geography Markup Language (GML). </para>
75         </personblurb>
76         <affiliation>
77            <jobtitle>Graduate Student</jobtitle>
78            <orgname>School of Computing Science, Simon Fraser University </orgname>
79         </affiliation>
80         <email></email>
81      </author>
82      <author>
83         <personname>
84            <firstname>Rob</firstname>
85            <surname>Cameron</surname>
86         </personname>
87         <personblurb>
88            <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
89               Sciences at Simon Fraser University. His research interests include programming
90               language and software system technology, with a specific focus on high performance
91               text processing using SIMD and multicore parallelism. He is the developer of the REX
92               XML shallow parser as well as the parallel bit stream (Parabix) framework for SIMD
93               text processing. </para>
94         </personblurb>
95         <affiliation>
96            <jobtitle>Professor of Computing Science</jobtitle>
97            <orgname>Simon Fraser University</orgname>
98         </affiliation>
99         <affiliation>
100            <jobtitle>Chief Technology Officer</jobtitle>
101            <orgname>International Characters, Inc.</orgname>
102         </affiliation>
103         <email></email>
104      </author>
105      <author>
106         <personname>
107            <firstname>Arrvindh</firstname>
108            <surname>Shriraman</surname>
109         </personname>
110         <personblurb>
111            <para/>
112         </personblurb>
113         <affiliation>
114            <jobtitle>Assistant Professor</jobtitle>
115            <orgname>School of Computing Science, Simon Fraser University</orgname>
116         </affiliation>
117         <email></email>
118      </author>
119      <!--
120      <legalnotice>
121         <para>Copyright &#x000A9; 2013 Nigel Medforth, Dan Lin, Kenneth S. Herdy, Robert D. Cameron  and Arrvindh Shriraman.
122            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
123            Works 2.5 Canada License.</para>
      </legalnotice>
      -->
126      <keywordset role="author">
127         <keyword/>
128      </keywordset>
130   </info>
131 <section>
132      <title>Introduction</title>
133      <para>   
134        Parallelization and acceleration of XML parsing is a widely
135        studied problem that has seen the development of a number
136        of interesting research prototypes using both SIMD and
137        multicore parallelism.   Most works have investigated
138        data parallel solutions on multicore
139        architectures using various strategies to break input
140        documents into segments that can be allocated to different cores.
141        For example, one possibility for data
142        parallelization is to add a pre-parsing step to compute
143        a skeleton tree structure of an  XML document <citation linkend="GRID2006"/>.
144        The parallelization of the pre-parsing stage itself can be tackled with
145          state machines <citation linkend="E-SCIENCE2007"/>, <citation linkend="IPDPS2008"/>.
146        Methods without pre-parsing have used speculation <citation linkend="HPCC2011"/> or post-processing that
147        combines the partial results <citation linkend="ParaDOM2009"/>.
148        A hybrid technique that combines data and pipeline parallelism was proposed to
149        hide the latency of a "job" that has to be done sequentially <citation linkend="ICWS2008"/>.
150      </para>
151      <para>
152        Fewer efforts have investigated SIMD parallelism, although this approach
153        has the potential advantage of improving single core performance as well
154        as offering savings in energy consumption <citation linkend="HPCA2012"/>.
155        Intel introduced specialized SIMD string processing instructions in the SSE 4.2 instruction set extension
156        and showed how they can be used to improve the performance of XML parsing <citation linkend="XMLSSE42"/>.
157        The Parabix framework uses generic SIMD extensions and bit parallel methods to
158        process hundreds of XML input characters simultaneously <citation linkend="Cameron2009"/> <citation linkend="cameron-EuroPar2011"/>.
159        Parabix prototypes have also combined SIMD methods with thread-level parallelism to
160        achieve further acceleration on multicore systems <citation linkend="HPCA2012"/>.
161      </para>
162      <para>
163        In this paper, we move beyond research prototypes to consider
164        the detailed integration of both SIMD and multicore parallelism into the
165        Xerces-C++ parser of the Apache Software Foundation, an existing
166        standards-compliant open-source parser that is widely used
        in commercial practice. The challenge of this work is
        to parallelize the Xerces parser in a way that
        preserves the existing APIs while offering worthwhile
        end-to-end acceleration of XML processing.
171        To achieve the best results possible, we undertook
172        a nine-month comprehensive restructuring of the Xerces-C++ parser,
173        seeking to expose as many critical aspects of XML parsing
174        as possible for parallelization, the result of which we named icXML.   
175        Overall, we employed Parabix-style methods of transcoding, tokenization
176        and tag parsing, parallel string comparison methods in symbol
177        resolution, bit parallel methods in namespace processing,
178        as well as staged processing using pipeline parallelism to take advantage of
179        multiple cores.
180      </para>
181      <para>
182        The remainder of this paper is organized as follows.   
183          <xref linkend="background"/> discusses the structure of the Xerces and Parabix XML parsers and the fundamental
184        differences between the two parsing models.   
185        <xref linkend="architecture"/> then presents the icXML design based on a restructured Xerces architecture to
186        incorporate SIMD parallelism using Parabix methods.   
187        <xref linkend="multithread"/> moves on to consider the multithreading of the icXML architecture
188        using the pipeline parallelism model. 
189        <xref linkend="performance"/> analyzes the performance of both the single-threaded and
190        multi-threaded versions of icXML in comparison to original Xerces,
191        demonstrating substantial end-to-end acceleration of
192        a GML-to-SVG translation application written against the Xerces API.
193          <xref linkend="conclusion"/> concludes the paper with a discussion of future work and the potential for
194        applying the techniques discussed herein in other application domains.
195      </para>
196   </section>
198   <section xml:id="background">
199      <title>Background</title>
200      <section xml:id="background-xerces">
201         <title>Xerces C++ Structure</title>
         <para> The Xerces C++ parser is a widely used, standards-conformant
203            XML parser produced as open-source software
204             by the Apache Software Foundation.
205            It features comprehensive support for a variety of character encodings both
206            commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple
207            XML vocabularies through the XML namespace mechanism, as well as complete
208            implementations of structure and data validation through multiple grammars declared
209            using either legacy DTDs (document type definitions) or modern XML Schema facilities.
210            Xerces also supports several APIs for accessing parser services, including event-based
211            parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM
212            tree-based parsing interface. </para>
213         <para>
            Xerces,
            like all traditional parsers, processes XML documents sequentially, one byte at a time, from
            the first to the last byte of input data. Each byte passes through several processing
            layers and is classified and eventually validated within the context of the document
            state. This introduces implicit dependencies between the various tasks within the
            application that make it difficult to optimize for performance. Because Xerces is a complex
            software system, no single feature dominates its overall parsing performance. <xref linkend="xerces-profile"/>
            shows the execution time profile of the top ten functions in a
            typical run. Even if any one of these functions could be parallelized in isolation,
            Amdahl's Law dictates that doing so would produce only a minute improvement
            in performance. Unfortunately, early investigation into these functions found that
            incorporating speculation-free thread-level parallelization was impossible and that they were
            already performing well in their given tasks; thus only trivial enhancements were
            attainable. A systematic acceleration of Xerces should therefore be
            expected to require a comprehensive restructuring involving all aspects of the
            parser. </para>
230             <table xml:id="xerces-profile">
231                  <caption>
232                     <para>Execution Time of Top 10 Xerces Functions</para>
233                  </caption>
234                  <colgroup>
235                     <col align="left" valign="top"/>
236                     <col align="left" valign="top"/>
237                  </colgroup>
238                  <thead><tr><th>Time (%) </th><th> Function Name </th></tr></thead>
239                  <tbody>
240<tr valign="top"><td>13.29      </td>   <td>XMLUTF8Transcoder::transcodeFrom </td></tr>
241<tr valign="top"><td>7.45       </td>   <td>IGXMLScanner::scanCharData </td></tr>
242<tr valign="top"><td>6.83       </td>   <td>memcpy </td></tr>
243<tr valign="top"><td>5.83       </td>   <td>XMLReader::getNCName </td></tr>
244<tr valign="top"><td>4.67       </td>   <td>IGXMLScanner::buildAttList </td></tr>
<tr valign="top"><td>4.54       </td>   <td>RefHashTableOf&lt;&gt;::findBucketElem </td></tr>
246<tr valign="top"><td>4.20       </td>   <td>IGXMLScanner::scanStartTagNS </td></tr>
247<tr valign="top"><td>3.75       </td>   <td>ElemStack::mapPrefixToURI </td></tr>
248<tr valign="top"><td>3.58       </td>   <td>ReaderMgr::getNextChar </td></tr>
249<tr valign="top"><td>3.20       </td>   <td>IGXMLScanner::basicAttrValueScan </td></tr>
250                  </tbody>
251               </table>
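         <para> To make the Amdahl's Law argument concrete, the following Python sketch computes the
            overall speedup obtained by accelerating only the most expensive function in
            <xref linkend="xerces-profile"/>. The function name <code>amdahl_speedup</code> and the
            chosen local speedups are illustrative assumptions; only the 13.29% figure is taken from
            the profile. </para>
         <programlisting language="python"><![CDATA[
```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the run time is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Share of execution time spent in XMLUTF8Transcoder::transcodeFrom (profile table).
TRANSCODE_FRACTION = 0.1329

# Hypothetical scenarios: a transcoder that is twice as fast, and one that costs nothing.
doubled = amdahl_speedup(TRANSCODE_FRACTION, 2.0)
ceiling = amdahl_speedup(TRANSCODE_FRACTION, float("inf"))

print(f"2x faster transcoder:       {doubled:.3f}x overall")
print(f"infinitely fast transcoder: {ceiling:.3f}x overall")
```
]]></programlisting>
         <para> Under these assumptions, doubling the speed of the transcoder alone yields roughly a
            1.07x overall gain, and even eliminating its cost entirely cannot exceed about 1.15x. </para>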
252      </section>
253      <section>
254         <title>The Parabix Framework</title>
         <para> The Parabix (parallel bit stream) framework is a transformative approach to XML
            parsing (and other forms of text processing). The key idea is to exploit the
            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
            data from long blocks of input data by using one register bit per input byte. To
            facilitate this, the input data is first transposed into a set of basis bit streams.
              For example, <xref linkend="xml-bytes"/> shows the ASCII bytes for the string "<code>b7&lt;A</code>" with
                the corresponding 8 basis bit streams, b<subscript>0</subscript> through b<subscript>7</subscript>, shown in <xref linkend="xml-bits"/>.
            The bits used to construct b<subscript>7</subscript> have been highlighted in this example.
              Boolean-logic operations (&#8743;, &#8744; and &#172; denote the
              boolean AND, OR and NOT operators) are used to classify the input bits into a set of
265               <emphasis role="ital">character-class bit streams</emphasis>, which identify key
266            characters (or groups of characters) with a <code>1</code>. For example, one of the
267            fundamental characters in XML is a left-angle bracket. A character is an
268               <code>&apos;&lt;&apos; if and only if
269               &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
270               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
271               &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>)
272               &#8743; &#172; (b<subscript>6</subscript> &#8744;
273               b<subscript>7</subscript>) = 1</code>. Similarly, a character is numeric, <code>[0-9]
274               if and only if &#172;(b<subscript>0</subscript> &#8744;
275               b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743;
276                  b<subscript>3</subscript>) &#8743; &#172;(b<subscript>4</subscript>
277               &#8743; (b<subscript>5</subscript> &#8744;
278            b<subscript>6</subscript>))</code>. An important observation here is that ranges of
279            characters may require fewer operations than individual characters and
280            <!-- the classification cost could be amortized over many character classes.--> multiple
281            classes can share the classification cost. </para>
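         <para> The transposition and classification steps can be sketched in a few lines of Python.
            This is only an illustrative model, not the Parabix implementation: unbounded Python
            integers stand in for SIMD registers, bit <emphasis role="ital">i</emphasis> of each
            integer corresponds to input byte <emphasis role="ital">i</emphasis>, and the helper names
            <code>transpose</code>, <code>classify</code> and <code>show</code> are our own. The two
            class equations are the ones given above, and the subexpression for the high nibble is
            computed once and shared by both classes. </para>
         <programlisting language="python"><![CDATA[
```python
def transpose(data):
    """Return basis bit streams b[0]..b[7] for a byte string.

    b[0] collects the most significant bit of every byte and b[7] the
    least significant; bit i of each stream belongs to input byte i.
    """
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << pos
    return basis


def classify(data):
    """Compute the '<' and [0-9] character-class streams from the basis streams."""
    b = transpose(data)
    mask = (1 << len(data)) - 1  # keep only the bits that belong to real input
    # High nibble 0011 is common to '<' and to the digits, so compute it once.
    high_nibble_3 = ~(b[0] | b[1]) & (b[2] & b[3]) & mask
    left_angle = high_nibble_3 & (b[4] & b[5]) & ~(b[6] | b[7])
    digit = high_nibble_3 & ~(b[4] & (b[5] | b[6]))
    return left_angle, digit


def show(stream, length):
    """Render a stream with position 0 on the left, using '_' for 0 bits."""
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(length))


left_angle, digit = classify(b"b7<A")
print("<     ", show(left_angle, 4))
print("[0-9] ", show(digit, 4))
```
]]></programlisting>
         <para> For the string "<code>b7&lt;A</code>", the left-angle stream marks only the third
            position and the digit stream marks only the second. </para>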
282         <table xml:id="xml-bytes">
283                  <caption>
284                     <para>XML Source Data</para>
285                  </caption>
286                  <colgroup>
287                     <col align="right" valign="top"/>
288                     <col align="centre" valign="top"/>
289                     <col align="centre" valign="top"/>
290                     <col align="centre" valign="top"/>
291                     <col align="centre" valign="top"/>
292                  </colgroup>
293                  <tbody>
294  <tr><td>String </td><td> <code>b</code> </td><td> <code>7</code> </td><td> <code>&lt;</code> </td><td> <code>A</code> </td></tr>
295  <tr><td>ASCII </td><td> <code>0110001<emphasis role="bold">0</emphasis></code> </td><td> <code>0011011<emphasis role="bold">1</emphasis></code> </td><td> <code>0011110<emphasis role="bold">0</emphasis></code> </td><td> <code>0100000<emphasis role="bold">1</emphasis></code> </td></tr>
  </tbody>
               </table>
300         <table xml:id="xml-bits">
301                  <caption>
302                     <para>8-bit ASCII Basis Bit Streams</para>
303                  </caption>
304                  <colgroup>
305                     <col align="centre" valign="top"/>
306                     <col align="centre" valign="top"/>
307                     <col align="centre" valign="top"/>
308                     <col align="centre" valign="top"/>
309                     <col align="centre" valign="top"/>
310                     <col align="centre" valign="top"/>
311                     <col align="centre" valign="top"/>
312                     <col align="centre" valign="top"/>
313                  </colgroup>
314                  <tbody>
315<tr><td> b<subscript>0</subscript> </td><td> b<subscript>1</subscript> </td><td> b<subscript>2</subscript> </td><td> b<subscript>3</subscript></td><td> b<subscript>4</subscript> </td><td> b<subscript>5</subscript> </td><td> b<subscript>6</subscript> </td><td> b<subscript>7</subscript> </td></tr>
316 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
317 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
318 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
319 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
  </tbody>
               </table>
325         <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
326         <!-- bit streams can be transformed into lexical bit streams, where the presense of -->
327         <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
328         <!-- process, intra-element well-formedness validation is performed on each block -->
329         <!-- of text. -->
         <para> Consider, for example, the XML source data stream shown in the first line of <xref linkend="derived"/>.
            The remaining lines of this table show
            several parallel bit streams that are computed in Parabix-style parsing, with each bit
            of each stream in one-to-one correspondence to the source character code units of the
            input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are
            represented as underscores. The first bit stream shown is that for the opening angle
            brackets that represent tag openers in XML. The second and third streams show a
            partition of the tag openers into start tag marks and end tag marks depending on whether the
            character immediately following the opener is a slash (&quot;<code>/</code>&quot;) or
            not. The fourth stream marks empty-element tags, of which this example has none. The remaining
            three lines show streams that can be computed in subsequent parsing
            (using the technique of bitstream addition <citation linkend="cameron-EuroPar2011"/>), namely streams
            marking the element names, attribute names and attribute values of tags. </para>
342            <table xml:id="derived">
343                  <caption>
344                     <para>XML Source Data and Derived Parallel Bit Streams</para>
345                  </caption>
346                  <colgroup>
347                     <col align="centre" valign="top"/>
348                     <col align="left" valign="top"/>
349                  </colgroup>
350                  <tbody>
351          <tr><td> Source Data </td><td> <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
352          <tr><td> Tag Openers </td><td> <code>1____________1____________________________1____________1__________</code></td></tr>
353           <tr><td> Start Tag Marks </td><td> <code>_1____________1___________________________________________________</code></td></tr>
354           <tr><td> End Tag Marks </td><td> <code>___________________________________________1____________1_________</code></td></tr>
355           <tr><td> Empty Tag Marks </td><td> <code>__________________________________________________________________</code></td></tr>
356           <tr><td> Element Names </td><td> <code>_11111111_____1111111_____________________________________________</code></td></tr>
357           <tr><td> Attribute Names </td><td> <code>______________________11_______11_________________________________</code></td></tr>
358           <tr><td> Attribute Values </td><td> <code>__________________________111________111__________________________</code></td></tr>
359                  </tbody>
360               </table>         
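         <para> The first few rows of <xref linkend="derived"/> can be reproduced with a small
            Python model. It is a sketch under simplifying assumptions rather than the Parabix code:
            Python integers play the role of bit streams (bit <emphasis role="ital">i</emphasis>
            corresponds to character <emphasis role="ital">i</emphasis>), name characters are
            approximated by <code>str.isalnum</code>, and the helper names are hypothetical. The
            element-name spans are obtained with bitstream addition: adding the start tag marks to
            the name-character stream lets the carry run through each name, leaving a single bit
            just past its end. </para>
         <programlisting language="python"><![CDATA[
```python
def char_stream(text, predicate):
    """Bit i of the result is set when predicate(text[i]) holds."""
    stream = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            stream |= 1 << i
    return stream


def advance(stream):
    """Move every marker one position forward in the text."""
    return stream << 1


def scan_thru(markers, char_class):
    """Bitstream addition: move each marker past the run of char_class bits it sits on."""
    return (markers + char_class) & ~char_class


def derive(text):
    mask = (1 << len(text)) - 1
    tag_openers = char_stream(text, lambda c: c == "<")
    slash = char_stream(text, lambda c: c == "/")
    name_char = char_stream(text, str.isalnum)  # simplification of XML name characters

    after_opener = advance(tag_openers) & mask
    start_marks = after_opener & ~slash
    end_marks = after_opener & slash

    name_ends = scan_thru(start_marks, name_char)
    element_names = name_ends - start_marks  # all bits from each start up to its end
    return {
        "tag_openers": tag_openers,
        "start_tag_marks": start_marks,
        "end_tag_marks": end_marks,
        "element_names": element_names,
    }


def show(stream, length):
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(length))


source = "<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>"
for label, stream in derive(source).items():
    print(f"{label:16}", show(stream, len(source)))
```
]]></programlisting>
         <para> A production implementation processes fixed-size blocks and must carry the overflow
            bit of each addition into the next block; the unbounded integers used here hide that
            detail. </para>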
362         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
363            parsing performance. The first is that the use of the full register width offers a
364            considerable information advantage over sequential byte-at-a-time parsing. That is,
365            sequential processing of bytes uses just 8 bits of each register, greatly limiting the
366            processor resources that are effectively being used at any one time. The second is that
            byte-at-a-time scanning loops often compute just a single bit of
            information per iteration: is the scan complete yet? Rather than computing these
            decision bits individually, an approach that computes many of them in parallel (e.g., 128
370            bytes at a time using 128-bit registers) should provide substantial benefit. </para>
371         <para> Previous studies have shown that the Parabix approach improves many aspects of XML
372            processing, including transcoding <citation linkend="Cameron2008"/>, character classification and
373            validation, tag parsing and well-formedness checking. The first Parabix parser used
374            processor bit scan instructions to considerably accelerate sequential scanning loops for
375            individual characters <citation linkend="CameronHerdyLin2008"/>. Recent work has incorporated a method
376            of parallel scanning using bitstream addition <citation linkend="cameron-EuroPar2011"/>, as well as
377            combining SIMD methods with 4-stage pipeline parallelism to further improve throughput
378            <citation linkend="HPCA2012"/>. Although these research prototypes handled the full syntax of
379            schema-less XML documents, they lacked the functionality required by full XML parsers. </para>
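         <para> The difference between a sequential scanning loop and a bit scan over a marker
            stream can be illustrated as follows. This is a hedged Python sketch with hypothetical
            function names; a real implementation would apply a hardware bit scan instruction to a
            fixed-width register, which is modelled here with integer arithmetic. </para>
         <programlisting language="python"><![CDATA[
```python
def scan_bytewise(text, start, target):
    """Sequential scan: one comparison and one 'done yet?' decision per character."""
    pos = start
    while pos < len(text) and text[pos] != target:
        pos += 1
    return pos


def marker_stream(text, target):
    """Bit i is set when text[i] == target (computed once for the whole block)."""
    stream = 0
    for i, ch in enumerate(text):
        if ch == target:
            stream |= 1 << i
    return stream


def scan_bitwise(stream, start, limit):
    """Position of the first 1 bit at or after `start`, or `limit` if there is none."""
    remaining = stream >> start
    if remaining == 0:
        return limit
    lowest = remaining & -remaining          # isolate the lowest set bit
    return start + lowest.bit_length() - 1   # its index is the number of trailing zeros


text = "fee<element>fum</element>"
openers = marker_stream(text, "<")
first = scan_bitwise(openers, 0, len(text))
second = scan_bitwise(openers, first + 1, len(text))
print(first, second)
```
]]></programlisting>
         <para> Both functions return the same positions; the bitwise version replaces the
            per-character loop with a constant number of operations per marker found. </para>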
380         <para> Commercial XML processors support transcoding of multiple character sets and can
381            parse and validate against multiple document vocabularies. Additionally, they provide
382            API facilities beyond those found in research prototypes, including the widely used SAX,
383            SAX2 and DOM interfaces. </para>
384      </section>
385      <section>
386         <title>Sequential vs. Parallel Paradigm</title>
387         <para> Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents
388            sequentially. Each character is examined to distinguish between the XML-specific markup,
389            such as a left angle bracket <code>&lt;</code>, and the content held within the
390            document. As the parser progresses through the document, it alternates between markup
391            scanning, validation and content processing modes. </para>
392         <para> In other words, Xerces belongs to an equivalence class of applications termed FSM
393           applications<xref linkend="FSM"/>.<footnote xml:id="FSM"><para>Herein FSM applications are considered software systems whose
394            behaviour is defined by the inputs, current state and the events associated with
395              transitions of states.</para></footnote> Each state transition indicates the processing context of
396            subsequent characters. Unfortunately, textual data tends to be unpredictable and any
397            character could induce a state transition. </para>
398         <para> Parabix-style XML parsers utilize a concept of layered processing. A block of source
399            text is transformed into a set of lexical bitstreams, which undergo a series of
400            operations that can be grouped into logical layers, e.g., transposition, character
            classification, and lexical analysis. Each layer is pipeline parallel and requires
            neither speculation nor pre-parsing stages <citation linkend="HPCA2012"/>. To meet the API requirements
403            of the document-ordered Xerces output, the results of the Parabix processing layers must
404            be interleaved to produce the equivalent behaviour. </para>
405      </section>
406   </section>
407   <section xml:id="architecture">
408      <title>Architecture</title>
409      <section>
410         <title>Overview</title>
411         <!--\def \CSG{Content Stream Generator}-->
412         <para> icXML is more than an optimized version of Xerces. Many components were grouped,
413            restructured and rearchitected with pipeline parallelism in mind. In this section, we
            highlight the core differences between the two systems. As shown in
              <xref linkend="xerces-arch"/>, Xerces comprises five main modules: the transcoder, reader,
416            scanner, namespace binder, and validator. The <emphasis role="ital"
417            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
418            the majority of the character set encoding validation is performed as a byproduct of
419            this process. The <emphasis role="ital">Reader</emphasis> is responsible for the
420            streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current
421            line/column position,
422            <!--(which is reported in the unlikely event that the input contains an error), -->
423            performs line-break normalization and validates context-specific character set issues,
424            such as tokenization of qualified-names. The <emphasis role="ital">Scanner</emphasis>
425            pulls data through the reader and constructs the intermediate representation (IR) of the
426            document; it deals with all issues related to entity expansion, validates the XML
427            well-formedness constraints and any character set encoding issues that cannot be
428            completely handled by the reader or transcoder (e.g., surrogate characters, validation
429            and normalization of character references, etc.) The <emphasis role="ital">Namespace
430               Binder</emphasis> is a core piece of the element stack. It handles namespace scoping
431            issues between different XML vocabularies. This allows the scanner to properly select
432            the correct schema grammar structures. The <emphasis role="ital">Validator</emphasis>
433            takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder)
434            and assesses whether the final output matches the user-defined DTD and schema grammar(s)
435            before passing it to the end-user. </para>     
436        <figure xml:id="xerces-arch">
437          <title>Xerces Architecture</title>
438          <mediaobject>
439            <imageobject>
440              <imagedata format="png" fileref="xerces.png" width="150cm"/>
441            </imageobject>
442          </mediaobject>
443          <caption>
444          </caption>
445        </figure>
         <para> In icXML, functions are grouped into logical components. As shown in
447             <xref linkend="xerces-arch"/>, two major categories exist: (1) the Parabix Subsystem and (2) the
448               Markup Processor. All tasks in (1) use the Parabix Framework <citation linkend="HPCA2012"/>, which
449            represents data as a set of parallel bitstreams. The <emphasis role="ital">Character Set
450              Adapter</emphasis>, discussed in <xref linkend="character-set-adapter"/>, mirrors
451            Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of
452              lexical bitstreams, similar to those shown in <xref linkend="parabix1"/>. These lexical
453            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
454            additional processing is performed. The first precursor to producing UTF-16 is the
455               <emphasis role="ital">Parallel Markup Parser</emphasis> phase. It takes the lexical
456            streams and produces a set of marker bitstreams in which a 1-bit identifies significant
            positions within the input data. One bitstream is created for each critical piece of
            information, such as the beginning and ending of start tags, end tags,
459            element names, attribute names, attribute values and content. Intra-element
460            well-formedness validation is performed as an artifact of this process. Like Xerces,
461            icXML must provide the Line and Column position of each error. The <emphasis role="ital"
462               >Line-Column Tracker</emphasis> uses the lexical information to keep track of the
463            document position(s) through the use of an optimized population count algorithm,
464              described in <xref linkend="errorhandling"/>. From here, two data-independent
465            branches exist: the Symbol Resolver and Content Preparation Unit. </para>
466         <para> A typical XML file contains few unique element and attribute names&#8212;but
467            each of them will occur frequently. icXML stores these as distinct data structures,
468            called symbols, each with their own global identifier (GID). Using the symbol marker
469            streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol
470               Resolver</emphasis> scans through the raw data to produce a sequence of GIDs, called
471            the <emphasis role="ital">symbol stream</emphasis>. </para>
472         <para> The final components of the Parabix Subsystem are the <emphasis role="ital">Content
473               Preparation Unit</emphasis> and <emphasis role="ital">Content Stream
474            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
475            filters them, according to the information provided by the Parallel Markup Parser, and
476            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital">content stream</emphasis>, discussed in <xref linkend="contentstream"/>. </para>
         <para> Combined, the symbol and content streams form icXML's compressed IR of the XML
            document. The <emphasis role="ital">Markup Processor</emphasis> parses the IR to
479            validate and produce the sequential output for the end user. The <emphasis role="ital"
480               >Final WF checker</emphasis> performs inter-element well-formedness validation that
481            would be too costly to perform in bit space, such as ensuring every start tag has a
482            matching end tag. Xerces's namespace binding functionality is replaced by the <emphasis
483               role="ital">Namespace Processor</emphasis>. Unlike Xerces, it is a discrete phase
484            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
485               stream</emphasis>, which are associated with each symbol occurrence. This is
            discussed in <xref linkend="namespace-handling"/>. Finally, the <emphasis
               role="ital">Validation</emphasis> layer implements Xerces's validator. However,
            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
489        <figure xml:id="icxml-arch">
490          <title>icXML Architecture</title>
491          <mediaobject>
492            <imageobject>
493              <imagedata format="png" fileref="icxml.png" width="500cm"/>
494            </imageobject>
495          </mediaobject>
496          <caption>
497          </caption>
498        </figure>
499      </section>
500      <section xml:id="character-set-adapter">
501         <title>Character Set Adapters</title>
502         <para> In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of
503            Xerces itself and provide the end-consumer with a single encoding format. In the
504            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
505            because of the need to decode and classify each byte of input, mapping variable-length
506            UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In
507            other cases, transcoding may involve table look-up operations for each byte of input. In
508            any case, transcoding imposes at least a cost of buffer copying. </para>
509         <para> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
510            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
511            that input code units represent valid characters, mapping the characters of the encoding
512            into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical
513            item streams), as well as supporting ultimate transcoding requirements. All of this work
514            is performed using the parallel bitstream representation of the source input. </para>
515         <para> An important observation is that many character sets are an extension to the legacy
516            7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8,
517            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
518            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
519            serves to compute lexical item streams for all such ASCII-based character sets. </para>
520         <para> A second observation is that&#8212;regardless of which character set is
521            used&#8212;quite often all of the characters in a particular block of input will be
522            within the ASCII range. This is a very simple test to perform using the bitstream
523            representation, simply confirming that the bit 0 stream is zero for the entire block.
524            For blocks satisfying this test, all logic dealing with non-ASCII characters can simply
525            be skipped. Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
526            UTF-16 form are each set to zero in this case. </para>
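         <para> The ASCII-only test can be illustrated with a minimal Python sketch. It is not
            icXML code: the helper names <code>transpose</code> and <code>is_ascii_block</code>
            are hypothetical, a bitstream is modelled as a Python integer whose bit
            <emphasis role="ital">i</emphasis> corresponds to byte position
            <emphasis role="ital">i</emphasis> of the block, and the transposition is done byte by
            byte rather than with SIMD operations. Bit 0 denotes the most significant bit of each
            byte, following the numbering used in the text. </para>
         <programlisting><![CDATA[
```python
def transpose(block):
    """Return eight bitstreams; streams[k] collects bit k of every byte.

    Bit 0 is the most significant bit of a byte. Byte position i of the
    block is stored in bit i of each Python integer (a modelling choice).
    """
    streams = [0] * 8
    for pos, byte in enumerate(block):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


def is_ascii_block(streams):
    # Every byte is below 0x80 exactly when the bit 0 stream is all zero.
    return streams[0] == 0


def ascii_block_to_utf16(block):
    # In an all-ASCII block the high eight UTF-16 bitstreams are zero,
    # so each UTF-16 code unit simply equals the input byte.
    return list(block)


if __name__ == "__main__":
    block = b"<a>text</a>"
    if is_ascii_block(transpose(block)):
        print(ascii_block_to_utf16(block))
```
]]></programlisting>
         <para> In this sketch a single integer comparison decides whether the non-ASCII logic
            can be skipped for the whole block. </para>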
527         <para> A third observation is that repeated transcoding of the names of XML elements,
528            attributes and so on can be avoided by using a look-up mechanism. That is, the first
529            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
530            numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up
531            operations can avoid transcoding by simply retrieving the stored representation. As
            symbol look-up is required to apply various XML validation rules, this achieves the
            effect of transcoding each occurrence without additional cost. </para>
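         <para> The look-up mechanism might be sketched as follows. The class
            <code>SymbolTable</code>, its fields and the use of Python's codec machinery are
            assumptions made for illustration only; they do not describe icXML's internal data
            structures. The point shown is that transcoding happens once per distinct name,
            however often the name occurs. </para>
         <programlisting><![CDATA[
```python
class SymbolTable:
    """Maps raw (input-encoded) names to numeric IDs, transcoding each name once."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding
        self.ids = {}             # raw bytes -> symbol ID
        self.transcoded = []      # symbol ID -> stored UTF-16 form
        self.transcode_calls = 0  # counts how often transcoding really ran

    def resolve(self, raw_name):
        gid = self.ids.get(raw_name)
        if gid is None:
            # First occurrence: transcode now and remember the result.
            self.transcode_calls += 1
            gid = len(self.transcoded)
            self.ids[raw_name] = gid
            text = raw_name.decode(self.encoding)
            self.transcoded.append(text.encode("utf-16-le"))
        return gid


if __name__ == "__main__":
    table = SymbolTable()
    names = [b"book", b"title", b"book", b"title", b"price"]
    print([table.resolve(n) for n in names], table.transcode_calls)
```
]]></programlisting>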
534         <para> The cost of individual character transcoding is avoided whenever a block of input is
535            confined to the ASCII subset and for all but the first occurrence of any XML element or
536            attribute name. Furthermore, when transcoding is required, the parallel bitstream
537            representation supports efficient transcoding operations. In the important case of UTF-8
538            to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit
539              parallel fashion based on UTF-8 streams <citation linkend="Cameron2008"/>, and all but the final bytes
540            of multi-byte sequences can be marked for deletion as discussed in the following
541            subsection. In other cases, transcoding within a block only need be applied for
542            non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream
543            using bit scan operations. </para>
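         <para> For single-byte encodings, the bit-scan iteration mentioned above could look
            roughly like the following sketch. The function names are hypothetical, ISO-8859-15
            is chosen only as an example encoding, Python's codec stands in for a transcoding
            table, and a bitstream is again modelled as an integer whose bit
            <emphasis role="ital">i</emphasis> corresponds to byte position
            <emphasis role="ital">i</emphasis>. </para>
         <programlisting><![CDATA[
```python
def set_positions(stream):
    """Yield the positions of the 1-bits of a bitstream, lowest first."""
    while stream:
        low = stream & -stream        # isolate the lowest set bit
        yield low.bit_length() - 1    # its index, as a bit-scan-forward would give
        stream ^= low                 # clear it and continue


def transcode_block(block, encoding="iso8859_15"):
    """Transcode a single-byte-encoded block to a list of UTF-16 code units."""
    units = list(block)               # ASCII bytes map to themselves
    high_bit = 0                      # stand-in for the bit 0 stream
    for pos, byte in enumerate(block):
        if byte & 0x80:
            high_bit |= 1 << pos
    # Only the non-ASCII positions are visited.
    for pos in set_positions(high_bit):
        units[pos] = ord(bytes([block[pos]]).decode(encoding))
    return units


if __name__ == "__main__":
    print([hex(u) for u in transcode_block(b"a\xa4b")])
```
]]></programlisting>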
544      </section>
545      <section xml:id="par-filter">
546         <title>Combined Parallel Filtering</title>
547         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
548            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
549            Chinese characters <code>&#x4F60;&#x597D;</code> are represented as two
550            three-byte UTF-8 sequences <code>E4 BD A0</code> and <code>E5 A5 BD</code> while the
551            UTF-16 representation must be compressed down to the two code units <code>4F60</code>
552            and <code>597D</code>. In the bit parallel representation, this corresponds to a
553            reduction from six bit positions representing UTF-8 code units (bytes) down to just two
554            bit positions representing UTF-16 code units (double bytes). This compression may be
555            achieved by arranging to calculate the correct UTF-16 bits at the final position of each
556            sequence and creating a deletion mask to mark the first two bytes of each 3-byte
557            sequence for deletion. In this case, the portion of the mask corresponding to these
558            input bytes is the bit sequence <code>110110</code>. Using this approach, transcoding
559            may then be completed by applying parallel deletion and inverse transposition of the
            UTF-16 bitstreams <citation linkend="Cameron2008"/>. </para>
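         <para> The deletion mask of this example can be reproduced with a short Python sketch.
            The function name is hypothetical and the input is assumed to be valid UTF-8; the
            sketch relies on the observation that a byte is not the last byte of its sequence
            exactly when the following byte is a continuation byte of the form
            <code>10xxxxxx</code>. </para>
         <programlisting><![CDATA[
```python
def utf8_deletion_mask(data):
    """Return one '0'/'1' character per input byte; '1' marks a byte to delete."""
    # Mark continuation bytes (10xxxxxx); bit i stands for byte position i.
    cont = 0
    for pos, byte in enumerate(data):
        if (byte & 0xC0) == 0x80:
            cont |= 1 << pos
    # A byte is deleted when the byte after it is a continuation byte,
    # so the mask is the continuation stream moved back by one position.
    delmask = cont >> 1
    return "".join("1" if (delmask >> pos) & 1 else "0"
                   for pos in range(len(data)))


if __name__ == "__main__":
    print(utf8_deletion_mask("\u4f60\u597d".encode("utf-8")))
```
]]></programlisting>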
561         <para> Rather than immediately paying the costs of deletion and transposition just for
562            transcoding, however, icXML defers these steps so that the deletion masks for several
563            stages of processing may be combined. In particular, this includes core XML requirements
            to normalize line breaks and to replace character and entity references by
565            their corresponding text. In the case of line break normalization, all forms of line
566            breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations
567            must be normalized to a single LF character in each case. In icXML, this is achieved by
568            first marking CR positions, performing two bit parallel operations to transform the
569            marked CRs into LFs, and then marking for deletion any LF that is found immediately
            after the marked CR, as shown by the Pablo source code in
            <xref linkend="fig-LBnormalization"/>.
            <figure xml:id="fig-LBnormalization">
               <caption>Line Break Normalization Logic</caption>
               <programlisting>
# XML 1.0 line-break normalization rules.
if lex.CR:
  # Modify CR (#x0D) to LF (#x0A)
  u16lo.bit_5 ^= lex.CR
  u16lo.bit_6 ^= lex.CR
  u16lo.bit_7 ^= lex.CR
  CRLF = pablo.Advance(lex.CR) &amp; lex.LF
  callouts.delmask |= CRLF
  # Adjust LF streams for line/column tracker
  lex.LF |= lex.CR
  lex.LF ^= CRLF
</programlisting>
            </figure>
         </para>
589         <para> In essence, the deletion masks for transcoding and for line break normalization each
590            represent a bitwise filter; these filters can be combined using bitwise-or so that the
591            parallel deletion algorithm need only be applied once. </para>
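         <para> The effect of the line break normalization logic can be simulated in plain
            Python. The sketch below is an illustration under stated assumptions rather than a
            translation of icXML: bitstreams are modelled as integers whose bit
            <emphasis role="ital">i</emphasis> corresponds to input position
            <emphasis role="ital">i</emphasis> (so advancing a stream is a left shift by one),
            the function names are hypothetical, and the deletion is applied immediately instead
            of being deferred and combined with other masks. </para>
         <programlisting><![CDATA[
```python
CR, LF = 0x0D, 0x0A


def marker(data, value):
    """Bitstream with a 1 at every position holding the given byte value."""
    stream = 0
    for pos, unit in enumerate(data):
        if unit == value:
            stream |= 1 << pos
    return stream


def normalize_line_breaks(data):
    cr = marker(data, CR)
    lf = marker(data, LF)
    units = list(data)
    # CR (0x0D) xor 0x07 gives LF (0x0A): flip the three low bits at CR positions.
    for pos in range(len(units)):
        if (cr >> pos) & 1:
            units[pos] ^= 0x07
    crlf = (cr << 1) & lf          # an LF found immediately after a CR
    delmask = crlf                 # these LFs are marked for deletion
    lf_out = (lf | cr) ^ crlf      # adjusted LF stream for line counting
    kept = bytes(u for pos, u in enumerate(units)
                 if not ((delmask >> pos) & 1))
    return kept, delmask, lf_out


if __name__ == "__main__":
    print(normalize_line_breaks(b"a\r\nb\rc\nd"))
```
]]></programlisting>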
         <para> A further application of combined filtering is the processing of XML character and
            entity references. Consider, for example, the references <code>&amp;amp;</code> or
            <code>&amp;#x3C;</code>, which must be replaced in XML processing with the single
            <code>&amp;</code> and <code>&lt;</code> characters, respectively. The
            approach in icXML is to mark all but the first character position of each reference for
            deletion, leaving a single character position unmodified. Thus, for the references
            <code>&amp;amp;</code> or <code>&amp;#x3C;</code>, the masks <code>01111</code> and
            <code>011111</code> are formed and combined into the overall deletion mask. After the
            deletion and inverse transposition operations are finally applied, a post-processing
            step inserts the proper character at these positions. One note about this process is
            that it is speculative; references are assumed to generally be replaced by a single
            UTF-16 code unit. Cases in which this is not true are addressed in
            post-processing. </para>
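         <para> The reference masks can be illustrated with the following Python sketch. It uses
            a regular expression where icXML uses bitstream calculations, the function name is
            hypothetical, and the pattern for entity names is deliberately simplified; it serves
            only to show which positions are marked. </para>
         <programlisting><![CDATA[
```python
import re

# Simplified pattern for decimal/hex character references and entity references.
REFERENCE = re.compile(r"&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][\w:.-]*);")


def reference_deletion_mask(text):
    """Mark all but the first character of every reference with '1'."""
    mask = ["0"] * len(text)
    for match in REFERENCE.finditer(text):
        for pos in range(match.start() + 1, match.end()):
            mask[pos] = "1"
    return "".join(mask)


if __name__ == "__main__":
    print(reference_deletion_mask("&amp;"))
    print(reference_deletion_mask("&#x3C;"))
```
]]></programlisting>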
605         <para> The final step of combined filtering occurs during the process of reducing markup
606            data to tag bytes preceding each significant XML transition as described in
607              <xref linkend="contentstream"/>. Overall, icXML avoids separate buffer copying
            operations for each of these filtering steps, paying the cost of parallel deletion
            and inverse transposition only once. Currently, icXML employs the parallel-prefix
            compress algorithm of Steele <citation linkend="HackersDelight"/>. Performance is independent of the
            number of positions deleted. Future versions of icXML are expected to take advantage of
            the parallel extract operation <citation linkend="HilewitzLee2006"/> that Intel is now providing in its
613            Haswell architecture. </para>
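         <para> For readers unfamiliar with it, a 32-bit Python port of the parallel-prefix
            compress routine presented in <citation linkend="HackersDelight"/> is sketched below,
            next to a bit-by-bit reference version. This is an illustration only: icXML works on
            SIMD registers, and the conventions that the mask selects the bits to
            <emphasis role="ital">keep</emphasis> (the complement of a deletion mask) and that
            kept bits gather at the low-order end are assumptions of the sketch. Note that the
            loop always runs five rounds, regardless of how many positions are deleted. </para>
         <programlisting><![CDATA[
```python
MASK32 = 0xFFFFFFFF


def compress(x, m):
    """Gather the bits of x selected by m at the low-order end (32-bit words)."""
    x &= m
    mk = (~m << 1) & MASK32              # marks the zeros to the right of each bit
    for i in range(5):                   # log2(32) rounds
        mp = mk ^ ((mk << 1) & MASK32)   # parallel prefix (xor scan)
        mp ^= (mp << 2) & MASK32
        mp ^= (mp << 4) & MASK32
        mp ^= (mp << 8) & MASK32
        mp ^= (mp << 16) & MASK32
        mv = mp & m                      # bits that move in this round
        m = (m ^ mv) | (mv >> (1 << i))  # compress the mask itself
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))    # compress the data the same way
        mk &= ~mp
    return x


def compress_reference(x, m):
    """Straightforward bit-by-bit version used to check the fast one."""
    out, k = 0, 0
    for pos in range(32):
        if (m >> pos) & 1:
            out |= ((x >> pos) & 1) << k
            k += 1
    return out


if __name__ == "__main__":
    delmask = 0b011011                   # positions 0, 1, 3 and 4 are deleted
    keep = ~delmask & 0b111111
    print(bin(compress(0b100100, keep)))
```
]]></programlisting>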
614      </section>
615      <section xml:id="contentstream">
616         <title>Content Stream</title>
         <para> A relatively unique concept for icXML is the use of a filtered content stream.
            Rather than parsing an XML document in its original format, the input is transformed
            into one that is easier for the parser to iterate through and produce the sequential
            output. <!-- FIGURE REF Figure~\ref{fig:parabix2} --> The source data
            <!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
            is transformed into the content stream
            <emphasis role="ital">0</emphasis><code>&gt;fee</code><emphasis role="ital">0</emphasis><code>=fie</code><emphasis role="ital">0</emphasis><code>=foe</code><emphasis role="ital">0</emphasis><code>&gt;</code><emphasis role="ital">0</emphasis><code>/fum</code><emphasis role="ital">0</emphasis><code>/</code>
            through the parallel filtering algorithm described in <xref linkend="par-filter"/>. </para>
         <para> Using the content stream together with the symbol stream, the parser
            effectively reconstructs the input document in its output form. The initial <emphasis
               role="ital">0</emphasis> indicates an empty content string. The following
628               <code>&gt;</code> indicates that a start tag without any attributes is the first
629            element in this text and the first unused symbol, <code>document</code>, is the element
630            name. Succeeding that is the content string <code>fee</code>, which is null-terminated
631            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
632            operations are required to produce these strings, which as
633              <xref linkend="xerces-profile"/> shows accounts for 6.83% of Xerces's execution time.
634            Additionally, it is cheap to locate the terminal character of each string: using the
635            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
636            null character in the content stream in parallel, which in turn means the parser can
637            directly jump to the end of every string without scanning for it. </para>
638         <para> Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the
            existence of an attribute. Because all of the intra-element validation was performed in the Parabix
640            Subsystem, this must be a legal attribute. Since attributes can only occur within start
641            tags and must be accompanied by a textual value, the next symbol in the symbol stream
642            must be the element name of a start tag, and the following one must be the name of the
643            attribute and the string that follows the <code>=</code> must be its value. However, the
644            subsequent <code>=</code> is not treated as an independent attribute because the parser
645            has yet to read a <code>&gt;</code>, which marks the end of a start tag. Thus only
646            one symbol is taken from the symbol stream and it (along with the string value) is added
647            to the element. Eventually the parser reaches a <code>/</code>, which marks the
648            existence of an end tag. Every end tag requires an element name, which means they
            require a symbol. Inter-element validation is performed whenever an empty tag is
            detected to ensure that the appropriate scope-nesting rules have been applied. </para>
651      </section>
652      <section xml:id="namespace-handling">
653         <title>Namespace Handling</title>
654         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
         <para> In XML, namespaces prevent naming conflicts when multiple vocabularies are used
            together. They are especially important when a vocabulary has application-dependent meaning,
657            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
658            to uniform resource identifiers (URIs), which are strings used to identify specific
659            names or resources. On line 1 in <xref linkend="namespace-ex"/>, the <code>xmlns</code>
660            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
661               &apos;<code></code>&apos; and the default (empty) prefix to
662               <code></code>. Thus to the XML processor, the <code>title</code> on line 2
663            and <code>price</code> on line 4 both read as
664            <code>&quot;;:title</code> and
665               <code>&quot;;:price</code> respectively, whereas on line 3 and
666            5, <code>p:name</code> and <code>price</code> are seen as
667               <code>&quot;;:name</code> and
               <code>&quot;;:price</code>. Even though the actual element name is
               <code>price</code> in both cases, due to namespace scoping rules they are viewed as two
670            uniquely-named items because the current vocabulary is determined by the namespace(s)
671            that are in-scope. </para>
672<table xml:id="namespace-ex">
673                  <caption>
674                     <para>XML Namespace Example</para>
675                  </caption>
676                  <colgroup>
677                     <col align="centre" valign="top"/>
678                     <col align="left" valign="top"/>
679                  </colgroup>
680                  <tbody>
681 <tr><td>1. </td><td><![CDATA[<book xmlns:p="" xmlns="">]]> </td></tr>
682 <tr><td>2. </td><td><![CDATA[  <title>BOOK NAME</title>]]> </td></tr>
683 <tr><td>3. </td><td><![CDATA[  <p:name>PUBLISHER NAME</p:name>]]> </td></tr>
684 <tr><td>4. </td><td><![CDATA[  <price>X</price>]]> </td></tr>
685 <tr><td>5. </td><td><![CDATA[  <price xmlns="">Y</price>]]> </td></tr>
686 <tr><td>6. </td><td><![CDATA[</book>]]> </td></tr>
687                  </tbody>
688               </table>         
690         <para> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
691            persist for the lifetime of the application through the use of a global URI pool. Xerces
692            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
693            (end tag) occurs in the document. Because a namespace declaration affects the entire
694            element, it must be processed prior to grammar validation. This is a costly process
695            considering that a typical namespaced XML document only comes in one of two forms: (1)
696            those that declare a set of namespaces upfront and never change them, and (2) those that
697            repeatedly modify the namespaces in predictable patterns. </para>
         <para> For that reason, icXML contains an independent namespace stack and utilizes bit
            vectors to cheaply perform scope resolution.
            <!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
            <!-- performance advantage figure?? average cycles/byte cost? --> When a prefix is
            declared (e.g., <code>xmlns:p=&quot;;</code>), a namespace binding
            is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution
            process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and
704            every prefix contains a bit vector marking every NSID that has ever been associated with
705              it within the document. For example, in <xref linkend="namespace-ex"/>, the prefix binding
706            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
707            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
708            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
709            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
710            found using a bit-scan intrinsic. A namespace binding table, similar to
711            <xref linkend="namespace-binding"/>, provides the actual URI ID. </para>
712<table xml:id="namespace-binding">
713                  <caption>
714                     <para>Namespace Binding Table Example</para>
715                  </caption>
716                  <colgroup>
717                     <col align="centre" valign="top"/>
718                     <col align="centre" valign="top"/>
719                     <col align="centre" valign="top"/>
720                     <col align="centre" valign="top"/>
721                     <col align="centre" valign="top"/>
722                   </colgroup>
723                   <thead>
724                     <tr><th>NSID </th><th> Prefix </th><th> URI </th><th> Prefix ID </th><th> URI ID </th>
725                     </tr>
726                   </thead>
727                  <tbody>
728<tr><td>0 </td><td> <code> p</code> </td><td> <code></code> </td><td> 0 </td><td> 0 </td></tr> 
729 <tr><td>1 </td><td> <code> xmlns</code> </td><td> <code></code> </td><td> 1 </td><td> 1 </td></tr> 
730 <tr><td>2 </td><td> <code> xmlns</code> </td><td> <code></code> </td><td> 1 </td><td> 0 </td></tr> 
731                  </tbody>
732               </table>         
733         <para>
734            <!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
735            <!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
736            <!-- NSid = bitscan(VisiblePrefixBinding); -->
737            <!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
738         </para>
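         <para> The resolution step described above can be sketched in Python as follows. The
            table contents mirror the NSID, Prefix ID and URI ID columns of
            <xref linkend="namespace-binding"/>; the names <code>BINDING_TABLE</code>,
            <code>PREFIX_BINDINGS</code> and <code>resolve_uri_id</code> are hypothetical, and
            the layout in which bit <emphasis role="ital">n</emphasis> of a vector stands for
            NSID <emphasis role="ital">n</emphasis> is an assumption of this sketch. </para>
         <programlisting><![CDATA[
```python
def bit_scan(vector):
    """Index of the lowest set bit (vector must be non-zero)."""
    return (vector & -vector).bit_length() - 1


# Binding table indexed by NSID: (prefix ID, URI ID).
BINDING_TABLE = [(0, 0), (1, 1), (1, 0)]

# Prefix ID -> bit vector of every NSID ever bound to that prefix.
# Bit n stands for NSID n (assumed layout).
PREFIX_BINDINGS = {0: 0b001, 1: 0b110}


def resolve_uri_id(prefix_id, visible):
    """Find the URI ID of the binding of prefix_id that is currently in scope."""
    in_scope = PREFIX_BINDINGS[prefix_id] & visible
    nsid = bit_scan(in_scope)
    return BINDING_TABLE[nsid][1]


if __name__ == "__main__":
    visible = 0b011                      # NSIDs 0 and 1 are in scope
    print(resolve_uri_id(0, visible), resolve_uri_id(1, visible))
```
]]></programlisting>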
739         <para> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
740            any modification to the currently visible namespaces is calculated and stored within a
741            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
            is found, the vector of currently visible namespaces is XORed with the vector at the top of the
            stack. This allows any number of changes to be performed at each scope level in
            constant time.
745            <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
746         </para>
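         <para> A minimal Python sketch of this scope handling is given below. The class and
            method names are hypothetical, and the set of visible namespaces after each start tag
            is passed in directly instead of being derived from namespace declarations; the
            sketch only shows how one XOR per end tag restores the previous scope. </para>
         <programlisting><![CDATA[
```python
class NamespaceScopes:
    """Tracks the currently visible NSIDs as a bit vector (bit n = NSID n)."""

    def __init__(self):
        self.visible = 0
        self.stack = []   # one change vector per open element

    def start_tag(self, new_visible):
        # Record which bits this element changes, then apply the change.
        self.stack.append(self.visible ^ new_visible)
        self.visible = new_visible

    def end_tag(self):
        # A single XOR undoes every change made by the matching start tag.
        self.visible ^= self.stack.pop()


if __name__ == "__main__":
    scopes = NamespaceScopes()
    scopes.start_tag(0b011)   # outer element declares NSIDs 0 and 1
    scopes.start_tag(0b101)   # inner element rebinds: NSID 2 replaces NSID 1
    scopes.end_tag()
    print(bin(scopes.visible))
```
]]></programlisting>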
747      </section>
748      <section xml:id="errorhandling">
749         <title>Error Handling</title>
750         <para>
751            <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
752            Xerces outputs error messages in two ways: through the programmer API and as thrown
            objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to
            assess whether the next character is legal; if not, the current state determines the
            type and severity of the error. icXML emits errors in a similar manner, but
            how it discovers them is substantially different. Recall that in
757            <xref linkend="icxml-arch"/>, icXML is divided into two sections: the Parabix Subsystem and
758            Markup Processor, each with its own system for detecting and producing error messages. </para>
759         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
760            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
761            the byte-position of an error within a block, and the type of error is determined by the
762            equation that discovered it. The difficulty of error processing in this section is that
763            in Xerces the line and column number must be given with every error production. Two
            major issues exist because of this: (1) line position adheres to XML white-space normalization
765            rules; as such, some sequences of characters, e.g., a carriage return followed by a line
766            feed, are counted as a single new line character. (2) column position is counted in
767            characters, not bytes or code units; thus multi-code-unit code-points and surrogate
768            character pairs are all counted as a single column position. Note that typical XML
769            documents are error-free but the calculation of the line/column position is a constant
770            overhead in Xerces. <!-- that must be maintained in the case that one occurs. --> To
771            reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence
772            of the error and performs the minimal amount of book-keeping necessary to facilitate it.
773            icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates
774            the information within the Line Column Tracker (LCT). One of the CSA's major
775            responsibilities is transcoding an input text.
776            <!-- from some encoding format to near-output-ready UTF-16. --> During this process,
777            white-space normalization rules are applied and multi-code-unit and surrogate characters
778            are detected and validated. A <emphasis role="ital">line-feed bitstream</emphasis>,
            which marks the positions of the normalized new line characters, is a natural
780            derivative of this process. Using an optimized population count algorithm, the line
781            count can be summarized cheaply for each valid block of text.
782            <!-- The optimization delays the counting process .... --> Column position is more
            difficult to calculate. It is possible to scan backwards through the bitstream of new
            line characters to determine the distance (in code units) between the position at
            which an error was detected and the last line feed. However, this distance may exceed
            the actual character position for the reasons discussed in (2). To handle this, the
787            CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together
788            many relevant bitstreams, such as all trailing multi-code-unit and surrogate characters,
789            and any characters that were removed during the normalization process. When an error is
790            detected, the sum of those skipped positions is subtracted from the distance to
791            determine the actual column number. </para>
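         <para> The line and column calculation can be illustrated with a small Python sketch.
            It assumes a single block that starts at the beginning of the document, models each
            bitstream as an integer whose bit <emphasis role="ital">i</emphasis> corresponds to
            code-unit position <emphasis role="ital">i</emphasis>, and uses hypothetical function
            names; the optimized population count of icXML is replaced by a simple one. </para>
         <programlisting><![CDATA[
```python
def popcount(x):
    return bin(x).count("1")


def line_and_column(lf_stream, skip_mask, error_pos):
    """Return the 1-based (line, column) of the code unit at error_pos."""
    before = (1 << error_pos) - 1              # positions strictly before the error
    lfs_before = lf_stream & before
    line = 1 + popcount(lfs_before)
    line_start = lfs_before.bit_length()       # just after the last LF, or 0
    span = before & ~((1 << line_start) - 1)   # current line, up to the error
    # Distance in code units minus the skipped (non-counting) positions.
    column = 1 + popcount(span) - popcount(skip_mask & span)
    return line, column


if __name__ == "__main__":
    # Input "ab\n" + U+4F60 + "x" in UTF-8: a b LF E4 BD A0 x
    lf_stream = 0b0000100    # LF at position 2
    skip_mask = 0b0110000    # trailing UTF-8 bytes at positions 4 and 5
    print(line_and_column(lf_stream, skip_mask, 6))
```
]]></programlisting>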
792         <para> The Markup Processor is a state-driven machine. As such, error detection within it
793            is very similar to Xerces. However, reporting the correct line/column is a much more
794            difficult problem. The Markup Processor parses the content stream, which is a series of
795            tagged UTF-16 strings. Each string is normalized in accordance with the XML
796            specification. All symbol data and unnecessary whitespace is eliminated from the stream;
            thus it is impossible to derive the current location using only the content stream. To
798            calculate the location, the Markup Processor borrows three additional pieces of
799            information from the Parabix Subsystem: the line-feed, skip mask, and a <emphasis
800               role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the
801            (code-unit) position of every datum that was suppressed from the source during the
802            production of the content stream. Armed with these, it is possible to calculate the
803            actual line/column using the same system as the Parabix Subsystem until the sum of the
804            negated deletion mask stream is equal to the current position. </para>
805      </section>
806   </section>
808   <section xml:id="multithread">
809      <title>Multithreading with Pipeline Parallelism</title>
810      <para> As discussed in section <xref linkend="background-xerces"/>, Xerces can be considered a FSM
         application. Such applications are &quot;embarrassingly
         sequential&quot; <citation linkend="Asanovic-EECS-2006-183"/> and notoriously difficult to
813         parallelize. However, icXML is designed to organize processing into logical layers. In
814         particular, layers within the Parabix Subsystem are designed to operate over significant
815         segments of input data before passing their outputs on for subsequent processing. This fits
816         well into the general model of pipeline parallelism, in which each thread is in charge of a
817         single module or group of modules. </para>
818      <para> The most straightforward division of work in icXML is to separate the Parabix Subsystem
         and the Markup Processor into two separate stages. The
         resultant application, <emphasis role="ital">icXML-p</emphasis>, is a coarse-grained
821         software-pipeline application. In this case, the Parabix Subsystem thread
822               <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a
823         time and produces the content, symbol and URI streams, then stores them in a pre-allocated
824         shared data structure <code>S</code>. The Markup Processor thread
825            <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness
         and grammar-based validation, and then provides parsed XML data to the application through
827         the Xerces API. The shared data structure is implemented using a ring buffer, where every
828         entry contains an independent set of data streams. In the examples of
829           <xref linkend="threads_timeline1"/>, the ring buffer has four entries. A
         lock-free mechanism is applied to ensure that each entry can only be read or written by one
         thread at a time. In <xref linkend="threads_timeline1"/>, the processing time of
832               <code>T<subscript>1</subscript></code> is longer than
833         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
834         waits for <code>T<subscript>1</subscript></code> to write to the shared memory. 
835         <xref linkend="threads_timeline2"/> illustrates the scenario in which
836         <code>T<subscript>1</subscript></code> is faster and must wait for
837            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
838         reuse the memory space. </para>
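      <para> The two-stage arrangement can be sketched in Python as follows. This is a simplified
         model and not the icXML-p implementation: blocking semaphores stand in for the lock-free
         mechanism, upper-casing a segment stands in for the work of the Parabix Subsystem,
         appending to a list stands in for the Markup Processor, and all names are hypothetical.
         What the sketch does show is a four-entry ring in which each entry is owned by exactly
         one thread at a time and in which either stage waits when the other falls behind. </para>
      <programlisting><![CDATA[
```python
import threading

RING_ENTRIES = 4


class Ring:
    def __init__(self, entries=RING_ENTRIES):
        self.slots = [None] * entries
        self.free = threading.Semaphore(entries)  # entries stage 1 may write
        self.full = threading.Semaphore(0)        # entries stage 2 may read


def stage1(ring, segments):
    """Producer: fills ring entries with processed segments, then an end marker."""
    for n, segment in enumerate(segments):
        ring.free.acquire()                       # wait for a reusable entry
        ring.slots[n % len(ring.slots)] = segment.upper()
        ring.full.release()
    ring.free.acquire()
    ring.slots[len(segments) % len(ring.slots)] = None   # end-of-input marker
    ring.full.release()


def stage2(ring, out):
    """Consumer: reads entries in order until the end marker is seen."""
    n = 0
    while True:
        ring.full.acquire()                       # wait for a filled entry
        item = ring.slots[n % len(ring.slots)]
        ring.free.release()                       # entry may now be reused
        if item is None:
            break
        out.append(item)
        n += 1


def run_pipeline(segments):
    ring, out = Ring(), []
    t1 = threading.Thread(target=stage1, args=(ring, segments))
    t2 = threading.Thread(target=stage2, args=(ring, out))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return out


if __name__ == "__main__":
    print(run_pipeline([b"seg%d" % i for i in range(10)]))
```
]]></programlisting>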
839      <para>
840        <figure xml:id="threads_timeline1">
841          <title>Thread Balance in Two-Stage Pipelines: Stage 1 Dominant</title>
842          <mediaobject>
843            <imageobject>
844              <imagedata format="png" fileref="threads_timeline1.png" width="500cm"/>
845            </imageobject>
846          </mediaobject>
847         </figure>
848        <figure xml:id="threads_timeline2">
849          <title>Thread Balance in Two-Stage Pipelines: Stage 2 Dominant</title>
850        <mediaobject>
851            <imageobject>
852              <imagedata format="png" fileref="threads_timeline2.png" width="500cm"/>
853            </imageobject>
854          </mediaobject>
855        </figure>
856      </para>
      <para> Overall, our design is intended to benefit a range of applications. Conceptually, we
         consider two design points. In the first, the parsing performed by the Parabix Subsystem
         dominates at 67% of the overall cost, with the cost of application processing (including
         the driver logic within the Markup Processor) at 33%. The second is almost the opposite
         scenario: the cost of application processing dominates at 60%, while the cost of XML
         parsing represents an overhead of 40%. </para>
      <para> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
         100% improvement in the parsing engine itself. In the best-case scenario, we assume a 100%
         improvement of the Parabix Subsystem at the design point in which XML parsing accounts for
         67% of the total application cost. In this case, the single-threaded icXML should achieve
         a 1.5x speedup over Xerces, so that the total application cost reduces to 67% of the
         original. However, in icXML-p, our ideal scenario gives us two well-balanced threads, each
         performing about 33% of the original work. In this case, Amdahl's law predicts that we
         could expect up to a 3x speedup at best. </para>
871      <para> At the other extreme of our design range, we consider an application in which core
872         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
873         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
874         significant aspect of our two-stage multi-threaded design then becomes the ability to hide
         the entire latency of parsing within the serial time required by the application. In this
         case, we achieve an overall speedup in processing time of 1.67x, because the run time
         falls to the 60% of the original cost that the application itself requires. </para>
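      <para> The arithmetic behind the two design points can be checked with a small model. This
         is a sketch under idealized assumptions: the two pipeline stages overlap perfectly,
         synchronization costs nothing, and the driver logic of the Markup Processor is counted
         with the application share. The function names are ours and purely illustrative. </para>
      <programlisting><![CDATA[
```python
def single_thread_speedup(parse_share: float, parse_speedup: float) -> float:
    """Speedup when only the parsing share of the run time is accelerated."""
    app_share = 1.0 - parse_share
    return 1.0 / (parse_share / parse_speedup + app_share)


def pipelined_speedup(parse_share: float, parse_speedup: float) -> float:
    """Speedup when accelerated parsing runs concurrently with the application.

    With perfect overlap, the run time is set by the slower of the two stages.
    """
    app_share = 1.0 - parse_share
    return 1.0 / max(parse_share / parse_speedup, app_share)


# Design point 1: parsing is 67% of the cost and is accelerated by 2x.
dp1_single = single_thread_speedup(0.67, 2.0)
dp1_pipelined = pipelined_speedup(0.67, 2.0)

# Design point 2: parsing is 40% of the cost and is accelerated by 2x.
dp2_single = single_thread_speedup(0.40, 2.0)
dp2_pipelined = pipelined_speedup(0.40, 2.0)
```
]]></programlisting>
      <para> Under these assumptions the model gives roughly 1.5x and 3x for the first design
         point, and 1.25x and 1.67x for the second, matching the figures quoted above. </para>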
877      <para> Although the structure of the Parabix Subsystem allows division of the work into
878         several pipeline stages and has been demonstrated to be effective for four pipeline stages
879         in a research prototype <citation linkend="HPCA2012"/>, our analysis here suggests that the further
880         pipelining of work within the Parabix Subsystem is not worthwhile if the cost of
         application logic is as little as 33% of the end-to-end cost using Xerces. To achieve benefits
882         of further parallelization with multi-core technology, there would need to be reductions in
883         the cost of application logic that could match reductions in core parsing cost. </para>
884   </section>
886   <section xml:id="performance">
887      <title>Performance</title>
      <para> We evaluate Xerces-C++ 3.1.1, icXML, and icXML-p on two benchmarking applications: the
         Xerces C++ SAXCount sample application, and a real-world GML to SVG transformation
890         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
891         Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core)
892         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
893         12.04 (Linux). </para>
894      <para> We analyzed the execution profiles of each XML parser using the performance counters
895         found in the processor. We chose several key hardware events that provide insight into the
         profile of each application and indicate if the processor is doing useful work. The
         events included in our study are processor cycles, branch instructions, branch
898         mispredictions, and cache misses. The Performance Application Programming Interface (PAPI)
899         Version 5.5.0 <citation linkend="papi"/> toolkit was installed on the test system to facilitate the
900         collection of hardware performance monitoring statistics. In addition, we used the Linux
901         perf <citation linkend="perf"/> utility to collect per core hardware events. </para>
902      <section>
903         <title>Xerces C++ SAXCount</title>
904         <para> Xerces comes with sample applications that demonstrate salient features of the
905            parser. SAXCount is the simplest such application: it counts the elements, attributes
906            and characters of a given XML file using the (event based) SAX API and prints out the
907            totals. </para>
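         <para> For readers unfamiliar with SAXCount, the following sketch shows the same kind of
            event-driven counting. It is an analogue written against Python's standard
            <code>xml.sax</code> module, not the Xerces-C++ sample itself, and the class and
            function names are our own. </para>
         <programlisting><![CDATA[
```python
import xml.sax


class CountHandler(xml.sax.ContentHandler):
    """Counts elements, attributes and characters reported by SAX events."""

    def __init__(self):
        super().__init__()
        self.element_count = 0
        self.attribute_count = 0
        self.character_count = 0

    def startElement(self, name, attrs):
        # One event per start tag; attrs holds that element's attributes.
        self.element_count += 1
        self.attribute_count += len(attrs)

    def characters(self, content):
        # Character data may arrive in several pieces; sum their lengths.
        self.character_count += len(content)


def sax_count(xml_bytes: bytes) -> CountHandler:
    """Parse a document held in memory and return the populated handler."""
    handler = CountHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler
```
]]></programlisting>
         <para> Because the application does almost no work per event, a benchmark of this shape
            mostly measures the parser itself. </para>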
909 <para> <xref linkend="XMLdocs"/> shows the document characteristics of the XML input files
            selected for the Xerces C++ SAXCount benchmark. The jaw.xml file represents document-oriented
            XML inputs and contains the three-byte and four-byte UTF-8 sequences required for the
            encoding of Japanese characters. The remaining data files are data-oriented XML
            documents and consist entirely of single-byte ASCII characters.
914  <table xml:id="XMLdocs">
915                  <caption>
916                     <para>XML Document Characteristics</para>
917                  </caption>
918                  <colgroup>
919                     <col align="left" valign="top"/>
920                     <col align="centre" valign="top"/>
921                     <col align="centre" valign="top"/>
922                     <col align="centre" valign="top"/>
923                     <col align="centre" valign="top"/>
924                  </colgroup>
925                  <tbody>
926 <tr><td>File Name              </td><td> jaw.xml               </td><td> road.gml      </td><td> po.xml        </td><td> soap.xml </td></tr> 
927<tr><td>File Type               </td><td> document              </td><td> data          </td><td> data          </td><td> data   </td></tr>     
928<tr><td>File Size (kB)          </td><td> 7343                  </td><td> 11584         </td><td> 76450         </td><td> 2717 </td></tr> 
929<tr><td>Markup Item Count       </td><td> 74882                 </td><td> 280724        </td><td> 4634110       </td><td> 18004 </td></tr> 
930  <tr><td>Markup Density                </td><td> 0.13                  </td><td> 0.57          </td><td> 0.76          </td><td> 0.87  </td></tr> 
931                  </tbody>
932               </table>           
934         <para> A key predictor of the overall parsing performance of an XML file is markup
           density<footnote><para>Markup Density: the ratio of the number of markup bytes that define the structure
             of the document to its file size.</para></footnote>. This metric has substantial influence on the
937            performance of traditional recursive descent XML parsers because it directly corresponds
938            to the number of state transitions that occur when parsing a document. We use a mixture
939            of document-oriented and data-oriented XML files to analyze performance over a spectrum
940            of markup densities. </para>
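         <para> As a rough illustration of the metric, markup density can be approximated by
            counting the bytes that fall inside tags. This is a simplified sketch: how the
            densities reported in this paper treat entity references, comments, CDATA sections
            and whitespace is not specified here, so the function below is an assumption rather
            than the measurement procedure that was used. </para>
         <programlisting><![CDATA[
```python
def markup_density(xml_bytes: bytes) -> float:
    """Approximate markup density: bytes inside '<...>' divided by total bytes."""
    if not xml_bytes:
        return 0.0
    markup_bytes = 0
    in_tag = False
    for byte in xml_bytes:
        if byte == ord("<"):
            in_tag = True
        if in_tag:
            markup_bytes += 1      # counts the delimiters as well as the tag body
        if byte == ord(">"):
            in_tag = False
    return markup_bytes / len(xml_bytes)
```
]]></programlisting>
         <para> A document made mostly of text yields a value near 0, while a data-oriented
            document with short field values yields a value closer to 1. </para>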
941         <para> <xref linkend="perf_SAX"/> compares the performance of Xerces, icXML and pipelined icXML
942            in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over
            Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p can achieve
            a speedup of up to 2.7x. Xerces is substantially slowed by dense markup, but icXML is less
            affected owing to a reduction in branches and the use of parallel-processing techniques.
            icXML-p performs better as markup density increases because the work performed by each
947            stage is well balanced in this application. </para>
948         <para>
949        <figure xml:id="perf_SAX">
950          <title>SAXCount Performance Comparison</title>
951          <mediaobject>
952            <imageobject>
953              <imagedata format="png" fileref="perf_SAX.png" width="500cm"/>
954            </imageobject>
955          </mediaobject>
956          <caption>
957          </caption>
958        </figure>
959         </para>
960      </section>
961      <section>
962         <title>GML2SVG</title>
<para>   As a more substantial application of XML processing, the GML-to-SVG (GML2SVG) application
was chosen.   This application transforms geospatial data encoded in XML
as Geography Markup Language (GML) <citation linkend="lake2004geography"/>
into a different XML format suitable for displaying maps: the
Scalable Vector Graphics (SVG) format <citation linkend="lu2007advances"/>. In the GML2SVG benchmark, the tags of GML feature elements
and GML geometry elements are matched. GML coordinate data are then extracted
and transformed to the corresponding SVG path data encodings.
970Equivalent SVG path elements are generated and output to the destination
971SVG document.  The GML2SVG application is thus considered typical of a broad
972class of XML applications that parse and extract information from
973a known XML format for the purpose of analysis and restructuring to meet
974the requirements of an alternative format.</para>
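<para>The coordinate transformation step can be illustrated with a small sketch. It assumes
two-dimensional GML coordinate text in which tuples are separated by whitespace and the values
within a tuple by commas, and it emits an SVG path using absolute moveto and lineto commands.
The function name is hypothetical, and any scaling or axis flipping performed by the actual
GML2SVG application is not modelled.</para>
<programlisting><![CDATA[
```python
def gml_coordinates_to_svg_path(coordinates: str, closed: bool = False) -> str:
    """Convert 'x1,y1 x2,y2 ...' coordinate text into SVG path data."""
    # Split into tuples on whitespace, then into x and y on the comma.
    points = [tuple(pair.split(",")[:2]) for pair in coordinates.split()]
    if not points:
        return ""
    first_x, first_y = points[0]
    commands = [f"M {first_x},{first_y}"]            # move to the first point
    for x, y in points[1:]:
        commands.append(f"L {x},{y}")                # straight line to each next point
    if closed:
        commands.append("Z")                         # close the ring for polygons
    return " ".join(commands)
```
]]></programlisting>
<para>The resulting string would be placed in the <code>d</code> attribute of an SVG
<code>path</code> element.</para>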
976<para>Our GML to SVG data translations are executed on GML source data
977modelling the city of Vancouver, British Columbia, Canada.
The GML source document set
consists of 46 distinct GML feature layers ranging in size from approximately 9 kB to 125.2 MB,
with an average document size of 18.6 MB. Markup density ranges from approximately 0.045 to 0.719,
with an average markup density of 0.519. In this performance study,
982213.4 MB of source GML data generates 91.9 MB of target SVG data.</para>
985        <figure xml:id="perf_GML2SVG">
986          <title>Performance Comparison for GML2SVG</title>
987          <mediaobject>
988            <imageobject>
989              <imagedata format="png" fileref="Throughput.png" width="500cm"/>
990            </imageobject>
991          </mediaobject>
992          <caption>
993          </caption>
994        </figure>
<para><xref linkend="perf_GML2SVG"/> compares the performance of the GML2SVG application linked against
Xerces, icXML and icXML-p.
On the GML workload with this application, single-threaded icXML
achieved about a 50% acceleration over Xerces,
increasing throughput on our test machine from 58.3 MB/sec to 87.9 MB/sec.
Using icXML-p, a further throughput increase to 111 MB/sec was recorded,
approximately a 2x speedup over Xerces.</para>
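<para>The relative speedups follow directly from the throughput figures quoted above, as the
short calculation below shows. The dictionary and function names are illustrative only.</para>
<programlisting><![CDATA[
```python
# Throughput figures quoted in the text, in MB/sec.
THROUGHPUT_MB_PER_SEC = {"Xerces": 58.3, "icXML": 87.9, "icXML-p": 111.0}


def speedup_over_xerces(parser: str) -> float:
    """Ratio of a parser's throughput to the Xerces baseline."""
    return THROUGHPUT_MB_PER_SEC[parser] / THROUGHPUT_MB_PER_SEC["Xerces"]


icxml_speedup = speedup_over_xerces("icXML")        # about 1.5
icxml_p_speedup = speedup_over_xerces("icXML-p")    # about 1.9
```
]]></programlisting>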
1004<para>An important aspect of icXML is the replacement of much branch-laden
1005sequential code inside Xerces with straight-line SIMD code using far
1006fewer branches.  <xref linkend="branchmiss_GML2SVG"/> shows the corresponding
1007improvement in branching behaviour, with a dramatic reduction in branch misses per kB.
1008It is also interesting to note that icXML-p goes even further.   
1009In essence, in using pipeline parallelism to split the instruction
1010stream onto separate cores, the branch target buffers on each core are
1011less overloaded and able to increase the successful branch prediction rate.</para>
1013        <figure xml:id="branchmiss_GML2SVG">
1014          <title>Comparative Branch Misprediction Rate</title>
1015          <mediaobject>
1016            <imageobject>
1017              <imagedata format="png" fileref="BM.png" width="500cm"/>
1018            </imageobject>
1019          </mediaobject>
1020          <caption>
1021          </caption>
1022        </figure>
1024<para>The behaviour of the three versions with respect to L1 cache misses per kB is shown
1025in <xref linkend="cachemiss_GML2SVG"/>.   Improvements are shown in both instruction-
1026and data-cache performance with the improvements in instruction-cache
1027behaviour the most dramatic.   Single-threaded icXML shows substantially improved
1028performance over Xerces on both measures.   
1029Although icXML-p is slightly worse with respect to data-cache performance,
1030this is more than offset by a further dramatic reduction in instruction-cache miss rate.
Again, partitioning the instruction stream through the pipeline parallelism model has
a significant benefit.</para>
1034        <figure xml:id="cachemiss_GML2SVG">
1035          <title>Comparative Cache Miss Rate</title>
1036          <mediaobject>
1037            <imageobject>
1038              <imagedata format="png" fileref="CM.png" width="500cm"/>
1039            </imageobject>
1040          </mediaobject>
1041          <caption>
1042          </caption>
1043        </figure>
<para>One caveat with this study is that, in the GML2SVG application, the share of
processing time spent in application code rather than Xerces library
code did not reach the 33% figure.  This suggests that for this application and
1048possibly others, further separating the logical layers of the
1049icXML engine into different pipeline stages could well offer significant benefit.
1050This remains an area of ongoing work.</para>
1051      </section>
1052   </section>
1054   <section xml:id="conclusion">
1055      <title>Conclusion and Future Work</title>
1056      <para> This paper is the first case study documenting the significant performance benefits
1057         that may be realized through the integration of parallel bitstream technology into existing
1058         widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined
         integration of SIMD and multicore parallelism was shown capable of producing
         dramatic increases in throughput and reductions in branch mispredictions and cache misses.
         The modified parser, named icXML, is designed to provide the full
         functionality of the original Xerces library with complete compatibility of APIs. Although
1063         substantial re-engineering was required to realize the performance potential of parallel
1064         technologies, this is an important case study demonstrating the general feasibility of
1065         these techniques. </para>
1066      <para> The further development of icXML to move beyond 2-stage pipeline parallelism is
1067         ongoing, with realistic prospects for four reasonably balanced stages within the library.
1068         For applications such as GML2SVG which are dominated by time spent on XML parsing, such a
1069         multistage pipelined parsing library should offer substantial benefits. </para>
      <para> The example of XML parsing may be considered prototypical of finite-state machine
         applications, which have sometimes been considered &quot;embarrassingly
         sequential&quot; and so difficult to parallelize that &quot;nothing
         works.&quot; The case study presented here should therefore be considered an important data
         point in making the case that parallelization can indeed be helpful across a broad array of
1075         application types. </para>
1076      <para> To overcome the software engineering challenges in applying parallel bitstream
1077         technology to existing software systems, it is clear that better library and tool support
1078         is needed. The techniques used in the implementation of icXML and documented in this paper
1079         could well be generalized for applications in other contexts and automated through the
1080         creation of compiler technology specifically supporting parallel bitstream programming.
1081      </para>
1082   </section>
1085  <title>Bibliography</title>
1086  <bibliomixed xml:id="CameronHerdyLin2008" xreflabel="Cameron and Herdy 2008">Cameron, Robert D., Herdy, Kenneth S. and Lin, Dan. High performance XML parsing using parallel bit stream technology. CASCON'08: Proc. 2008 conference of the center for advanced studies on collaborative research. 2008 New York, NY, USA</bibliomixed>
  <bibliomixed xml:id="papi" xreflabel="Innovative Computing Laboratory">Innovative Computing Laboratory, University of Tennessee. Performance Application Programming Interface.<link></link></bibliomixed>
1088  <bibliomixed xml:id="perf" xreflabel="Eranian and Gouriou">Eranian, Stephane, Gouriou, Eric, Moseley, Tipp and Bruijn, Willem de. Linux kernel profiling with perf.<link></link></bibliomixed>
  <bibliomixed xml:id="Cameron2008" xreflabel="Cameron 2008">Cameron, Robert D. A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding. Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2008 New York, NY, USA</bibliomixed>
1090  <bibliomixed xml:id="ParaDOM2009" xreflabel="Shah and Rao 2009">Shah, Bhavik, Rao, Praveen, Moon, Bongki and Rajagopalan, Mohan. A Data Parallel Algorithm for XML DOM Parsing. Database and XML Technologies. 2009</bibliomixed>
1091  <bibliomixed xml:id="XMLSSE42" xreflabel="Lei 2008">Lei, Zhai. XML Parsing Accelerator with Intel Streaming SIMD Extensions 4 (Intel SSE4). 2008<link>Intel Software Network</link></bibliomixed>
  <bibliomixed xml:id="Cameron2009" xreflabel="Cameron and Herdy 2009">Cameron, Rob, Herdy, Ken and Amiri, Ehsan. Parallel Bit Stream Technology as a Foundation for XML Parsing Performance. Int'l Symposium on Processing XML Efficiently: Overcoming Limits on Space, Time, or Bandwidth. 2009</bibliomixed>
  <bibliomixed xml:id="HilewitzLee2006" xreflabel="Hilewitz and Lee 2006">Hilewitz, Yedidya and Lee, Ruby B. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions. ASAP '06: Proc. IEEE 17th Int'l Conference on Application-specific Systems, Architectures and Processors. 2006 Washington, DC, USA</bibliomixed>
1094  <bibliomixed xml:id="Asanovic-EECS-2006-183" xreflabel="Asanovic and others 2006">Asanovic, Krste and others. The Landscape of Parallel Computing Research: A View from Berkeley. 2006</bibliomixed>
1095  <bibliomixed xml:id="GRID2006" xreflabel="Lu and Chiu 2006">Lu, Wei, Chiu, Kenneth and Pan, Yinfei. A Parallel Approach to XML Parsing. Proceedings of the 7th IEEE/ACM International Conference on Grid Computing. 2006 Washington, DC, USA</bibliomixed>
  <bibliomixed xml:id="cameron-EuroPar2011" xreflabel="Cameron and Amiri 2011">Cameron, Robert D., Amiri, Ehsan, Herdy, Kenneth S., Lin, Dan, Shermer, Thomas C. and Popowich, Fred P. Parallel Scanning with Bitstream Addition: An XML Case Study. Euro-Par 2011, LNCS 6853, Part II. 2011 Berlin, Heidelberg</bibliomixed>
1097  <bibliomixed xml:id="HPCA2012" xreflabel="Lin and Medforth 2012">Lin, Dan, Medforth, Nigel, Herdy, Kenneth S., Shriraman, Arrvindh and Cameron, Rob. Parabix: Boosting the efficiency of text processing on commodity processors. International Symposium on High-Performance Computer Architecture. 2012 Los Alamitos, CA, USA</bibliomixed>
1098  <bibliomixed xml:id="HPCC2011" xreflabel="You and Wang 2011">You, Cheng-Han and Wang, Sheng-De. A Data Parallel Approach to XML Parsing and Query. 10th IEEE International Conference on High Performance Computing and Communications. 2011 Los Alamitos, CA, USA</bibliomixed>
1099  <bibliomixed xml:id="E-SCIENCE2007" xreflabel="Pan and Zhang 2007">Pan, Yinfei, Zhang, Ying, Chiu, Kenneth and Lu, Wei. Parallel XML Parsing Using Meta-DFAs. International Conference on e-Science and Grid Computing. 2007 Los Alamitos, CA, USA</bibliomixed>
1100  <bibliomixed xml:id="ICWS2008" xreflabel="Pan and Zhang 2008">Pan, Yinfei, Zhang, Ying and Chiu, Kenneth. Hybrid Parallelism for XML SAX Parsing. IEEE International Conference on Web Services. 2008 Los Alamitos, CA, USA</bibliomixed>
1101  <bibliomixed xml:id="IPDPS2008" xreflabel="Pan and Zhang 2008">Pan, Yinfei, Zhang, Ying and Chiu, Kenneth. Simultaneous transducers for data-parallel XML parsing. International Parallel and Distributed Processing Symposium. 2008 Los Alamitos, CA, USA</bibliomixed>
  <bibliomixed xml:id="HackersDelight" xreflabel="Warren 2002">Warren, Henry S. Hacker's Delight. 2002</bibliomixed>
  <bibliomixed xml:id="lu2007advances" xreflabel="Lu and Dos Santos 2007">Lu, C.T., Dos Santos, R.F., Sripada, L.N. and Kou, Y. Advances in GML for geospatial applications. 2007</bibliomixed>
  <bibliomixed xml:id="lake2004geography" xreflabel="Lake and Burggraf 2004">Lake, R., Burggraf, D.S., Trninic, M. and Rae, L. Geography mark-up language (GML) [foundation for the geo-web]. 2004</bibliomixed>