source: docs/Balisage13/Bal2013came0601/Bal2013came0601.xml @ 3053

Last change on this file since 3053 was 3053, checked in by cameron, 6 years ago

LB normalization figure

File size: 75.1 KB
1<?xml version="1.0" encoding="UTF-8"?>
3<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
4<article xmlns="" version="5.0-subset Balisage-1.3"
5   xml:id="HR-23632987-8973">
6   <title/>
7   <info>
8      <!--
9      <confgroup>
10         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
11            Space, Time, or Bandwidth</conftitle>
12         <confdates>August 10 2009</confdates>
      </confgroup>
      -->
15      <abstract>
16         <para>Prior research on the acceleration of XML processing using SIMD and multi-core
            parallelism has led to a number of interesting research prototypes. This work
18            investigates the extent to which the techniques underlying these prototypes could result
19            in systematic performance benefits when fully integrated into a commercial XML parser.
20            The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the
21            foundation for the study. A systematic restructuring of the parser was undertaken, while
22            maintaining the existing API for application programmers. Using SIMD techniques alone,
23            an increase in parsing speed of at least 50% was observed in a range of applications.
24            When coupled with pipeline parallelism on dual core processors, improvements of 2x and
25            beyond were realized. </para>
26      </abstract>
27      <author>
28         <personname>
29            <firstname>Nigel</firstname>
30            <surname>Medforth</surname>
31         </personname>
32         <personblurb>
            <para>Nigel Medforth is an M.Sc. student at Simon Fraser University and the lead
34               developer of icXML. He earned a Bachelor of Technology in Information Technology at
35               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
36               Outstanding Achievement.</para>
37            <para>Nigel is currently researching ways to leverage both the Parabix framework and
38               stream-processing models to further accelerate XML parsing within icXML.</para>
39         </personblurb>
40         <affiliation>
41            <jobtitle>Developer</jobtitle>
42            <orgname>International Characters Inc.</orgname>
43         </affiliation>
44         <affiliation>
45            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
46            <orgname>Simon Fraser University </orgname>
47         </affiliation>
48         <email></email>
49      </author>
50      <author>
51         <personname>
52            <firstname>Dan</firstname>
53            <surname>Lin</surname>
54         </personname>
55         <personblurb>
           <para>Dan Lin is a Ph.D. student at Simon Fraser University. She earned a Master of Science
             in Computing Science at Simon Fraser University in 2010. Her research focuses on
             high-performance algorithms that exploit parallelization strategies on various multicore platforms.
59           </para>
60         </personblurb>
61         <affiliation>
62            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
63            <orgname>Simon Fraser University </orgname>
64         </affiliation>
65         <email></email>
66      </author>
67      <author>
68         <personname>
69            <firstname>Kenneth</firstname>
70            <surname>Herdy</surname>
71         </personname>
72         <personblurb>
73            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
74               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
75               of Science in Computing Science with a Certificate in Spatial Information Systems at
76               Simon Fraser University in 2005. </para>
77            <para> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
78               University with industrial scholarship support from the Natural Sciences and
79               Engineering Research Council of Canada, the Mathematics of Information Technology and
80               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
81               of the principal techniques that may be used to improve XML processing performance in
82               the context of the Geography Markup Language (GML). </para>
83         </personblurb>
84         <affiliation>
85            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
86            <orgname>Simon Fraser University </orgname>
87         </affiliation>
88         <email></email>
89      </author>
90      <author>
91         <personname>
92            <firstname>Rob</firstname>
93            <surname>Cameron</surname>
94         </personname>
95         <personblurb>
96            <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
97               Sciences at Simon Fraser University. His research interests include programming
98               language and software system technology, with a specific focus on high performance
99               text processing using SIMD and multicore parallelism. He is the developer of the REX
100               XML shallow parser as well as the parallel bit stream (Parabix) framework for SIMD
101               text processing. </para>
102         </personblurb>
103         <affiliation>
104            <jobtitle>Professor of Computing Science</jobtitle>
105            <orgname>Simon Fraser University</orgname>
106         </affiliation>
107         <affiliation>
108            <jobtitle>Chief Technology Officer</jobtitle>
109            <orgname>International Characters, Inc.</orgname>
110         </affiliation>
111         <email></email>
112      </author>
113      <author>
114         <personname>
115            <firstname>Arrvindh</firstname>
116            <surname>Shriraman</surname>
117         </personname>
118         <personblurb>
119            <para/>
120         </personblurb>
121         <affiliation>
122            <jobtitle/>
123            <orgname/>
124         </affiliation>
125         <email/>
126      </author>
127      <!--
128      <legalnotice>
129         <para>Copyright &#x000A9; 2013 Nigel Medforth, Dan Lin, Kenneth S. Herdy, Robert D. Cameron  and Arrvindh Shriraman.
130            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
131            Works 2.5 Canada License.</para>
      </legalnotice>
      -->
134      <keywordset role="author">
135         <keyword/>
136      </keywordset>
138   </info>
139   <section>
140      <title>Introduction</title>
141      <para/>
142      <para/>
143      <para/>
144      <para/>
145   </section>
147   <section>
148      <title>Background</title>
149      <section>
150         <title>Xerces C++ Structure</title>
         <para> The Xerces C++ parser is a widely used, standards-conformant
152            XML parser produced as open-source software
153             by the Apache Software Foundation.
154            It features comprehensive support for a variety of character encodings both
155            commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple
156            XML vocabularies through the XML namespace mechanism, as well as complete
157            implementations of structure and data validation through multiple grammars declared
158            using either legacy DTDs (document type definitions) or modern XML Schema facilities.
159            Xerces also supports several APIs for accessing parser services, including event-based
160            parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM
161            tree-based parsing interface. </para>
         <para>
            Xerces, like all traditional parsers, processes XML documents sequentially, one byte
            at a time, from the first to the last byte of input data. Each byte passes through
            several processing layers and is classified and eventually validated within the
            context of the document state. This introduces implicit dependencies between the
            various tasks within the application that make it difficult to optimize for
            performance. Because Xerces is a complex software system, no one feature dominates
            the overall parsing performance. Table I shows the execution time profile of the
            top ten functions in a typical run. Even if any one of these functions could be
            parallelized in isolation, Amdahl's Law dictates that doing so would produce only a
            minute improvement in overall performance. Moreover, early investigation found that
            speculation-free thread-level parallelization could not be incorporated into these
            functions, and that they already performed well at their given tasks; thus only
            trivial enhancements were attainable. A systematic acceleration of Xerces should
            therefore be expected to require a comprehensive restructuring involving all
            aspects of the parser. </para>
179             <table>
180                  <caption>
181                     <para>Execution Time of Top 10 Xerces Functions</para>
182                  </caption>
183                  <colgroup>
184                     <col align="left" valign="top"/>
185                     <col align="left" valign="top"/>
186                  </colgroup>
187                  <thead><tr><th>Time (%) </th><th> Function Name </th></tr></thead>
188                  <tbody>
189<tr valign="top"><td>13.29      </td>   <td>XMLUTF8Transcoder::transcodeFrom </td></tr>
190<tr valign="top"><td>7.45       </td>   <td>IGXMLScanner::scanCharData </td></tr>
191<tr valign="top"><td>6.83       </td>   <td>memcpy </td></tr>
192<tr valign="top"><td>5.83       </td>   <td>XMLReader::getNCName </td></tr>
193<tr valign="top"><td>4.67       </td>   <td>IGXMLScanner::buildAttList </td></tr>
<tr valign="top"><td>4.54       </td>   <td>RefHashTableOf&lt;&gt;::findBucketElem </td></tr>
195<tr valign="top"><td>4.20       </td>   <td>IGXMLScanner::scanStartTagNS </td></tr>
196<tr valign="top"><td>3.75       </td>   <td>ElemStack::mapPrefixToURI </td></tr>
197<tr valign="top"><td>3.58       </td>   <td>ReaderMgr::getNextChar </td></tr>
198<tr valign="top"><td>3.20       </td>   <td>IGXMLScanner::basicAttrValueScan </td></tr>
199                  </tbody>
200               </table>
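         <para>The appeal to Amdahl's Law can be checked with a little arithmetic. The sketch
            below is illustrative only: the function name <code>amdahl_speedup</code> is
            hypothetical, and the only data used are the percentages in the table above. Even if
            the most expensive function cost nothing at all, the overall gain would stay at about
            1.15x.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Time shares (percent) of the ten functions listed in the table above.
shares = [13.29, 7.45, 6.83, 5.83, 4.67, 4.54, 4.20, 3.75, 3.58, 3.20]

top = shares[0] / 100.0
print(f"transcodeFrom made 2x faster: {amdahl_speedup(top, 2.0):.3f}x overall")
print(f"transcodeFrom made free:      {amdahl_speedup(top, float('inf')):.3f}x overall")

total = sum(shares) / 100.0
print(f"all ten functions made 2x faster: {amdahl_speedup(total, 2.0):.3f}x overall")
```
]]></programlisting>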
201      </section>
202      <section>
203         <title>The Parabix Framework</title>
         <para> The Parabix (parallel bit stream) framework is a transformative approach to XML
            parsing (and other forms of text processing). The key idea is to exploit the
            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
            data from long blocks of input data by using one register bit per input byte. To
            facilitate this, the input data is first transposed into a set of basis bit streams.
            For example, Table II shows the ASCII bytes for the string "<code>b7&lt;A</code>", and
            Table III shows the corresponding 8 basis bit streams, b<subscript>0</subscript> through
            b<subscript>7</subscript>.
            <!-- The bits used to construct $\tt <subscript>7</subscript>$ have been highlighted in this example. -->
            Boolean-logic operations<footnote><para>&#8743;, &#8744; and &#172; denote the
            Boolean AND, OR and NOT operators.</para></footnote> are used to classify the input
            bits into a set of <emphasis role="ital">character-class bit streams</emphasis>, which
            identify key characters (or groups of characters) with a <code>1</code>. For example,
            one of the fundamental characters in XML is the left angle bracket. A character is a
            <code>&apos;&lt;&apos;</code> if and only if
            <code>&#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
               &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>)
               &#8743; &#172;(b<subscript>6</subscript> &#8744;
               b<subscript>7</subscript>) = 1</code>. Similarly, a character is numeric,
            <code>[0-9]</code>, if and only if
            <code>&#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
               &#8743; &#172;(b<subscript>4</subscript> &#8743; (b<subscript>5</subscript>
               &#8744; b<subscript>6</subscript>)) = 1</code>. An important observation here is
            that ranges of characters may require fewer operations than individual characters and
            <!-- the classification cost could be amortized over many character classes.--> multiple
            classes can share the classification cost. </para>
232         <table>
233                  <caption>
234                     <para>XML Source Data</para>
235                  </caption>
236                  <colgroup>
237                     <col align="right" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
242                  </colgroup>
243                  <tbody>
244  <tr><td>String </td><td> <code>b</code> </td><td> <code>7</code> </td><td> <code>&lt;</code> </td><td> <code>A</code> </td></tr>
245  <tr><td>ASCII </td><td> <code>0110001<emphasis role="bold">0</emphasis></code> </td><td> <code>0011011<emphasis role="bold">1</emphasis></code> </td><td> <code>0011110<emphasis role="bold">0</emphasis></code> </td><td> <code>0100000<emphasis role="bold">1</emphasis></code> </td></tr>
  </tbody>
         </table>
250         <table>
251                  <caption>
252                     <para>8-bit ASCII Basis Bit Streams</para>
253                  </caption>
254                  <colgroup>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
                     <col align="center" valign="top"/>
263                  </colgroup>
264                  <tbody>
265<tr><td> b<subscript>0</subscript> </td><td> b<subscript>1</subscript> </td><td> b<subscript>2</subscript> </td><td> b<subscript>3</subscript></td><td> b<subscript>4</subscript> </td><td> b<subscript>5</subscript> </td><td> b<subscript>6</subscript> </td><td> b<subscript>7</subscript> </td></tr>
266 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
267 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
268 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
269 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
  </tbody>
         </table>
275         <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
276         <!-- bit streams can be transformed into lexical bit streams, where the presense of -->
277         <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
278         <!-- process, intra-element well-formedness validation is performed on each block -->
279         <!-- of text. -->
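         <para>The transposition and the two character-class formulas given above can be
            sketched in a few lines. This is a minimal illustration, not the SIMD implementation:
            it assumes that a Python integer stands in for a register, with bit <code>i</code> of
            each stream corresponding to input byte <code>i</code>. The names
            <code>transpose</code>, <code>classify</code> and <code>show</code> are
            hypothetical.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def transpose(data):
    """Return basis bit streams b[0..7]; bit i of b[k] is bit k of byte i (b0 = most significant)."""
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << i
    return basis

def classify(data):
    """Compute the '<' and [0-9] character-class streams with Boolean operations only."""
    b = transpose(data)
    mask = (1 << len(data)) - 1                  # confines NOT to the block
    hi_0011 = ~(b[0] | b[1]) & (b[2] & b[3])     # shared by both classes
    left_angle = hi_0011 & (b[4] & b[5]) & ~(b[6] | b[7]) & mask
    digit = hi_0011 & ~(b[4] & (b[5] | b[6])) & mask
    return left_angle, digit

def show(stream, length):
    """Render a stream with 1 for set bits and _ for clear bits, first position on the left."""
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(length))

text = b"b7<A"
lt, dg = classify(text)
print(show(lt, len(text)))   # marks the position of '<'
print(show(dg, len(text)))   # marks the position of '7'
```
]]></programlisting>
         <para>Note that the term for the high nibble is computed once and reused, which is the
            sense in which multiple classes share the classification cost.</para>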
         <para> Consider, for example, the XML source data stream shown in the first line of Table IV.
            The remaining lines of this table show several parallel bit streams that are computed in
            Parabix-style parsing, with each bit of each stream in one-to-one correspondence to the
            source character code units of the input stream. For clarity, 1 bits are denoted with 1
            in each stream and 0 bits are represented as underscores. The first bit stream shown is
            that for the opening angle brackets that represent tag openers in XML. The second and
            third streams show a partition of the tag openers into start tag marks and end tag marks
            depending on whether or not the character immediately following the opener is
            &quot;<code>/</code>&quot;. The last three lines show streams that can be computed in
            subsequent parsing (using the technique of bitstream addition
            \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names
            and attribute values of tags. </para>
292            <table>
293                  <caption>
294                     <para>XML Source Data and Derived Parallel Bit Streams</para>
295                  </caption>
296                  <colgroup>
                     <col align="center" valign="top"/>
298                     <col align="left" valign="top"/>
299                  </colgroup>
300                  <tbody>
301          <tr><td> Source Data </td><td> <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
302          <tr><td> Tag Openers </td><td> <code>1____________1____________________________1____________1__________</code></td></tr>
303           <tr><td> Start Tag Marks </td><td> <code>_1____________1___________________________________________________</code></td></tr>
304           <tr><td> End Tag Marks </td><td> <code>___________________________________________1____________1_________</code></td></tr>
305           <tr><td> Empty Tag Marks </td><td> <code>__________________________________________________________________</code></td></tr>
306           <tr><td> Element Names </td><td> <code>_11111111_____1111111_____________________________________________</code></td></tr>
307           <tr><td> Attribute Names </td><td> <code>______________________11_______11_________________________________</code></td></tr>
308           <tr><td> Attribute Values </td><td> <code>__________________________111________111__________________________</code></td></tr>
309                  </tbody>
310               </table>         
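         <para>Several rows of the table above can be reproduced with a short sketch. It
            assumes, for illustration only, that a stream is a Python integer whose bit 0 is the
            first character, so that a carry moves forward through the text. The helper names
            (<code>char_stream</code>, <code>scan_thru</code>, <code>show</code>) are hypothetical,
            and the set of name characters is deliberately simplified. The single addition in
            <code>scan_thru</code> advances every start tag mark to the end of its name at
            once.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
SOURCE = "<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>"

def char_stream(text, chars):
    """Bit i is 1 when text[i] is one of `chars` (bit 0 is the first character)."""
    return sum(1 << i for i, c in enumerate(text) if c in chars)

def show(stream, length):
    """Render a stream in the notation of the table: 1 for set bits, _ otherwise."""
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(length))

def scan_thru(markers, run):
    """Move every marker past the run of `run` bits it sits on, with one addition."""
    return (markers + run) & ~run

n = len(SOURCE)
mask = (1 << n) - 1
openers = char_stream(SOURCE, "<")
slashes = char_stream(SOURCE, "/")
# Assumption: names in this example use only lower-case letters and digits.
name_chars = char_stream(SOURCE, "abcdefghijklmnopqrstuvwxyz0123456789")

after_opener = (openers << 1) & mask          # position following each '<'
start_marks = after_opener & ~slashes
end_marks = after_opener & slashes
name_ends = scan_thru(start_marks, name_chars)
element_names = name_ends - start_marks       # 1 bits from each mark up to its name end

for label, stream in [("Tag Openers", openers), ("Start Tag Marks", start_marks),
                      ("End Tag Marks", end_marks), ("Element Names", element_names)]:
    print(f"{label:16}{show(stream, n)}")
```
]]></programlisting>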
312         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
313            parsing performance. The first is that the use of the full register width offers a
314            considerable information advantage over sequential byte-at-a-time parsing. That is,
315            sequential processing of bytes uses just 8 bits of each register, greatly limiting the
316            processor resources that are effectively being used at any one time. The second is that
            byte-at-a-time scanning loops are actually often just computing a single bit of
318            information per iteration: is the scan complete yet? Rather than computing these
319            individual decision-bits, an approach that computes many of them in parallel (e.g., 128
320            bytes at a time using 128-bit registers) should provide substantial benefit. </para>
321         <para> Previous studies have shown that the Parabix approach improves many aspects of XML
322            processing, including transcoding \cite{Cameron2008}, character classification and
323            validation, tag parsing and well-formedness checking. The first Parabix parser used
324            processor bit scan instructions to considerably accelerate sequential scanning loops for
325            individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method
326            of parallel scanning using bitstream addition \cite{cameron-EuroPar2011}, as well as
327            combining SIMD methods with 4-stage pipeline parallelism to further improve throughput
328            \cite{HPCA2012}. Although these research prototypes handled the full syntax of
329            schema-less XML documents, they lacked the functionality required by full XML parsers. </para>
330         <para> Commercial XML processors support transcoding of multiple character sets and can
331            parse and validate against multiple document vocabularies. Additionally, they provide
332            API facilities beyond those found in research prototypes, including the widely used SAX,
333            SAX2 and DOM interfaces. </para>
334      </section>
335      <section>
336         <title>Sequential vs. Parallel Paradigm</title>
337         <para> Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents
338            sequentially. Each character is examined to distinguish between the XML-specific markup,
339            such as a left angle bracket <code>&lt;</code>, and the content held within the
340            document. As the parser progresses through the document, it alternates between markup
341            scanning, validation and content processing modes. </para>
         <para> In other words, Xerces belongs to the class of applications termed FSM
            applications<footnote><para>Herein, FSM applications are considered to be software
            systems whose behaviour is defined by the inputs, the current state and the events
            associated with transitions between states.</para></footnote>. Each state transition
            indicates the processing context of subsequent characters. Unfortunately, textual data
            tends to be unpredictable and any character could induce a state transition. </para>
348         <para> Parabix-style XML parsers utilize a concept of layered processing. A block of source
349            text is transformed into a set of lexical bitstreams, which undergo a series of
350            operations that can be grouped into logical layers, e.g., transposition, character
            classification, and lexical analysis. Each layer is pipeline parallel and requires
            neither speculation nor pre-parsing stages \cite{HPCA2012}. To meet the API requirements
353            of the document-ordered Xerces output, the results of the Parabix processing layers must
354            be interleaved to produce the equivalent behaviour. </para>
355      </section>
356   </section>
357   <section>
358      <title>Architecture</title>
359      <section>
360         <title>Overview</title>
361         <!--\def \CSG{Content Stream Generator}-->
362         <para> icXML is more than an optimized version of Xerces. Many components were grouped,
363            restructured and rearchitected with pipeline parallelism in mind. In this section, we
            highlight the core differences between the two systems. As shown in
            <xref linkend="xerces-arch"/>, Xerces comprises five main modules: the transcoder, reader,
366            scanner, namespace binder, and validator. The <emphasis role="ital"
367            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
368            the majority of the character set encoding validation is performed as a byproduct of
369            this process. The <emphasis role="ital">Reader</emphasis> is responsible for the
370            streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current
371            line/column position,
372            <!--(which is reported in the unlikely event that the input contains an error), -->
373            performs line-break normalization and validates context-specific character set issues,
374            such as tokenization of qualified-names. The <emphasis role="ital">Scanner</emphasis>
375            pulls data through the reader and constructs the intermediate representation (IR) of the
376            document; it deals with all issues related to entity expansion, validates the XML
377            well-formedness constraints and any character set encoding issues that cannot be
            completely handled by the reader or transcoder (e.g., surrogate characters, validation
            and normalization of character references, etc.). The <emphasis role="ital">Namespace
380               Binder</emphasis> is a core piece of the element stack. It handles namespace scoping
381            issues between different XML vocabularies. This allows the scanner to properly select
382            the correct schema grammar structures. The <emphasis role="ital">Validator</emphasis>
383            takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder)
384            and assesses whether the final output matches the user-defined DTD and schema grammar(s)
385            before passing it to the end-user. </para>     
386        <figure xml:id="xerces-arch">
387          <title>Xerces Architecture</title>
388          <mediaobject>
389            <imageobject>
390              <imagedata format="png" fileref="xerces.png" width="150cm"/>
391            </imageobject>
392          </mediaobject>
393          <caption>
394          </caption>
395        </figure>
         <para> In icXML, functions are grouped into logical components. As shown in
            <xref linkend="icxml-arch"/>, two major categories exist: (1) the Parabix Subsystem and (2) the
398            Markup Processor. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which
399            represents data as a set of parallel bitstreams. The <emphasis role="ital">Character Set
               Adapter</emphasis>, discussed in the section on Character Set Adapters, mirrors
            Xerces's Transcoder duties; however, instead of producing UTF-16 it produces a set of
            lexical bitstreams, similar to those shown in Table IV. These lexical
403            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
404            additional processing is performed. The first precursor to producing UTF-16 is the
405               <emphasis role="ital">Parallel Markup Parser</emphasis> phase. It takes the lexical
406            streams and produces a set of marker bitstreams in which a 1-bit identifies significant
            positions within the input data. One bitstream is created for each critical piece of
            information, such as the beginning and ending of start tags, end tags,
409            element names, attribute names, attribute values and content. Intra-element
410            well-formedness validation is performed as an artifact of this process. Like Xerces,
411            icXML must provide the Line and Column position of each error. The <emphasis role="ital"
412               >Line-Column Tracker</emphasis> uses the lexical information to keep track of the
413            document position(s) through the use of an optimized population count algorithm,
414            described in Section \ref{section:arch:errorhandling}. From here, two data-independent
415            branches exist: the Symbol Resolver and Content Preparation Unit. </para>
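         <para>The idea of deriving a line and column position from a population count can be
            sketched as follows. This is only an illustration of the principle, not the optimized
            algorithm used by icXML: it assumes a bit stream with a 1 at every line feed, counts
            columns in code units, and uses a hypothetical function name,
            <code>line_column</code>.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def line_column(newlines, pos):
    """1-based (line, column) of position `pos`, given a stream with a 1 at every line feed."""
    before = newlines & ((1 << pos) - 1)      # line feeds strictly before pos
    line = 1 + bin(before).count("1")         # population count
    last_nl = before.bit_length() - 1         # nearest preceding line feed, -1 if none
    return line, pos - last_nl

text = "<a>\n  <b>\n</a>"
newlines = sum(1 << i for i, c in enumerate(text) if c == "\n")
print(line_column(newlines, text.index("<b>")))
```
]]></programlisting>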
416         <para> A typical XML file contains few unique element and attribute names&#8212;but
417            each of them will occur frequently. icXML stores these as distinct data structures,
            called symbols, each with its own global identifier (GID). Using the symbol marker
419            streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol
420               Resolver</emphasis> scans through the raw data to produce a sequence of GIDs, called
421            the <emphasis role="ital">symbol stream</emphasis>. </para>
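         <para>A minimal sketch of symbol resolution, under stated assumptions, is given below.
            The class <code>SymbolTable</code>, the function <code>runs</code> and the sample
            input are hypothetical, the marker stream is written as a string of 1 and _
            characters for readability, and GIDs are simply assigned in order of first
            occurrence. The actual icXML data structures are not described here.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
class SymbolTable:
    """Assigns a global identifier (GID) to each distinct name, in order of first occurrence."""

    def __init__(self):
        self._gids = {}
        self.names = []

    def resolve(self, name):
        gid = self._gids.get(name)
        if gid is None:
            gid = len(self.names)
            self._gids[name] = gid
            self.names.append(name)
        return gid

def runs(marks):
    """Yield (start, end) for each maximal run of '1' characters in a marker string."""
    start = None
    for i, m in enumerate(marks + "_"):
        if m == "1" and start is None:
            start = i
        elif m != "1" and start is not None:
            yield start, i
            start = None

TEXT  = "<a x='1'><b x='2'/><b x='3'/></a>"
NAMES = "_1_1______1_1_______1_1__________"   # element and attribute names in start tags

table = SymbolTable()
stream = [table.resolve(TEXT[s:e]) for s, e in runs(NAMES)]
print(stream, table.names)
```
]]></programlisting>
         <para>Six name occurrences map to three symbols, so later stages can work with small
            integers instead of repeated strings.</para>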
422         <para> The final components of the Parabix Subsystem are the <emphasis role="ital">Content
423               Preparation Unit</emphasis> and <emphasis role="ital">Content Stream
424            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
425            filters them, according to the information provided by the Parallel Markup Parser, and
426            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital"
427               >content stream</emphasis>, discussed in Section \ref{section:arch:contentstream}. </para>
         <para> Combined, the symbol and content streams form icXML's compressed IR of the XML
            document. The <emphasis role="ital">Markup Processor</emphasis> parses the IR to
430            validate and produce the sequential output for the end user. The <emphasis role="ital"
431               >Final WF checker</emphasis> performs inter-element well-formedness validation that
432            would be too costly to perform in bit space, such as ensuring every start tag has a
433            matching end tag. Xerces's namespace binding functionality is replaced by the <emphasis
434               role="ital">Namespace Processor</emphasis>. Unlike Xerces, it is a discrete phase
435            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
436               stream</emphasis>, which are associated with each symbol occurrence. This is
437            discussed in Section \ref{section:arch:namespacehandling}. Finally, the <emphasis
               role="ital">Validation</emphasis> layer implements Xerces's validator. However,
439            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
440        <figure xml:id="icxml-arch">
441          <title>icXML Architecture</title>
442          <mediaobject>
443            <imageobject>
444              <imagedata format="png" fileref="icxml.png" width="500cm"/>
445            </imageobject>
446          </mediaobject>
447          <caption>
448          </caption>
449        </figure>
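         <para>One inter-element check named above, that every start tag has a matching end tag,
            is naturally sequential. The sketch below shows how such a check might be expressed
            over GIDs rather than name strings. It is an assumption-laden illustration: the event
            representation and the function name <code>tags_balanced</code> are hypothetical and
            are not taken from icXML.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def tags_balanced(events):
    """events: ("start", gid) and ("end", gid) pairs in document order."""
    open_gids = []
    for kind, gid in events:
        if kind == "start":
            open_gids.append(gid)
        elif not open_gids or open_gids.pop() != gid:
            return False              # end tag with no matching start tag
    return not open_gids              # every start tag must have been closed

good = [("start", 0), ("start", 1), ("end", 1), ("end", 0)]
bad = [("start", 0), ("start", 1), ("end", 0), ("end", 1)]
print(tags_balanced(good), tags_balanced(bad))
```
]]></programlisting>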
450      </section>
451      <section>
452         <title>Character Set Adapters</title>
         <para> In Xerces, all input is transcoded into UTF-16 to simplify parsing within
            Xerces itself and to provide the end-consumer with a single encoding format. In the
455            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
456            because of the need to decode and classify each byte of input, mapping variable-length
457            UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In
458            other cases, transcoding may involve table look-up operations for each byte of input. In
459            any case, transcoding imposes at least a cost of buffer copying. </para>
460         <para> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
461            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
462            that input code units represent valid characters, mapping the characters of the encoding
463            into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical
464            item streams), as well as supporting ultimate transcoding requirements. All of this work
465            is performed using the parallel bitstream representation of the source input. </para>
         <para> An important observation is that many character sets are extensions of the legacy
            7-bit ASCII character set. These include the various ISO Latin character sets, UTF-8,
468            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
469            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
470            serves to compute lexical item streams for all such ASCII-based character sets. </para>
471         <para> A second observation is that&#8212;regardless of which character set is
472            used&#8212;quite often all of the characters in a particular block of input will be
473            within the ASCII range. This is a very simple test to perform using the bitstream
474            representation, simply confirming that the bit 0 stream is zero for the entire block.
475            For blocks satisfying this test, all logic dealing with non-ASCII characters can simply
476            be skipped. Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
477            UTF-16 form are each set to zero in this case. </para>
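         <para> As a rough illustration of this test, the following Python sketch models the bit 0
            stream of a block as an integer and takes the fast path when it is zero. The function
            names and the byte-at-a-time construction of the stream are assumptions made for
            illustration only; icXML itself operates on transposed SIMD registers. </para>
         <programlisting><![CDATA[
```python
def bit0_stream(block: bytes) -> int:
    """Return an int whose bit i is set when byte i of the block has its high bit set.

    This models the 'bit 0' (most significant bit) stream of the block.
    """
    stream = 0
    for i, byte in enumerate(block):
        if byte & 0x80:
            stream |= 1 << i
    return stream


def transcode_ascii_block(block: bytes):
    """Widen an all-ASCII block to UTF-16 code units, or return None for the general path."""
    if bit0_stream(block) != 0:
        return None  # at least one non-ASCII byte: full transcoding logic is needed
    # For ASCII input the high eight bits of every UTF-16 code unit are zero,
    # so each byte value already is the code unit value.
    return list(block)
```
]]></programlisting>
         <para> A caller would try <code>transcode_ascii_block</code> first and fall back to full
            transcoding only when it returns <code>None</code>. </para>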
478         <para> A third observation is that repeated transcoding of the names of XML elements,
479            attributes and so on can be avoided by using a look-up mechanism. That is, the first
480            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
481            numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up
            operations can avoid transcoding by simply retrieving the stored representation. As
            symbol look-up is required to apply various XML validation rules, this achieves the
            effect of transcoding each occurrence without additional cost. </para>
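         <para> The look-up mechanism can be sketched as follows. The class
            <code>SymbolTable</code>, its fields, and the use of a Python string to stand in for the
            stored UTF-16 form are hypothetical; the sketch only shows that transcoding happens once
            per distinct name. </para>
         <programlisting><![CDATA[
```python
class SymbolTable:
    """Hypothetical table mapping raw name bytes to a symbol ID with a cached transcoded name."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding
        self.ids = {}             # raw bytes in the input encoding -> symbol ID
        self.names = []           # symbol ID -> transcoded name
        self.transcode_count = 0  # how many times transcoding was actually performed

    def lookup(self, raw: bytes) -> int:
        sym_id = self.ids.get(raw)
        if sym_id is None:
            # First occurrence: transcode once and remember the result.
            self.transcode_count += 1
            sym_id = len(self.names)
            self.names.append(raw.decode(self.encoding))
            self.ids[raw] = sym_id
        return sym_id
```
]]></programlisting>
         <para> Repeated calls to <code>lookup</code> with the same bytes return the same ID without
            decoding the name again. </para>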
485         <para> The cost of individual character transcoding is avoided whenever a block of input is
486            confined to the ASCII subset and for all but the first occurrence of any XML element or
487            attribute name. Furthermore, when transcoding is required, the parallel bitstream
488            representation supports efficient transcoding operations. In the important case of UTF-8
489            to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit
490            parallel fashion based on UTF-8 streams \cite{Cameron2008}, and all but the final bytes
491            of multi-byte sequences can be marked for deletion as discussed in the following
492            subsection. In other cases, transcoding within a block only need be applied for
493            non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream
494            using bit scan operations. </para>
495      </section>
496      <section>
497         <title>Combined Parallel Filtering</title>
498         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
499            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
500            Chinese characters <code>&#x4F60;&#x597D;</code> are represented as two
501            three-byte UTF-8 sequences <code>E4 BD A0</code> and <code>E5 A5 BD</code> while the
502            UTF-16 representation must be compressed down to the two code units <code>4F60</code>
503            and <code>597D</code>. In the bit parallel representation, this corresponds to a
504            reduction from six bit positions representing UTF-8 code units (bytes) down to just two
505            bit positions representing UTF-16 code units (double bytes). This compression may be
506            achieved by arranging to calculate the correct UTF-16 bits at the final position of each
507            sequence and creating a deletion mask to mark the first two bytes of each 3-byte
508            sequence for deletion. In this case, the portion of the mask corresponding to these
509            input bytes is the bit sequence <code>110110</code>. Using this approach, transcoding
510            may then be completed by applying parallel deletion and inverse transposition of the
            UTF-16 bitstreams \cite{Cameron2008}. </para>
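         <para> The deletion mask in this example can be reproduced with a small sequential sketch.
            The function name is an assumption, the input is assumed to be valid UTF-8, and the
            byte-at-a-time loop stands in for what icXML computes with bit parallel operations. </para>
         <programlisting><![CDATA[
```python
def utf8_deletion_mask(data: bytes) -> str:
    """Return '1' for every byte that is not the last byte of its UTF-8 sequence.

    Assumes `data` is valid UTF-8.
    """
    mask = []
    for i in range(len(data)):
        nxt = data[i + 1] if i + 1 < len(data) else None
        # A byte is non-final exactly when the next byte is a continuation byte (10xxxxxx).
        non_final = nxt is not None and (nxt & 0xC0) == 0x80
        mask.append("1" if non_final else "0")
    return "".join(mask)
```
]]></programlisting>
         <para> Applied to the six bytes <code>E4 BD A0 E5 A5 BD</code>, the sketch yields
            <code>110110</code>, leaving one surviving position per character. </para>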
512         <para> Rather than immediately paying the costs of deletion and transposition just for
513            transcoding, however, icXML defers these steps so that the deletion masks for several
514            stages of processing may be combined. In particular, this includes core XML requirements
            to normalize line breaks and to replace character references and entity references with
516            their corresponding text. In the case of line break normalization, all forms of line
517            breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations
            must be normalized to a single LF character in each case. In icXML, this is achieved by
            first marking CR positions, performing three bit-parallel XOR operations to transform the
            marked CRs into LFs, and then marking for deletion any LF that is found immediately
            after a marked CR, as shown by the Pablo source code in Figure
            \ref{fig:LBnormalization}.
              <figure>
                <caption>Line Break Normalization Logic</caption>
  <programlisting>
# XML 1.0 line-break normalization rules.
if lex.CR:
  # Modify CR (#x0D) to LF (#x0A)
  u16lo.bit_5 ^= lex.CR
  u16lo.bit_6 ^= lex.CR
  u16lo.bit_7 ^= lex.CR
  # Mark any LF that immediately follows a CR for deletion
  CRLF = pablo.Advance(lex.CR) &amp; lex.LF
  callouts.delmask |= CRLF
  # Adjust LF streams for line/column tracker
  lex.LF |= lex.CR
  lex.LF ^= CRLF
</programlisting>
              </figure>
         </para>
540         <para> In essence, the deletion masks for transcoding and for line break normalization each
541            represent a bitwise filter; these filters can be combined using bitwise-or so that the
542            parallel deletion algorithm need only be applied once. </para>
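         <para> A minimal Python model of this idea is sketched below. Bitstreams are represented as
            integers with bit <emphasis role="ital">i</emphasis> standing for position
            <emphasis role="ital">i</emphasis>, so a left shift by one plays the role of
            <code>pablo.Advance</code>. The helper names and the <code>extra_delmask</code>
            parameter are assumptions for illustration, not part of icXML. </para>
         <programlisting><![CDATA[
```python
def stream(text, predicate):
    """Build a bitstream (as an int) with bit i set when predicate(text[i]) holds."""
    bits = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            bits |= 1 << i
    return bits


def delete_once(text, *masks):
    """OR all deletion masks together, then remove the marked positions in one pass."""
    combined = 0
    for m in masks:
        combined |= m
    return "".join(ch for i, ch in enumerate(text) if not (combined >> i) & 1)


def normalize_line_breaks(text, extra_delmask=0):
    """Normalize CR and CR-LF to LF, sharing one deletion pass with another filter."""
    cr = stream(text, lambda ch: ch == "\r")
    lf = stream(text, lambda ch: ch == "\n")
    crlf = (cr << 1) & lf  # models pablo.Advance(lex.CR) & lex.LF
    # Turn every CR into an LF in place; positions do not move yet.
    converted = "".join("\n" if (cr >> i) & 1 else ch for i, ch in enumerate(text))
    # The LF of each CR-LF pair and any externally supplied positions go in a single pass.
    return delete_once(converted, crlf, extra_delmask)
```
]]></programlisting>
         <para> Passing a transcoding mask as <code>extra_delmask</code> shows the point of the
            design: the number of deletion passes stays at one regardless of how many filters
            contribute. </para>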
         <para> A further application of combined filtering is the processing of XML character and
            entity references. Consider, for example, the references <code>&amp;amp;</code> or
               <code>&amp;#x3C;</code>, which must be replaced in XML processing with the single
               <code>&amp;</code> and <code>&lt;</code> characters, respectively. The
            approach in icXML is to mark all but the first character position of each reference for
            deletion, leaving a single character position unmodified. Thus, for the references
               <code>&amp;amp;</code> or <code>&amp;#x3C;</code> the masks <code>01111</code> and
               <code>011111</code> are formed and combined into the overall deletion mask. After the
            deletion and inverse transposition operations are finally applied, a post-processing
            step inserts the proper character at these positions. One note about this process is
            that it is speculative; references are assumed to generally be replaced by a single
            UTF-16 code unit. Cases in which this assumption does not hold are handled in
            post-processing. </para>
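         <para> The reference masks can be illustrated with the following sketch. The regular
            expression is a simplified, ASCII-only approximation of XML reference syntax and the
            function name is hypothetical; icXML derives these positions from bitstreams rather than
            from pattern matching. </para>
         <programlisting><![CDATA[
```python
import re

# Simplified pattern: hexadecimal and decimal character references, or an ASCII entity name.
REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_][A-Za-z0-9._-]*);")


def reference_deletion_mask(text: str) -> str:
    """Mark every position of each reference except its first for deletion."""
    mask = ["0"] * len(text)
    for match in REFERENCE.finditer(text):
        for i in range(match.start() + 1, match.end()):
            mask[i] = "1"
    return "".join(mask)
```
]]></programlisting>
         <para> The surviving position of each reference is where post-processing later writes the
            replacement character. </para>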
         <para> The final step of combined filtering occurs during the process of reducing markup
            data to tag bytes preceding each significant XML transition as described in
            section \ref{section:arch:contentstream}. Overall, icXML avoids separate buffer copying
            operations for each of these filtering steps, paying the cost of parallel deletion
            and inverse transposition only once. Currently, icXML employs the parallel-prefix
            compress algorithm of Steele \cite{HackersDelight}. Performance is independent of the
            number of positions deleted. Future versions of icXML are expected to take advantage of
            the parallel extract operation \cite{HilewitzLee2006} that Intel is now providing in its
            Haswell architecture. </para>
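         <para> For readers unfamiliar with parallel-prefix compression, the sketch below transcribes
            the 32-bit compress routine from Hacker's Delight into Python. The name
            <code>compress32</code> and the 32-bit width are choices made for this illustration, and
            the mask here marks positions to keep, that is, the complement of a deletion mask. The
            loop always runs five rounds, which is why the cost does not depend on how many
            positions are removed. </para>
         <programlisting><![CDATA[
```python
MASK32 = 0xFFFFFFFF


def compress32(x: int, m: int) -> int:
    """Gather the bits of x selected by keep-mask m into the low-order bits (32-bit words)."""
    x &= m                                  # clear bits that are not kept
    mk = (~m << 1) & MASK32                 # we will count 0s to the right of each bit
    for i in range(5):                      # log2(32) rounds, independent of the data
        mp = mk ^ ((mk << 1) & MASK32)      # parallel prefix (running XOR)
        mp ^= (mp << 2) & MASK32
        mp ^= (mp << 4) & MASK32
        mp ^= (mp << 8) & MASK32
        mp ^= (mp << 16) & MASK32
        mv = mp & m                         # bits that move in this round
        m = (m ^ mv) | (mv >> (1 << i))     # compress the mask
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))       # compress the data
        mk &= ~mp & MASK32
    return x
```
]]></programlisting>
         <para> The parallel extract instruction mentioned above performs the same gather in a
            single operation. </para>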
565      </section>
566      <section>
567         <title>Content Stream</title>
         <para> A relatively unique concept for icXML is the use of a filtered content stream.
            Rather than parsing an XML document in its original format, the input is transformed
570            into one that is easier for the parser to iterate through and produce the sequential
            output. <!-- FIGURE REF Figure~\ref{fig:parabix2} --> For example, a source document
            <!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
            is transformed into the content stream
            <emphasis role="ital">0</emphasis><code>&gt;fee</code><emphasis role="ital">0</emphasis><code>=fie</code><emphasis role="ital">0</emphasis><code>=foe</code><emphasis role="ital">0</emphasis><code>&gt;</code><emphasis role="ital">0</emphasis><code>/fum</code><emphasis role="ital">0</emphasis><code>/</code>
            through the parallel filtering algorithm, described in section \ref{sec:parfilter}. </para>
         <para> Together with the symbol stream, the parser traverses the content stream to
            effectively reconstruct the input document in its output form. The initial <emphasis
578               role="ital">0</emphasis> indicates an empty content string. The following
579               <code>&gt;</code> indicates that a start tag without any attributes is the first
580            element in this text and the first unused symbol, <code>document</code>, is the element
581            name. Succeeding that is the content string <code>fee</code>, which is null-terminated
            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
            operations are required to produce these strings, which, as
            Figure \ref{fig:xerces-profile} shows, account for 6.83% of Xerces's execution time.
585            Additionally, it is cheap to locate the terminal character of each string: using the
586            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
587            null character in the content stream in parallel, which in turn means the parser can
588            directly jump to the end of every string without scanning for it. </para>
         <para> Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the
            existence of an attribute. Because all of the intra-element validation was performed in
            the Parabix Subsystem, this must be a legal attribute. Since attributes can only occur
            within start tags and must be accompanied by a textual value, the next symbol in the
            symbol stream must be the element name of a start tag, the following one must be the
            name of the attribute, and the string that follows the <code>=</code> must be its value.
            However, the subsequent <code>=</code> is not treated as the first attribute of a new
            element because the parser has yet to read a <code>&gt;</code>, which marks the end of
            a start tag. Thus only one symbol is taken from the symbol stream and it (along with the
            string value) is added to the element. Eventually the parser reaches a <code>/</code>,
            which marks the existence of an end tag. Every end tag requires an element name, which
            means it requires a symbol. Inter-element validation is performed whenever an empty tag
            is detected to ensure that the appropriate scope-nesting rules have been applied. </para>
602      </section>
603      <section>
604         <title>Namespace Handling</title>
605         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
         <para> In XML, namespaces prevent naming conflicts when multiple vocabularies are used
            together. This is especially important when a vocabulary has application-dependent
            meaning, such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
609            to uniform resource identifiers (URIs), which are strings used to identify specific
610            names or resources. On line 1 in the Table below, the <code>xmlns</code>
611            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
612               &apos;<code></code>&apos; and the default (empty) prefix to
613               <code></code>. Thus to the XML processor, the <code>title</code> on line 2
614            and <code>price</code> on line 4 both read as
615            <code>&quot;;:title</code> and
616               <code>&quot;;:price</code> respectively, whereas on line 3 and
617            5, <code>p:name</code> and <code>price</code> are seen as
618               <code>&quot;;:name</code> and
               <code>&quot;;:price</code>. Even though both elements share the name
               <code>price</code>, due to namespace scoping rules they are viewed as two
621            uniquely-named items because the current vocabulary is determined by the namespace(s)
622            that are in-scope. </para>
               <table>
624                  <caption>
625                     <para>XML Namespace Example</para>
626                  </caption>
627                  <colgroup>
628                     <col align="centre" valign="top"/>
629                     <col align="left" valign="top"/>
630                  </colgroup>
631                  <tbody>
632 <tr><td>1. </td><td><![CDATA[<book xmlns:p="" xmlns="">]]> </td></tr>
633 <tr><td>2. </td><td><![CDATA[  <title>BOOK NAME</title>]]> </td></tr>
634 <tr><td>3. </td><td><![CDATA[  <p:name>PUBLISHER NAME</p:name>]]> </td></tr>
635 <tr><td>4. </td><td><![CDATA[  <price>X</price>]]> </td></tr>
636 <tr><td>5. </td><td><![CDATA[  <price xmlns="">Y</price>]]> </td></tr>
637 <tr><td>6. </td><td><![CDATA[</book>]]> </td></tr>
638                  </tbody>
639               </table>         
641         <para> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
642            persist for the lifetime of the application through the use of a global URI pool. Xerces
643            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
644            (end tag) occurs in the document. Because a namespace declaration affects the entire
645            element, it must be processed prior to grammar validation. This is a costly process
646            considering that a typical namespaced XML document only comes in one of two forms: (1)
647            those that declare a set of namespaces upfront and never change them, and (2) those that
648            repeatedly modify the namespaces in predictable patterns. </para>
         <para> For that reason, icXML contains an independent namespace stack and utilizes bit
            vectors to cheaply perform scope resolution. <!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
            <!-- performance advantage figure?? average cycles/byte cost? --> When a prefix is
            declared (e.g., <code>xmlns:p=&quot;;</code>), a namespace binding
            is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution
            process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and
            every prefix contains a bit vector marking every NSID that has ever been associated with
656            it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding
657            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
658            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
659            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
660            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
661            found using a bit-scan intrinsic. A namespace binding table, similar to Table
662            \ref{tbl:namespace1}, provides the actual URI ID. </para>
               <table>
664                  <caption>
665                     <para>Namespace Binding Table Example</para>
666                  </caption>
667                  <colgroup>
668                     <col align="centre" valign="top"/>
669                     <col align="centre" valign="top"/>
670                     <col align="centre" valign="top"/>
671                     <col align="centre" valign="top"/>
672                     <col align="centre" valign="top"/>
673                   </colgroup>
674                   <thead>
675                     <tr><th>NSID </th><th> Prefix </th><th> URI </th><th> Prefix ID </th><th> URI ID </th>
676                     </tr>
677                   </thead>
678                  <tbody>
679<tr><td>0 </td><td> <code> p</code> </td><td> <code></code> </td><td> 0 </td><td> 0 </td></tr> 
680 <tr><td>1 </td><td> <code> xmlns</code> </td><td> <code></code> </td><td> 1 </td><td> 1 </td></tr> 
681 <tr><td>2 </td><td> <code> xmlns</code> </td><td> <code></code> </td><td> 1 </td><td> 0 </td></tr> 
682                  </tbody>
683               </table>         
684         <para>
685            <!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
686            <!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
687            <!-- NSid = bitscan(VisiblePrefixBinding); -->
688            <!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
689         </para>
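         <para> The resolution steps described above can be sketched in Python as follows. The table
            contents mirror the example binding table, but the bit layout (bit
            <emphasis role="ital">n</emphasis> of a vector stands for NSID
            <emphasis role="ital">n</emphasis>), the variable names, and the error handling are
            assumptions of this sketch rather than details of icXML. </para>
         <programlisting><![CDATA[
```python
def bitscan(v: int) -> int:
    """Return the index of the lowest set bit of a non-zero integer."""
    return (v & -v).bit_length() - 1


# NSID -> (prefix ID, URI ID), following the example namespace binding table.
NAMESPACE_BINDINGS = [(0, 0), (1, 1), (1, 0)]

# Prefix ID -> bit vector of every NSID ever associated with that prefix.
PREFIX_BINDINGS = [0, 0]
for nsid, (prefix_id, _uri_id) in enumerate(NAMESPACE_BINDINGS):
    PREFIX_BINDINGS[prefix_id] |= 1 << nsid


def resolve_uri_id(prefix_id: int, visible: int) -> int:
    """Find the URI ID of the binding of `prefix_id` that is currently in scope."""
    in_scope = PREFIX_BINDINGS[prefix_id] & visible
    if in_scope == 0:
        raise LookupError("prefix has no binding in scope")
    return NAMESPACE_BINDINGS[bitscan(in_scope)][1]
```
]]></programlisting>
         <para> With NSIDs 0 and 1 visible, the default prefix (Prefix ID 1) resolves to URI ID 1;
            once NSID 2 replaces NSID 1 in the visible set, the same prefix resolves to URI ID 0. </para>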
690         <para> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
691            any modification to the currently visible namespaces is calculated and stored within a
692            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
693            is found, the currently visible namespaces is XORed with the vector at the top of the
            stack. This allows any number of changes to be performed at each scope-level in
            constant time.
696            <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
697         </para>
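         <para> A minimal sketch of this scope stack follows. The class and parameter names are
            hypothetical, and the sketch assumes that the caller supplies the NSIDs a start tag
            introduces together with the visible NSIDs they shadow. </para>
         <programlisting><![CDATA[
```python
class NamespaceScopes:
    """Tracks the currently visible namespace bindings as a bit vector (bit n = NSID n)."""

    def __init__(self):
        self.visible = 0
        self.stack = []  # one vector of locally modified bindings per open element

    def start_tag(self, declared: int = 0, shadowed: int = 0):
        # Bits that change at this element: new bindings switch on, shadowed ones switch off.
        delta = declared ^ shadowed
        self.stack.append(delta)
        self.visible ^= delta

    def end_tag(self):
        # A single XOR undoes every change made by the matching start tag.
        self.visible ^= self.stack.pop()
```
]]></programlisting>
         <para> Because each element contributes one vector, leaving a scope costs one pop and one
            XOR no matter how many bindings the element changed. </para>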
698      </section>
699      <section>
700         <title>Error Handling</title>
701         <para>
702            <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
703            Xerces outputs error messages in two ways: through the programmer API and as thrown
            objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to
            assess whether the next character is legal; if not, the current state determines the
            type and severity of the error. icXML emits errors in a similar manner, but
707            how it discovers them is substantially different. Recall that in Figure
708            \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and
709            Markup Processor, each with its own system for detecting and producing error messages. </para>
710         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
711            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
712            the byte-position of an error within a block, and the type of error is determined by the
713            equation that discovered it. The difficulty of error processing in this section is that
714            in Xerces the line and column number must be given with every error production. Two
            major issues exist because of this: (1) line position adheres to XML white-space
            normalization rules; as such, some sequences of characters, e.g., a carriage return
            followed by a line feed, are counted as a single new line character; and (2) column
            position is counted in characters, not bytes or code units; thus multi-code-unit
            code-points and surrogate
719            character pairs are all counted as a single column position. Note that typical XML
720            documents are error-free but the calculation of the line/column position is a constant
721            overhead in Xerces. <!-- that must be maintained in the case that one occurs. --> To
722            reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence
723            of the error and performs the minimal amount of book-keeping necessary to facilitate it.
724            icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates
725            the information within the Line Column Tracker (LCT). One of the CSA's major
726            responsibilities is transcoding an input text.
727            <!-- from some encoding format to near-output-ready UTF-16. --> During this process,
728            white-space normalization rules are applied and multi-code-unit and surrogate characters
729            are detected and validated. A <emphasis role="ital">line-feed bitstream</emphasis>,
            which marks the positions of the normalized new line characters, is a natural
            derivative of this process. Using an optimized population count algorithm, the line
            count can be summarized cheaply for each valid block of text.
            <!-- The optimization delays the counting process .... --> Column position is more
            difficult to calculate. It is possible to scan backwards through the bitstream of new
            line characters to determine the distance (in code-units) between the position at
            which an error was detected and the last line feed. However, this distance may exceed
            the actual character position for the reasons discussed in (2). To handle this, the
738            CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together
739            many relevant bitstreams, such as all trailing multi-code-unit and surrogate characters,
740            and any characters that were removed during the normalization process. When an error is
741            detected, the sum of those skipped positions is subtracted from the distance to
742            determine the actual column number. </para>
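         <para> The deferred line/column calculation can be sketched as below. Bitstreams are again
            modelled as integers with bit <emphasis role="ital">i</emphasis> standing for code-unit
            position <emphasis role="ital">i</emphasis>; the function name, the 1-based numbering,
            and the use of a string-based population count are assumptions for illustration. </para>
         <programlisting><![CDATA[
```python
def line_and_column(error_pos: int, lf_stream: int, skip_mask: int):
    """Return a 1-based (line, column) for the code unit at `error_pos`."""
    before = (1 << error_pos) - 1                  # all positions preceding the error
    preceding_lf = lf_stream & before
    line = bin(preceding_lf).count("1") + 1        # population count of earlier line feeds
    # Position just after the last preceding line feed (0 when there is none).
    line_start = preceding_lf.bit_length()
    distance = error_pos - line_start              # distance in code units
    # Skipped positions on the current line do not occupy a column of their own.
    on_this_line = before & ~((1 << line_start) - 1)
    skipped = bin(skip_mask & on_this_line).count("1")
    return line, distance - skipped + 1
```
]]></programlisting>
         <para> For the UTF-8 bytes of a two-line text whose second line contains one two-byte
            character before the error, the skip mask removes exactly one position from the
            column count. </para>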
743         <para> The Markup Processor is a state-driven machine. As such, error detection within it
744            is very similar to Xerces. However, reporting the correct line/column is a much more
745            difficult problem. The Markup Processor parses the content stream, which is a series of
746            tagged UTF-16 strings. Each string is normalized in accordance with the XML
            specification. All symbol data and unnecessary whitespace are eliminated from the stream;
            thus it is impossible to derive the current location using only the content stream. To
749            calculate the location, the Markup Processor borrows three additional pieces of
750            information from the Parabix Subsystem: the line-feed, skip mask, and a <emphasis
751               role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the
752            (code-unit) position of every datum that was suppressed from the source during the
753            production of the content stream. Armed with these, it is possible to calculate the
754            actual line/column using the same system as the Parabix Subsystem until the sum of the
755            negated deletion mask stream is equal to the current position. </para>
756      </section>
757   </section>
759   <section>
760      <title>Multithreading with Pipeline Parallelism</title>
      <para> As discussed in section \ref{background:xerces}, Xerces can be considered an FSM
         application. These are &quot;embarrassingly
         sequential&quot; \cite{Asanovic:EECS-2006-183} and notoriously difficult to
764         parallelize. However, icXML is designed to organize processing into logical layers. In
765         particular, layers within the Parabix Subsystem are designed to operate over significant
766         segments of input data before passing their outputs on for subsequent processing. This fits
767         well into the general model of pipeline parallelism, in which each thread is in charge of a
768         single module or group of modules. </para>
      <para> The most straightforward division of work in icXML is to separate the Parabix Subsystem
         and the Markup Processor into two distinct pipeline stages. The
         resultant application, <emphasis role="ital">icXML-p</emphasis>, is a coarse-grained
         software-pipeline application. In this case, the Parabix Subsystem thread
773               <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a
774         time and produces the content, symbol and URI streams, then stores them in a pre-allocated
775         shared data structure <code>S</code>. The Markup Processor thread
776            <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness
         and grammar-based validation, and then provides parsed XML data to the application through
778         the Xerces API. The shared data structure is implemented using a ring buffer, where every
779         entry contains an independent set of data streams. In the examples of Figure
780         \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A
         lock-free mechanism is applied to ensure that each entry can only be read or written by one
         thread at a time. In Figure \ref{threads_timeline1} the processing time of
783               <code>T<subscript>1</subscript></code> is longer than
784         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
785         waits for <code>T<subscript>1</subscript></code> to write to the shared memory. Figure
786         \ref{threads_timeline2} illustrates the scenario in which
787         <code>T<subscript>1</subscript></code> is faster and must wait for
788            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
789         reuse the memory space. </para>
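      <para> The two-stage structure can be modelled with the following Python sketch. It is not the
         icXML implementation: the standard library queue used here relies on locks rather than a
         lock-free ring buffer, and the names <code>run_pipeline</code>, <code>stage1</code>, and
         <code>stage2</code> are hypothetical. It only shows how a bounded number of entries makes
         whichever thread is faster wait for the other. </para>
      <programlisting><![CDATA[
```python
import queue
import threading

_DONE = object()  # end-of-input marker (an assumption of this sketch)


def run_pipeline(segments, stage1, stage2, entries=4):
    """Run stage1 on a producer thread and stage2 on the calling thread.

    `entries` bounds how far the producer may run ahead, like the ring buffer size.
    """
    ring = queue.Queue(maxsize=entries)

    def producer():
        for segment in segments:
            ring.put(stage1(segment))  # blocks while every entry is still unread
        ring.put(_DONE)

    t1 = threading.Thread(target=producer)
    t1.start()
    results = []
    while True:
        entry = ring.get()             # blocks until the producer has written an entry
        if entry is _DONE:
            break
        results.append(stage2(entry))
    t1.join()
    return results
```
]]></programlisting>
      <para> Here <code>stage1</code> stands in for the Parabix Subsystem and <code>stage2</code>
         for the Markup Processor. </para>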
790      <para>
791        <figure xml:id="threads_timeline1">
792          <title>Thread Balance in Two-Stage Pipelines</title>
793          <mediaobject>
794            <imageobject>
795              <imagedata format="png" fileref="threads_timeline1.png" width="500cm"/>
796            </imageobject>
797          </mediaobject>
798          <mediaobject>
799            <imageobject>
800              <imagedata format="png" fileref="threads_timeline2.png" width="500cm"/>
801            </imageobject>
802          </mediaobject>
803          <caption>
804          </caption>
805        </figure>
806      </para>
      <para> Overall, our design is intended to benefit a range of applications. Conceptually, we
         consider two design points. In the first, the parsing performed by the Parabix Subsystem
         dominates at 67% of the overall cost, with the cost of application processing (including
         the driver logic within the Markup Processor) at 33%. The second is almost the opposite
         scenario: the cost of application processing dominates at 60%, while the cost of XML
         parsing represents an overhead of 40%. </para>
      <para> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
         100% improvement in the parsing engine itself. In the best-case scenario, we assume a 100%
         improvement of the Parabix Subsystem for the design point in which XML parsing dominates
         at 67% of the total application cost. In this case, the single-threaded icXML should
         achieve a 1.5x speedup over Xerces so that the total application cost reduces to 67% of
         the original. However, in icXML-p, our ideal scenario gives us two well-balanced threads
         each performing about 33% of the original work. In this case, Amdahl's law predicts that
         we could expect up to a 3x speedup at best. </para>
821      <para> At the other extreme of our design range, we consider an application in which core
822         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
823         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
824         significant aspect of our two-stage multi-threaded design then becomes the ability to hide
825         the entire latency of parsing within the serial time required by the application. In this
         case, we achieve an overall speedup in processing time of 1.67x. </para>
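      <para> The figures quoted for both design points follow from simple arithmetic, reproduced in
         the sketch below. The function names are ours, and the pipelined model assumes an ideal
         two-stage pipeline whose running time equals that of its slower stage. </para>
      <programlisting><![CDATA[
```python
def single_threaded_speedup(parse_fraction: float, parse_speedup: float) -> float:
    """Overall speedup when only the parsing share of the work is accelerated."""
    remaining = parse_fraction / parse_speedup + (1.0 - parse_fraction)
    return 1.0 / remaining


def pipelined_speedup(parse_fraction: float, parse_speedup: float) -> float:
    """Overall speedup of an ideal two-stage pipeline, bounded by its slower stage."""
    parse_time = parse_fraction / parse_speedup
    application_time = 1.0 - parse_fraction
    return 1.0 / max(parse_time, application_time)


# First design point: parsing is 67% of the cost and becomes twice as fast.
print(round(single_threaded_speedup(0.67, 2.0), 2), round(pipelined_speedup(0.67, 2.0), 2))
# Second design point: parsing is 40% of the cost and becomes twice as fast.
print(round(single_threaded_speedup(0.40, 2.0), 2), round(pipelined_speedup(0.40, 2.0), 2))
```
]]></programlisting>
      <para> In the second design point the application stage, at 60% of the original time, is the
         slower one, which is where the 1.67x figure comes from. </para>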
827      <para> Although the structure of the Parabix Subsystem allows division of the work into
828         several pipeline stages and has been demonstrated to be effective for four pipeline stages
829         in a research prototype \cite{HPCA2012}, our analysis here suggests that the further
         pipelining of work within the Parabix Subsystem is not worthwhile if the cost of
         application logic is as little as 33% of the end-to-end cost using Xerces. To achieve benefits
832         of further parallelization with multi-core technology, there would need to be reductions in
833         the cost of application logic that could match reductions in core parsing cost. </para>
834   </section>
836   <section>
837      <title>Performance</title>
      <para> We evaluate Xerces-C++ 3.1.1, icXML, and icXML-p using two benchmarking applications: the
         Xerces C++ SAXCount sample application, and a real world GML to SVG transformation
840         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
841         Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core)
842         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
843         12.04 (Linux). </para>
844      <para> We analyzed the execution profiles of each XML parser using the performance counters
845         found in the processor. We chose several key hardware events that provide insight into the
846         profile of each application and indicate if the processor is doing useful work. The set of
847         events included in our study are: processor cycles, branch instructions, branch
848         mispredictions, and cache misses. The Performance Application Programming Interface (PAPI)
849         Version 5.5.0 \cite{papi} toolkit was installed on the test system to facilitate the
850         collection of hardware performance monitoring statistics. In addition, we used the Linux
851         perf \cite{perf} utility to collect per core hardware events. </para>
852      <section>
853         <title>Xerces C++ SAXCount</title>
854         <para> Xerces comes with sample applications that demonstrate salient features of the
855            parser. SAXCount is the simplest such application: it counts the elements, attributes
856            and characters of a given XML file using the (event based) SAX API and prints out the
857            totals. </para>
         <para> <xref linkend="XMLDocChars"/> shows the document characteristics of the XML input
            files selected for the Xerces-C++ SAXCount benchmark. The jaw.xml file represents
            document-oriented XML input and contains the three-byte and four-byte UTF-8 sequences
            required to encode Japanese characters. The remaining files are data-oriented XML
            documents and consist entirely of single-byte ASCII characters. </para>
         <table xml:id="XMLDocChars">
            <caption>
               <para>XML Document Characteristics</para>
            </caption>
            <colgroup>
               <col align="left" valign="top"/>
               <col align="center" valign="top"/>
               <col align="center" valign="top"/>
               <col align="center" valign="top"/>
               <col align="center" valign="top"/>
            </colgroup>
            <tbody>
               <tr><td>File Name</td><td>jaw.xml</td><td>road.gml</td><td>po.xml</td><td>soap.xml</td></tr>
               <tr><td>File Type</td><td>document</td><td>data</td><td>data</td><td>data</td></tr>
               <tr><td>File Size (kB)</td><td>7343</td><td>11584</td><td>76450</td><td>2717</td></tr>
               <tr><td>Markup Item Count</td><td>74882</td><td>280724</td><td>4634110</td><td>18004</td></tr>
               <tr><td>Markup Density</td><td>0.13</td><td>0.57</td><td>0.76</td><td>0.87</td></tr>
            </tbody>
         </table>
         <para> A key predictor of the overall parsing performance of an XML file is markup
            density<footnote><para>Markup density: the ratio of the markup bytes used to define the
            structure of the document to its file size.</para></footnote>. This metric has
            substantial influence on the performance of traditional recursive descent XML parsers
            because it directly corresponds to the number of state transitions that occur when
            parsing a document. We use a mixture of document-oriented and data-oriented XML files
            to analyze performance over a spectrum of markup densities. </para>
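         <para> The ratio can be approximated with a short script. The sketch below assumes that
            "markup bytes" means every byte from a '&lt;' through the next '&gt;', which is a
            simplification: it miscounts a '&gt;' that appears inside an attribute value, comment,
            or CDATA section, and it does not count entity references. The function name
            markup_density is hypothetical, and the tool used to produce the figures in the table
            may define markup differently. </para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def markup_density(data):
    """Return the fraction of bytes that lie inside '<' ... '>' (inclusive)."""
    if not data:
        return 0.0
    markup = 0
    in_tag = False
    for byte in data:
        if byte == 0x3C:  # '<' opens a markup item
            in_tag = True
        if in_tag:
            markup += 1
        if byte == 0x3E:  # '>' closes the current markup item
            in_tag = False
    return markup / len(data)


if __name__ == "__main__":
    # "<a>" and "</a>" contribute 7 markup bytes out of 11 in total.
    print(markup_density(b"<a>text</a>"))
```
]]></programlisting>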
         <para> <xref linkend="perf_SAX"/> compares the performance of Xerces, icXML, and pipelined
            icXML in terms of CPU cycles per byte for the SAXCount application. The speedup for
            icXML over Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p
            achieves a speedup of up to 2.7x. Xerces is substantially slowed by dense markup,
            whereas icXML is less affected because it executes fewer branches and uses
            parallel-processing techniques. icXML-p performs better as markup density increases
            because the work performed by each stage is well balanced in this application. </para>
         <para>
            <figure xml:id="perf_SAX">
               <title>SAXCount Performance Comparison</title>
               <mediaobject>
                  <imageobject>
                     <imagedata format="png" fileref="perf_SAX.png" width="500cm"/>
                  </imageobject>
               </mediaobject>
               <caption>
               </caption>
            </figure>
         </para>
      </section>
      <section>
         <title>GML2SVG</title>
         <para> As a more substantial application of XML processing, the GML-to-SVG (GML2SVG)
            application was chosen. This application transforms geospatial data encoded in
            Geography Markup Language (GML) \cite{lake2004geography} into a different XML format
            suitable for displaying maps: Scalable Vector Graphics (SVG) \cite{lu2007advances}. In
            the GML2SVG benchmark, GML feature element and GML geometry element tags are matched.
            GML coordinate data are then extracted and transformed to the corresponding SVG path
            data encodings. Equivalent SVG path elements are generated and output to the
            destination SVG document. The GML2SVG application is thus considered typical of a broad
            class of XML applications that parse and extract information from a known XML format
            for the purpose of analysis and restructuring to meet the requirements of an
            alternative format. </para>
         <para> Our GML-to-SVG data translations are executed on GML source data modelling the city
            of Vancouver, British Columbia, Canada. The GML source document set consists of 46
            distinct GML feature layers ranging in size from approximately 9 kB to 125.2 MB, with
            an average document size of 18.6 MB. Markup density ranges from approximately 0.045 to
            0.719, with an average markup density of 0.519. In this performance study, 213.4 MB of
            source GML data generates 91.9 MB of target SVG data. </para>
         <figure xml:id="perf_GML2SVG">
            <title>Performance Comparison for GML2SVG</title>
            <mediaobject>
               <imageobject>
                  <imagedata format="png" fileref="Throughput.png" width="500cm"/>
               </imageobject>
            </mediaobject>
            <caption>
            </caption>
         </figure>
         <para> <xref linkend="perf_GML2SVG"/> compares the performance of the GML2SVG application
            linked against Xerces, icXML, and icXML-p. On the GML workload with this application,
            single-threaded icXML achieved about a 50% acceleration over Xerces, increasing
            throughput on our test machine from 58.3 MB/sec to 87.9 MB/sec. Using icXML-p, a
            further throughput increase to 111 MB/sec was recorded, approximately a 2x speedup
            over Xerces. </para>
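         <para> The quoted ratios follow directly from the reported throughputs. The short sketch
            below redoes the arithmetic; the function and constant names are hypothetical, and the
            three throughput values are the only inputs taken from the measurements above. </para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def speedup(baseline_mb_per_s, improved_mb_per_s):
    """Ratio of improved throughput to baseline throughput."""
    return improved_mb_per_s / baseline_mb_per_s


# Throughputs reported for GML2SVG, in MB/sec.
XERCES = 58.3
ICXML = 87.9
ICXML_P = 111.0

if __name__ == "__main__":
    print(f"icXML over Xerces:   {speedup(XERCES, ICXML):.2f}x")    # 1.51x
    print(f"icXML-p over Xerces: {speedup(XERCES, ICXML_P):.2f}x")  # 1.90x
    print(f"icXML-p over icXML:  {speedup(ICXML, ICXML_P):.2f}x")   # 1.26x
```
]]></programlisting>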
         <para> An important aspect of icXML is the replacement of much branch-laden sequential
            code inside Xerces with straight-line SIMD code using far fewer branches.
            <xref linkend="branchmiss_GML2SVG"/> shows the corresponding improvement in branching
            behaviour, with a dramatic reduction in branch misses per kB. Notably, icXML-p reduces
            branch misses even further. In essence, because pipeline parallelism splits the
            instruction stream across separate cores, the branch target buffers on each core are
            less overloaded and the rate of successful branch prediction rises. </para>
         <figure xml:id="branchmiss_GML2SVG">
            <title>Comparative Branch Misprediction Rate</title>
            <mediaobject>
               <imageobject>
                  <imagedata format="png" fileref="BM.png" width="500cm"/>
               </imageobject>
            </mediaobject>
            <caption>
            </caption>
         </figure>
         <para> The behaviour of the three versions with respect to L1 cache misses per kB is shown
            in <xref linkend="cachemiss_GML2SVG"/>. Both instruction-cache and data-cache
            performance improve, with the improvement in instruction-cache behaviour the most
            dramatic. Single-threaded icXML shows substantially improved performance over Xerces on
            both measures. Although icXML-p is slightly worse with respect to data-cache
            performance, this is more than offset by a further dramatic reduction in
            instruction-cache miss rate. Again, partitioning the instruction stream through the
            pipeline parallelism model has significant benefit. </para>
         <figure xml:id="cachemiss_GML2SVG">
            <title>Comparative Cache Miss Rate</title>
            <mediaobject>
               <imageobject>
                  <imagedata format="png" fileref="CM.png" width="500cm"/>
               </imageobject>
            </mediaobject>
            <caption>
            </caption>
         </figure>
         <para> One caveat with this study is that the GML2SVG application did not exhibit a
            balance of processing between application code and Xerces library code that reached
            the 33% figure. This suggests that for this application, and possibly others, further
            separating the logical layers of the icXML engine into different pipeline stages could
            well offer significant benefit. This remains an area of ongoing work. </para>
      </section>
   </section>
   <section>
      <title>Conclusion and Future Work</title>
      <para> This paper is the first case study documenting the significant performance benefits
         that may be realized through the integration of parallel bitstream technology into existing
         widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined
         integration of SIMD and multicore parallelism was shown capable of producing dramatic
         increases in throughput and reductions in branch mispredictions and cache misses. The
         modified parser, named icXML, is designed to provide the full functionality of the
         original Xerces library with complete compatibility of APIs. Although substantial
         re-engineering was required to realize the performance potential of parallel
         technologies, this is an important case study demonstrating the general feasibility of
         these techniques. </para>
      <para> The further development of icXML to move beyond 2-stage pipeline parallelism is
         ongoing, with realistic prospects for four reasonably balanced stages within the library.
         For applications such as GML2SVG, which are dominated by time spent on XML parsing, such a
         multistage pipelined parsing library should offer substantial benefits. </para>
      <para> The example of XML parsing may be considered prototypical of finite-state machine
         applications, which have sometimes been considered &quot;embarrassingly
         sequential&quot; and so difficult to parallelize that &quot;nothing
         works.&quot; The case study presented here should therefore be considered an important
         data point in making the case that parallelization can indeed be helpful across a broad
         array of application types. </para>
      <para> To overcome the software engineering challenges in applying parallel bitstream
         technology to existing software systems, it is clear that better library and tool support
         is needed. The techniques used in the implementation of icXML and documented in this paper
         could well be generalized for applications in other contexts and automated through the
         creation of compiler technology specifically supporting parallel bitstream programming.
      </para>
   </section>
   <!-- 
   <section>
      <title>Acknowledgments</title>
      <para></para>
   </section>
   <bibliography>
      <title>Bibliography</title>
      <bibliomixed xml:id="XMLChip09" xreflabel="Leventhal and Lemoine 2009">Leventhal, Michael and
         Eric Lemoine. 2009. The XML chip at 6 years. Proceedings of International Symposium on
         Processing XML Efficiently 2009, Montréal.</bibliomixed>
      <bibliomixed xml:id="Datapower09" xreflabel="Salz, Achilles and Maze 2009">Salz, Richard,
         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
         Efficiently 2009, Montréal.</bibliomixed>
      <bibliomixed xml:id="PPoPP08" xreflabel="Cameron 2007">Cameron, Robert D. 2007. A Case Study
         in SIMD Text Processing with Parallel Bit Streams: UTF-8 to UTF-16 Transcoding. Proceedings
         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
         Lake City, Utah. On the Web at <link></link>.</bibliomixed>
      <bibliomixed xml:id="CASCON08" xreflabel="Cameron, Herdy and Lin 2008">Cameron, Robert D.,
         Kenneth S. Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
         Technology. Proceedings of CASCON 2008, Toronto.</bibliomixed>
      <bibliomixed xml:id="SVGOpen08" xreflabel="Herdy, Burggraf and Cameron 2008">Herdy, Kenneth
         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
         Nuremberg. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="Ross06" xreflabel="Ross 2006">Ross, Kenneth A. 2006. Efficient hash
         probes on modern processors. Proceedings of ICDE 2006, Atlanta. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="ASPLOS09" xreflabel="Cameron and Lin 2009">Cameron, Robert D. and Dan
         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</bibliomixed>
      <bibliomixed xml:id="Wu08" xreflabel="Wu et al. 2008">Wu, Yu, Qi Zhang, Zhiqiang Yu and
         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
         Proceedings of Balisage 2008, Montréal. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="u8u16" xreflabel="Cameron 2008">u8u16 - A High-Speed UTF-8 to UTF-16
         Transcoder Using Parallel Bit Streams. Technical Report 2007-18. School of Computing
         Science, Simon Fraser University, June 21 2007.</bibliomixed>
      <bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth
         Edition). W3C Recommendation 26 November 2008. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="Unicode" xreflabel="Unicode">The Unicode Consortium. 2009. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="Pex06" xreflabel="Hilewitz and Lee 2006">Hilewitz, Y. and Ruby B. Lee.
         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</bibliomixed>
      <bibliomixed xml:id="InfoSet" xreflabel="XML Infoset">XML Information Set (Second Edition). W3C
         Recommendation 4 February 2004. On the Web at
         <link></link>.</bibliomixed>
      <bibliomixed xml:id="Saxon" xreflabel="Saxon">SAXON The XSLT and XQuery Processor. On the Web
         at <link></link>.</bibliomixed>
      <bibliomixed xml:id="Kay08" xreflabel="Kay 2008">Kay, Michael Y. 2008. Ten Reasons Why Saxon
         XQuery is Fast. IEEE Data Engineering Bulletin, December 2008.</bibliomixed>
      <bibliomixed xml:id="AElfred" xreflabel="Ælfred">The Ælfred XML Parser. On the Web at
            <link></link>.</bibliomixed>
      <bibliomixed xml:id="JNI" xreflabel="Hitchens 2002">Hitchens, Ron. Java NIO. O'Reilly, 2002.</bibliomixed>
      <bibliomixed xml:id="Expat" xreflabel="Expat">The Expat XML Parser.
            <link></link>.</bibliomixed>
   </bibliography>