1<?xml version="1.0" encoding="UTF-8"?>
2<!-- MODIFIED DTD LOCATION -->
3<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
4<article xmlns="http://docbook.org/ns/docbook" version="5.0-subset Balisage-1.3"
5   xml:id="HR-23632987-8973">
6   <title/>
7   <info>
8      <!--
9      <confgroup>
10         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
11            Space, Time, or Bandwidth</conftitle>
12         <confdates>August 10 2009</confdates>
13      </confgroup>
14-->
15      <abstract>
16         <para>Prior research on the acceleration of XML processing using SIMD and multi-core
17            parallelism has led to a number of interesting research prototypes. This work
18            investigates the extent to which the techniques underlying these prototypes could result
19            in systematic performance benefits when fully integrated into a commercial XML parser.
20            The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the
21            foundation for the study. A systematic restructuring of the parser was undertaken, while
22            maintaining the existing API for application programmers. Using SIMD techniques alone,
23            an increase in parsing speed of at least 50% was observed in a range of applications.
24            When coupled with pipeline parallelism on dual core processors, improvements of 2x and
25            beyond were realized. </para>
26      </abstract>
27      <author>
28         <personname>
29            <firstname>Nigel</firstname>
30            <surname>Medforth</surname>
31         </personname>
32         <personblurb>
33            <para>Nigel Medforth is an M.Sc. student at Simon Fraser University and the lead
34               developer of icXML. He earned a Bachelor of Technology in Information Technology at
35               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
36               Outstanding Achievement.</para>
37            <para>Nigel is currently researching ways to leverage both the Parabix framework and
38               stream-processing models to further accelerate XML parsing within icXML.</para>
39         </personblurb>
40         <affiliation>
41            <jobtitle>Developer</jobtitle>
42            <orgname>International Characters Inc.</orgname>
43         </affiliation>
44         <affiliation>
45            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
46            <orgname>Simon Fraser University </orgname>
47         </affiliation>
48         <email>nmedfort@sfu.ca</email>
49      </author>
50      <author>
51         <personname>
52            <firstname>Dan</firstname>
53            <surname>Lin</surname>
54         </personname>
55         <personblurb>
56           <para>Dan Lin is a Ph.D. student at Simon Fraser University. She earned a Master of Science
57             in Computing Science at Simon Fraser University in 2010. Her research focuses on high-performance
58             algorithms that exploit parallelization strategies on various multicore platforms.
59           </para>
60         </personblurb>
61         <affiliation>
62            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
63            <orgname>Simon Fraser University </orgname>
64         </affiliation>
65         <email>lindanl@sfu.ca</email>
66      </author>
67      <author>
68         <personname>
69            <firstname>Kenneth</firstname>
70            <surname>Herdy</surname>
71         </personname>
72         <personblurb>
73            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
74               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
75               of Science in Computing Science with a Certificate in Spatial Information Systems at
76               Simon Fraser University in 2005. </para>
77            <para> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
78               University with industrial scholarship support from the Natural Sciences and
79               Engineering Research Council of Canada, the Mathematics of Information Technology and
80               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
81               of the principal techniques that may be used to improve XML processing performance in
82               the context of the Geography Markup Language (GML). </para>
83         </personblurb>
84         <affiliation>
85            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
86            <orgname>Simon Fraser University </orgname>
87         </affiliation>
88         <email>ksherdy@sfu.ca</email>
89      </author>
90      <author>
91         <personname>
92            <firstname>Rob</firstname>
93            <surname>Cameron</surname>
94         </personname>
95         <personblurb>
96            <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
97               Sciences at Simon Fraser University. His research interests include programming
98               language and software system technology, with a specific focus on high performance
99               text processing using SIMD and multicore parallelism. He is the developer of the REX
100               XML shallow parser as well as the parallel bit stream (Parabix) framework for SIMD
101               text processing. </para>
102         </personblurb>
103         <affiliation>
104            <jobtitle>Professor of Computing Science</jobtitle>
105            <orgname>Simon Fraser University</orgname>
106         </affiliation>
107         <affiliation>
108            <jobtitle>Chief Technology Officer</jobtitle>
109            <orgname>International Characters, Inc.</orgname>
110         </affiliation>
111         <email>cameron@cs.sfu.ca</email>
112      </author>
113      <author>
114         <personname>
115            <firstname>Arrvindh</firstname>
116            <surname>Shriraman</surname>
117         </personname>
118         <personblurb>
119            <para/>
120         </personblurb>
121         <affiliation>
122            <jobtitle/>
123            <orgname/>
124         </affiliation>
125         <email/>
126      </author>
127      <!--
128      <legalnotice>
129         <para>Copyright &#x000A9; 2013 Nigel Medforth, Dan Lin, Kenneth S. Herdy, Robert D. Cameron  and Arrvindh Shriraman.
130            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
131            Works 2.5 Canada License.</para>
132      </legalnotice>
133-->
134      <keywordset role="author">
135         <keyword/>
136      </keywordset>
137
138   </info>
139 <section>
140      <title>Introduction</title>
141      <para>   
142        Parallelization and acceleration of XML parsing is a widely
143        studied problem that has seen the development of a number
144        of interesting research prototypes using both SIMD and
145        multicore parallelism.   Most works have investigated
146        data parallel solutions on multicore
147        architectures using various strategies to break input
148        documents into segments that can be allocated to different cores.
149        For example, one possibility for data
150        parallelization is to add a pre-parsing step to compute
151        a skeleton tree structure of an  XML document <citation linkend="GRID2006"/>.
152        The parallelization of the pre-parsing stage itself can be tackled with
153          state machines <citation linkend="E-SCIENCE2007"/>, <citation linkend="IPDPS2008"/>.
154        Methods without pre-parsing have used speculation <citation linkend="HPCC2011"/> or post-processing that
155        combines the partial results <citation linkend="ParaDOM2009"/>.
156        A hybrid technique that combines data and pipeline parallelism was proposed to
157        hide the latency of a "job" that has to be done sequentially <citation linkend="ICWS2008"/>.
158      </para>
159      <para>
160        Fewer efforts have investigated SIMD parallelism, although this approach
161        has the potential advantage of improving single core performance as well
162        as offering savings in energy consumption <citation linkend="HPCA2012"/>.
163        Intel introduced specialized SIMD string processing instructions in the SSE 4.2 instruction set extension
164        and showed how they can be used to improve the performance of XML parsing <citation linkend="XMLSSE42"/>.
165        The Parabix framework uses generic SIMD extensions and bit parallel methods to
166        process hundreds of XML input characters simultaneously <citation linkend="Cameron2009"/> <citation linkend="cameron-EuroPar2011"/>.
167        Parabix prototypes have also combined SIMD methods with thread-level parallelism to
168        achieve further acceleration on multicore systems <citation linkend="HPCA2012"/>.
169      </para>
170      <para>
171        In this paper, we move beyond research prototypes to consider
172        the detailed integration of both SIMD and multicore parallelism into the
173        Xerces-C++ parser of the Apache Software Foundation, an existing
174        standards-compliant open-source parser that is widely used
175        in commercial practice.    The challenge of this work is
176        to parallelize the Xerces parser in such a way as to
177        preserve the existing APIs while offering worthwhile
178        end-to-end acceleration of XML processing.   
179        To achieve the best results possible, we undertook
180        a nine-month comprehensive restructuring of the Xerces-C++ parser,
181        seeking to expose as many critical aspects of XML parsing
182        as possible for parallelization, the result of which we named icXML.   
183        Overall, we employed Parabix-style methods of transcoding, tokenization
184        and tag parsing, parallel string comparison methods in symbol
185        resolution, bit parallel methods in namespace processing,
186        as well as staged processing using pipeline parallelism to take advantage of
187        multiple cores.
188      </para>
189      <para>
190        The remainder of this paper is organized as follows.   
191          <xref linkend="background"/> discusses the structure of the Xerces and Parabix XML parsers and the fundamental
192        differences between the two parsing models.   
193        <xref linkend="architecture"/> then presents the icXML design based on a restructured Xerces architecture to
194        incorporate SIMD parallelism using Parabix methods.   
195        <xref linkend="multithread"/> moves on to consider the multithreading of the icXML architecture
196        using the pipeline parallelism model. 
197        <xref linkend="performance"/> analyzes the performance of both the single-threaded and
198        multi-threaded versions of icXML in comparison to original Xerces,
199        demonstrating substantial end-to-end acceleration of
200        a GML-to-SVG translation application written against the Xerces API.
201          <xref linkend="conclusion"/> concludes the paper with a discussion of future work and the potential for
202        applying the techniques discussed herein in other application domains.
203      </para>
204   </section>
205
206   <section xml:id="background">
207      <title>Background</title>
208      <section xml:id="background-xerces">
209         <title>Xerces C++ Structure</title>
210         <para> The Xerces C++ parser is a widely-used standards-conformant
211            XML parser produced as open-source software
212             by the Apache Software Foundation.
213            It features comprehensive support for a variety of character encodings both
214            commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple
215            XML vocabularies through the XML namespace mechanism, as well as complete
216            implementations of structure and data validation through multiple grammars declared
217            using either legacy DTDs (document type definitions) or modern XML Schema facilities.
218            Xerces also supports several APIs for accessing parser services, including event-based
219            parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM
220            tree-based parsing interface. </para>
221         <para>
222            Xerces,
223            like all traditional parsers, processes XML documents sequentially, one byte at a time, from
224            the first to the last byte of input data. Each byte passes through several processing
225            layers and is classified and eventually validated within the context of the document
226            state. This introduces implicit dependencies between the various tasks within the
227            application that make it difficult to optimize for performance. Because Xerces is a complex
228              software system, no one feature dominates the overall parsing performance. <xref linkend="xerces-profile"/>
229            shows the execution time profile of the top ten functions in a
230            typical run; the most expensive of them accounts for only 13.29% of the total. Even if it were
231            possible, Amdahl's Law dictates that tackling any one of these functions for parallelization
232            in isolation would produce only a small improvement in overall performance. Unfortunately, early investigation into these functions found that
233            incorporating speculation-free thread-level parallelization was impossible and they were
234            already performing well in their given tasks; thus only trivial enhancements were
235            attainable. In order to obtain a systematic acceleration of Xerces, it should be
236            expected that a comprehensive restructuring is required, involving all aspects of the
237            parser. </para>
238             <table xml:id="xerces-profile">
239                  <caption>
240                     <para>Execution Time of Top 10 Xerces Functions</para>
241                  </caption>
242                  <colgroup>
243                     <col align="left" valign="top"/>
244                     <col align="left" valign="top"/>
245                  </colgroup>
246                  <thead><tr><th>Time (%) </th><th> Function Name </th></tr></thead>
247                  <tbody>
248<tr valign="top"><td>13.29      </td>   <td>XMLUTF8Transcoder::transcodeFrom </td></tr>
249<tr valign="top"><td>7.45       </td>   <td>IGXMLScanner::scanCharData </td></tr>
250<tr valign="top"><td>6.83       </td>   <td>memcpy </td></tr>
251<tr valign="top"><td>5.83       </td>   <td>XMLReader::getNCName </td></tr>
252<tr valign="top"><td>4.67       </td>   <td>IGXMLScanner::buildAttList </td></tr>
253<tr valign="top"><td>4.54       </td>   <td>RefHashTableOf&lt;&gt;::findBucketElem </td></tr>
254<tr valign="top"><td>4.20       </td>   <td>IGXMLScanner::scanStartTagNS </td></tr>
255<tr valign="top"><td>3.75       </td>   <td>ElemStack::mapPrefixToURI </td></tr>
256<tr valign="top"><td>3.58       </td>   <td>ReaderMgr::getNextChar </td></tr>
257<tr valign="top"><td>3.20       </td>   <td>IGXMLScanner::basicAttrValueScan </td></tr>
258                  </tbody>
259               </table>
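         <para>The bound implied by Amdahl's Law can be checked directly against the profile in
            <xref linkend="xerces-profile"/>. The following is a minimal sketch and is not part of
            Xerces or icXML; the function names <code>amdahlSpeedup</code> and
            <code>amdahlLimit</code> are hypothetical, introduced only for this illustration. It
            computes the overall speedup obtained when a given fraction of execution time is
            accelerated by a given factor, and the limit reached when that fraction is eliminated
            entirely.</para>
         <programlisting xml:space="preserve"><![CDATA[
```cpp
#include <cassert>

// Overall speedup when `fraction` of the run time is accelerated by `factor`
// and the remainder is left unchanged (Amdahl's Law).
double amdahlSpeedup(double fraction, double factor) {
    assert(fraction >= 0.0 && fraction < 1.0 && factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}

// Limiting speedup as `factor` grows without bound, i.e. when the
// accelerated portion costs nothing at all.
double amdahlLimit(double fraction) {
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```
]]></programlisting>
         <para>With the 13.29% share of <code>XMLUTF8Transcoder::transcodeFrom</code>,
            <code>amdahlLimit(0.1329)</code> is approximately 1.15: even a transcoder that took no
            time at all would leave overall parsing only about 15% faster.</para>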
260      </section>
261      <section>
262         <title>The Parabix Framework</title>
263         <para> The Parabix (parallel bit stream) framework is a transformative approach to XML
264            parsing (and other forms of text processing). The key idea is to exploit the
265            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
266            data from long blocks of input data by using one register bit per single input byte. To
267            facilitate this, the input data is first transposed into a set of basis bit streams.
268              For example, <xref linkend="xml-bytes"/> shows  the ASCII bytes for the string "<code>b7&lt;A</code>" with
269                the corresponding  8 basis bit streams, b<subscript>0</subscript> through  b<subscript>7</subscript> shown in  <xref linkend="xml-bits"/>.
270            The bits used to construct b<subscript>7</subscript> have been highlighted in this example.
271              Boolean-logic operations (&#8743;, &#8744; and &#172; denote the
272              boolean AND, OR and NOT operators) are used to classify the input bits into a set of
273               <emphasis role="ital">character-class bit streams</emphasis>, which identify key
274            characters (or groups of characters) with a <code>1</code>. For example, one of the
275            fundamental characters in XML is a left-angle bracket. A character is an
276               <code>&apos;&lt;&apos; if and only if
277               &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
278               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
279               &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>)
280               &#8743; &#172; (b<subscript>6</subscript> &#8744;
281               b<subscript>7</subscript>) = 1</code>. Similarly, a character is numeric, <code>[0-9]
282               if and only if &#172;(b<subscript>0</subscript> &#8744;
283               b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743;
284                  b<subscript>3</subscript>) &#8743; &#172;(b<subscript>4</subscript>
285               &#8743; (b<subscript>5</subscript> &#8744;
286            b<subscript>6</subscript>))</code>. An important observation here is that ranges of
287            characters may require fewer operations than individual characters and
288            <!-- the classification cost could be amortized over many character classes.--> multiple
289            classes can share the classification cost. </para>
290         <table xml:id="xml-bytes">
291                  <caption>
292                     <para>XML Source Data</para>
293                  </caption>
294                  <colgroup>
295                     <col align="right" valign="top"/>
296                     <col align="centre" valign="top"/>
297                     <col align="centre" valign="top"/>
298                     <col align="centre" valign="top"/>
299                     <col align="centre" valign="top"/>
300                  </colgroup>
301                  <tbody>
302  <tr><td>String </td><td> <code>b</code> </td><td> <code>7</code> </td><td> <code>&lt;</code> </td><td> <code>A</code> </td></tr>
303  <tr><td>ASCII </td><td> <code>0110001<emphasis role="bold">0</emphasis></code> </td><td> <code>0011011<emphasis role="bold">1</emphasis></code> </td><td> <code>0011110<emphasis role="bold">0</emphasis></code> </td><td> <code>0100000<emphasis role="bold">1</emphasis></code> </td></tr>
304  </tbody>
305 
306 
307</table>         
308         <table xml:id="xml-bits">
309                  <caption>
310                     <para>8-bit ASCII Basis Bit Streams</para>
311                  </caption>
312                  <colgroup>
313                     <col align="centre" valign="top"/>
314                     <col align="centre" valign="top"/>
315                     <col align="centre" valign="top"/>
316                     <col align="centre" valign="top"/>
317                     <col align="centre" valign="top"/>
318                     <col align="centre" valign="top"/>
319                     <col align="centre" valign="top"/>
320                     <col align="centre" valign="top"/>
321                  </colgroup>
322                  <tbody>
323<tr><td> b<subscript>0</subscript> </td><td> b<subscript>1</subscript> </td><td> b<subscript>2</subscript> </td><td> b<subscript>3</subscript></td><td> b<subscript>4</subscript> </td><td> b<subscript>5</subscript> </td><td> b<subscript>6</subscript> </td><td> b<subscript>7</subscript> </td></tr>
324 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
325 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
326 <tr><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>0</code></emphasis> </td></tr>
327 <tr><td> <code>0</code> </td><td> <code>1</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <code>0</code> </td><td> <emphasis role="bold"><code>1</code></emphasis> </td></tr>
328  </tbody>
329 
330 
331</table>         
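         <para>The transposition and the two character-class equations above can be sketched in
            a few lines of C++. This is an illustration under stated assumptions, not the Parabix
            implementation: a single 64-bit word stands in for a SIMD register, bit
            <emphasis role="ital">i</emphasis> of each word corresponds to input position
            <emphasis role="ital">i</emphasis>, the transposition is done with a plain scalar loop,
            and the names <code>Basis</code>, <code>transpose</code>, <code>isLeftAngle</code> and
            <code>isDigit</code> are hypothetical.</para>
         <programlisting xml:space="preserve"><![CDATA[
```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// basis[k] holds bit k of every input byte, with b0 the most significant bit.
// Bit i of each 64-bit word corresponds to input position i.
using Basis = std::array<std::uint64_t, 8>;

// Scalar stand-in for the transposition step (assumption: at most 64 bytes).
Basis transpose(const std::string& text) {
    assert(text.size() <= 64);
    Basis b{};
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(text[i]);
        for (int k = 0; k < 8; ++k) {
            std::uint64_t bit = (byte >> (7 - k)) & 1u;
            b[k] |= bit << i;
        }
    }
    return b;
}

// '<' is 0x3C: b0 = b1 = 0, b2..b5 = 1, b6 = b7 = 0.
std::uint64_t isLeftAngle(const Basis& b) {
    return ~(b[0] | b[1]) & (b[2] & b[3]) & (b[4] & b[5]) & ~(b[6] | b[7]);
}

// [0-9] is 0x30..0x39: high nibble 0011, low nibble at most 1001.
std::uint64_t isDigit(const Basis& b) {
    return ~(b[0] | b[1]) & (b[2] & b[3]) & ~(b[4] & (b[5] | b[6]));
}
```
]]></programlisting>
         <para>For the string <code>b7&lt;A</code>, <code>isLeftAngle</code> sets only the bit for
            position 2 and <code>isDigit</code> only the bit for position 1. Each classification
            costs a fixed number of logic operations regardless of how many input positions the
            word covers.</para>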
332
333         <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
334         <!-- bit streams can be transformed into lexical bit streams, where the presense of -->
335         <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
336         <!-- process, intra-element well-formedness validation is performed on each block -->
337         <!-- of text. -->
338         <para> Consider, for example, the XML source data stream shown in the first line of <xref linkend="derived"/>.
339The remaining lines of this table show
340            several parallel bit streams that are computed in Parabix-style parsing, with each bit
341            of each stream in one-to-one correspondence to the source character code units of the
342            input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are
343            represented as underscores. The first bit stream shown is that for the opening angle
344            brackets that represent tag openers in XML. The second and third streams show a
345            partition of the tag openers into start tag marks and end tag marks depending on whether
346            the character immediately following the opener is &quot;<code>/</code>&quot; or
347            not. The fourth stream marks empty-element tags, of which this example contains none. The remaining
348            three lines show streams that can be computed in subsequent parsing (using the technique of bitstream addition <citation linkend="cameron-EuroPar2011"/>), namely streams
349            marking the element names, attribute names and attribute values of tags. </para>
350            <table xml:id="derived">
351                  <caption>
352                     <para>XML Source Data and Derived Parallel Bit Streams</para>
353                  </caption>
354                  <colgroup>
355                     <col align="centre" valign="top"/>
356                     <col align="left" valign="top"/>
357                  </colgroup>
358                  <tbody>
359          <tr><td> Source Data </td><td> <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
360          <tr><td> Tag Openers </td><td> <code>1____________1____________________________1____________1__________</code></td></tr>
361           <tr><td> Start Tag Marks </td><td> <code>_1____________1___________________________________________________</code></td></tr>
362           <tr><td> End Tag Marks </td><td> <code>___________________________________________1____________1_________</code></td></tr>
363           <tr><td> Empty Tag Marks </td><td> <code>__________________________________________________________________</code></td></tr>
364           <tr><td> Element Names </td><td> <code>_11111111_____1111111_____________________________________________</code></td></tr>
365           <tr><td> Attribute Names </td><td> <code>______________________11_______11_________________________________</code></td></tr>
366           <tr><td> Attribute Values </td><td> <code>__________________________111________111__________________________</code></td></tr>
367                  </tbody>
368               </table>         
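         <para>The first three derived streams of <xref linkend="derived"/> can be reproduced with
            a short sketch. It assumes a <code>std::bitset</code> in place of SIMD registers and
            builds the character-class streams with a byte loop purely for brevity; the names
            <code>match</code>, <code>markTags</code> and <code>render</code> are hypothetical. It
            also ignores comments, processing instructions and other constructs that begin with a
            left angle bracket, which a real parser must exclude from the start tag marks.</para>
         <programlisting xml:space="preserve"><![CDATA[
```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// One bit per input byte; 128 bits comfortably hold the 66-byte example.
using BitStream = std::bitset<128>;

// Character-class stream: bit i is set when text[i] == c.
BitStream match(const std::string& text, char c) {
    assert(text.size() <= 128);
    BitStream s;
    for (std::size_t i = 0; i < text.size(); ++i) s[i] = (text[i] == c);
    return s;
}

struct TagMarks {
    BitStream openers;  // every '<'
    BitStream start;    // position after '<' when it is not '/'
    BitStream end;      // position after '<' when it is '/'
};

TagMarks markTags(const std::string& text) {
    BitStream lt = match(text, '<');
    BitStream slash = match(text, '/');
    BitStream after = lt << 1;  // advance each opener mark by one position
    return { lt, after & ~slash, after & slash };
}

// Render the first n positions in the notation of the table: '1' or '_'.
std::string render(const BitStream& s, std::size_t n) {
    std::string out(n, '_');
    for (std::size_t i = 0; i < n && i < s.size(); ++i)
        if (s[i]) out[i] = '1';
    return out;
}
```
]]></programlisting>
         <para>Applied to the source line of the table, <code>markTags</code> yields openers at
            positions 0, 13, 42 and 55, start tag marks at positions 1 and 14, and end tag marks at
            positions 43 and 56, matching the corresponding rows.</para>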
369
370         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
371            parsing performance. The first is that the use of the full register width offers a
372            considerable information advantage over sequential byte-at-a-time parsing. That is,
373            sequential processing of bytes uses just 8 bits of each register, greatly limiting the
374            processor resources that are effectively being used at any one time. The second is that
375            byte-at-a-time scanning loops are actually often just computing a single bit of
376            information per iteration: is the scan complete yet? Rather than computing these
377            individual decision-bits, an approach that computes many of them in parallel (e.g., 128
378            bytes at a time using 128-bit registers) should provide substantial benefit. </para>
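         <para>The second intuition can be made concrete with scanning by bitstream addition. The
            following is a minimal sketch, assuming a single 64-bit word in place of a SIMD
            register and bit <emphasis role="ital">i</emphasis> corresponding to input position
            <emphasis role="ital">i</emphasis>; the names <code>letterClass</code>,
            <code>scanThru</code> and <code>spans</code> are hypothetical, and the character class
            is built with a byte loop only to keep the example short. Adding a marker stream to a
            character-class stream lets the carry ripple through each run of class characters that
            begins at a marker, so a single addition advances every marker to the end of its run
            instead of testing one byte per loop iteration.</para>
         <programlisting xml:space="preserve"><![CDATA[
```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

using BitStream = std::uint64_t;  // bit i corresponds to input position i

// Character-class stream for ASCII letters (assumption: fewer than 64 bytes).
BitStream letterClass(const std::string& text) {
    assert(text.size() < 64);
    BitStream s = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
        if (std::isalpha(static_cast<unsigned char>(text[i])))
            s |= BitStream{1} << i;
    return s;
}

// Move each marker to the first position past the run of `cls` bits it starts.
// Assumes at most one marker per run.
BitStream scanThru(BitStream markers, BitStream cls) {
    return (markers + cls) & ~cls;
}

// Positions from each start mark up to, but not including, its end mark.
BitStream spans(BitStream starts, BitStream ends) {
    return ends - starts;
}
```
]]></programlisting>
         <para>For the text <code>&lt;doc&gt;fee&lt;elem a='x'&gt;</code> with start marks at
            positions 1 and 9, <code>scanThru</code> returns marks at positions 4 and 13, and
            <code>spans</code> then covers positions 1 to 3 and 9 to 12, the two element
            names.</para>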
379         <para> Previous studies have shown that the Parabix approach improves many aspects of XML
380            processing, including transcoding <citation linkend="Cameron2008"/>, character classification and
381            validation, tag parsing and well-formedness checking. The first Parabix parser used
382            processor bit scan instructions to considerably accelerate sequential scanning loops for
383            individual characters <citation linkend="CameronHerdyLin2008"/>. Recent work has incorporated a method
384            of parallel scanning using bitstream addition <citation linkend="cameron-EuroPar2011"/>, as well as
385            combining SIMD methods with 4-stage pipeline parallelism to further improve throughput
386            <citation linkend="HPCA2012"/>. Although these research prototypes handled the full syntax of
387            schema-less XML documents, they lacked the functionality required by full XML parsers. </para>
388         <para> Commercial XML processors support transcoding of multiple character sets and can
389            parse and validate against multiple document vocabularies. Additionally, they provide
390            API facilities beyond those found in research prototypes, including the widely used SAX,
391            SAX2 and DOM interfaces. </para>
392      </section>
393      <section>
394         <title>Sequential vs. Parallel Paradigm</title>
395         <para> Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents
396            sequentially. Each character is examined to distinguish between the XML-specific markup,
397            such as a left angle bracket <code>&lt;</code>, and the content held within the
398            document. As the parser progresses through the document, it alternates between markup
399            scanning, validation and content processing modes. </para>
400         <para> In other words, Xerces belongs to an equivalence class of applications termed FSM
401           applications<xref linkend="FSM"/>.<footnote xml:id="FSM"><para>Herein FSM applications are considered software systems whose
402            behaviour is defined by the inputs, current state and the events associated with
403              transitions of states.</para></footnote> Each state transition indicates the processing context of
404            subsequent characters. Unfortunately, textual data tends to be unpredictable and any
405            character could induce a state transition. </para>
406         <para> Parabix-style XML parsers utilize a concept of layered processing. A block of source
407            text is transformed into a set of lexical bitstreams, which undergo a series of
408            operations that can be grouped into logical layers, e.g., transposition, character
409            classification, and lexical analysis. Each layer is pipeline parallel and requires
410            neither speculation nor pre-parsing stages<citation linkend="HPCA2012"/>. To meet the API requirements
411            of the document-ordered Xerces output, the results of the Parabix processing layers must
412            be interleaved to produce the equivalent behaviour. </para>
413      </section>
414   </section>
415   <section xml:id="architecture">
416      <title>Architecture</title>
417      <section>
418         <title>Overview</title>
419         <!--\def \CSG{Content Stream Generator}-->
420         <para> icXML is more than an optimized version of Xerces. Many components were grouped,
421            restructured and rearchitected with pipeline parallelism in mind. In this section, we
422            highlight the core differences between the two systems. As shown in Figure
423              <xref linkend="xerces-arch"/>, Xerces is comprised of five main modules: the transcoder, reader,
424            scanner, namespace binder, and validator. The <emphasis role="ital"
425            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
426            the majority of the character set encoding validation is performed as a byproduct of
427            this process. The <emphasis role="ital">Reader</emphasis> is responsible for the
428            streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current
429            line/column position,
430            <!--(which is reported in the unlikely event that the input contains an error), -->
431            performs line-break normalization and validates context-specific character set issues,
432            such as tokenization of qualified-names. The <emphasis role="ital">Scanner</emphasis>
433            pulls data through the reader and constructs the intermediate representation (IR) of the
434            document; it deals with all issues related to entity expansion, validates the XML
435            well-formedness constraints and any character set encoding issues that cannot be
436            completely handled by the reader or transcoder (e.g., surrogate characters, validation
437            and normalization of character references, etc.). The <emphasis role="ital">Namespace
438               Binder</emphasis> is a core piece of the element stack. It handles namespace scoping
439            issues between different XML vocabularies. This allows the scanner to properly select
440            the correct schema grammar structures. The <emphasis role="ital">Validator</emphasis>
441            takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder)
442            and assesses whether the final output matches the user-defined DTD and schema grammar(s)
443            before passing it to the end-user. </para>     
444        <figure xml:id="xerces-arch">
445          <title>Xerces Architecture</title>
446          <mediaobject>
447            <imageobject>
448              <imagedata format="png" fileref="xerces.png" width="150cm"/>
449            </imageobject>
450          </mediaobject>
451          <caption>
452          </caption>
453        </figure>
454         <para> In icXML, functions are grouped into logical components. As shown in
455             <xref linkend="xerces-arch"/>, two major categories exist: (1) the Parabix Subsystem and (2) the
456               Markup Processor. All tasks in (1) use the Parabix Framework <citation linkend="HPCA2012"/>, which
457            represents data as a set of parallel bitstreams. The <emphasis role="ital">Character Set
458              Adapter</emphasis>, discussed in <xref linkend="character-set-adapter"/>, mirrors
459            Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of
460              lexical bitstreams, similar to those shown in <xref linkend="parabix1"/>. These lexical
461            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
462            additional processing is performed. The first precursor to producing UTF-16 is the
463               <emphasis role="ital">Parallel Markup Parser</emphasis> phase. It takes the lexical
464            streams and produces a set of marker bitstreams in which a 1-bit identifies significant
465            positions within the input data. One bitstream is created for each critical piece of
466            information, such as the beginning and ending of start tags, end tags,
467            element names, attribute names, attribute values and content. Intra-element
468            well-formedness validation is performed as an artifact of this process. Like Xerces,
469            icXML must provide the Line and Column position of each error. The <emphasis role="ital"
470               >Line-Column Tracker</emphasis> uses the lexical information to keep track of the
471            document position(s) through the use of an optimized population count algorithm,
472              described in <xref linkend="errorhandling"/>. From here, two data-independent
473            branches exist: the Symbol Resolver and Content Preparation Unit. </para>
474         <para> A typical XML file contains few unique element and attribute names&#8212;but
475            each of them will occur frequently. icXML stores these as distinct data structures,
476            called symbols, each with its own global identifier (GID). Using the symbol marker
477            streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol
478               Resolver</emphasis> scans through the raw data to produce a sequence of GIDs, called
479            the <emphasis role="ital">symbol stream</emphasis>. </para>
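         <para> A minimal sketch of this step, assuming marker streams are held as Python
            integers in which bit <emphasis role="ital">i</emphasis> corresponds to byte position
            <emphasis role="ital">i</emphasis>. The function and variable names are hypothetical
            and the dictionary stands in for icXML's symbol table; this is an illustration, not
            the icXML implementation. </para>
<programlisting><![CDATA[
```python
def positions(stream):
    """Yield the positions of the 1-bits in a marker stream, lowest first."""
    while stream:
        low = stream & -stream          # isolate the lowest set bit
        yield low.bit_length() - 1
        stream ^= low


def resolve_symbols(data, name_starts, name_ends, table):
    """Return the symbol stream (a list of GIDs) for one segment of input.

    `table` maps a raw name to its GID and persists across segments,
    so a name seen earlier is never entered twice.
    """
    symbol_stream = []
    for start, end in zip(positions(name_starts), positions(name_ends)):
        name = data[start:end]
        gid = table.setdefault(name, len(table))   # allocate on first occurrence
        symbol_stream.append(gid)
    return symbol_stream


data = b"<a><b/><b/></a>"
name_starts = sum(1 << p for p in (1, 4, 8, 13))   # first byte of each name
name_ends = sum(1 << p for p in (2, 5, 9, 14))     # first byte after each name
table = {}
symbol_stream = resolve_symbols(data, name_starts, name_ends, table)
```
]]></programlisting>
         <para> Here the four name occurrences collapse to two symbols, so the symbol stream
            carries only small integers rather than repeated name strings. </para>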
480         <para> The final components of the Parabix Subsystem are the <emphasis role="ital">Content
481               Preparation Unit</emphasis> and <emphasis role="ital">Content Stream
482            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
483            filters them, according to the information provided by the Parallel Markup Parser, and
484            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital">content stream</emphasis>, discussed in <xref linkend="contentstream"/>. </para>
485         <para> Combined, the symbol and content stream form icXML's compressed IR of the XML
486            document. The <emphasis role="ital">Markup Processor</emphasis> parses the IR to
487            validate and produce the sequential output for the end user. The <emphasis role="ital"
488               >Final WF checker</emphasis> performs inter-element well-formedness validation that
489            would be too costly to perform in bit space, such as ensuring every start tag has a
490            matching end tag. Xerces's namespace binding functionality is replaced by the <emphasis
491               role="ital">Namespace Processor</emphasis>. Unlike Xerces, it is a discrete phase
492            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
493               stream</emphasis>, which are associated with each symbol occurrence. This is
494                 discussed in <xref linkend="namespace-handling"/>. Finally, the <emphasis
495               role="ital">Validation</emphasis> layer implements Xerces's validator. However,
496            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
497        <figure xml:id="icxml-arch">
498          <title>icXML Architecture</title>
499          <mediaobject>
500            <imageobject>
501              <imagedata format="png" fileref="icxml.png" width="500cm"/>
502            </imageobject>
503          </mediaobject>
504          <caption>
505          </caption>
506        </figure>
507      </section>
508      <section xml:id="character-set-adapter">
509         <title>Character Set Adapters</title>
510         <para> In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of
511            Xerces itself and provide the end-consumer with a single encoding format. In the
512            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
513            because of the need to decode and classify each byte of input, mapping variable-length
514            UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In
515            other cases, transcoding may involve table look-up operations for each byte of input. In
516            any case, transcoding imposes at least a cost of buffer copying. </para>
517         <para> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
518            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
519            that input code units represent valid characters, mapping the characters of the encoding
520            into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical
521            item streams), as well as supporting ultimate transcoding requirements. All of this work
522            is performed using the parallel bitstream representation of the source input. </para>
523         <para> An important observation is that many character sets are extensions of the legacy
524            7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8,
525            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
526            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
527            serves to compute lexical item streams for all such ASCII-based character sets. </para>
528         <para> A second observation is that&#8212;regardless of which character set is
529            used&#8212;quite often all of the characters in a particular block of input will be
530            within the ASCII range. This is a very simple test to perform using the bitstream
531            representation, simply confirming that the bit 0 stream is zero for the entire block.
532            For blocks satisfying this test, all logic dealing with non-ASCII characters can simply
533            be skipped. Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
534            UTF-16 form are each set to zero in this case. </para>
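         <para> The ASCII test can be sketched as follows. The sketch assumes that bit 0 names
            the most significant bit of each byte, which is consistent with the bit numbering in
            the line break listing, and builds the basis bitstreams with a plain loop rather than
            SIMD transposition. The function names are hypothetical. </para>
<programlisting><![CDATA[
```python
def transpose(block):
    """Return the eight basis bitstreams for a block of bytes.

    Stream k collects bit k of every byte, where bit 0 is the most
    significant bit. Byte i of the block maps to bit i of each stream.
    """
    basis = [0] * 8
    for i, byte in enumerate(block):
        for k in range(8):
            if byte & (0x80 >> k):
                basis[k] |= 1 << i
    return basis


def is_ascii_block(basis):
    """A block is pure ASCII exactly when its bit 0 stream is all zero."""
    return basis[0] == 0


ascii_block = transpose(b"<doc>text</doc>")
mixed_block = transpose("<doc>\u00e9</doc>".encode("utf-8"))
```
]]></programlisting>
         <para> For a block that passes the test, the eight low UTF-16 bitstreams are simply the
            basis streams themselves and the eight high streams are zero. </para>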
535         <para> A third observation is that repeated transcoding of the names of XML elements,
536            attributes and so on can be avoided by using a look-up mechanism. That is, the first
537            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
538            numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up
539            operations can avoid transcoding by simply retrieving the stored representation. As
540            symbol look-up is already required to apply various XML validation rules, this achieves the
541            effect of transcoding each occurrence without additional cost. </para>
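         <para> As an illustration of the third observation, the sketch below transcodes a name
            only on its first occurrence and counts how often transcoding actually runs. The class
            name, its fields and the use of Python's codec machinery are assumptions made for the
            example. </para>
<programlisting><![CDATA[
```python
class SymbolCache:
    """Hypothetical look-up table keyed by the name in its input encoding."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding
        self.entries = {}        # raw name -> (symbol ID, UTF-16 name)
        self.transcodings = 0    # how many times a name was transcoded

    def lookup(self, raw_name):
        entry = self.entries.get(raw_name)
        if entry is None:
            # First occurrence: transcode once and remember the result.
            utf16 = raw_name.decode(self.encoding).encode("utf-16-le")
            self.transcodings += 1
            entry = (len(self.entries), utf16)
            self.entries[raw_name] = entry
        return entry


cache = SymbolCache()
prenom = "pr\u00e9nom".encode("utf-8")
ids = [cache.lookup(name)[0] for name in (b"titre", prenom, b"titre", prenom)]
```
]]></programlisting>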
542         <para> The cost of individual character transcoding is avoided whenever a block of input is
543            confined to the ASCII subset and for all but the first occurrence of any XML element or
544            attribute name. Furthermore, when transcoding is required, the parallel bitstream
545            representation supports efficient transcoding operations. In the important case of UTF-8
546            to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit
547              parallel fashion based on UTF-8 streams <citation linkend="Cameron2008"/>, and all but the final bytes
548            of multi-byte sequences can be marked for deletion as discussed in the following
549            subsection. In other cases, transcoding within a block only need be applied for
550            non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream
551            using bit scan operations. </para>
552      </section>
553      <section xml:id="par-filter">
554         <title>Combined Parallel Filtering</title>
555         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
556            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
557            Chinese characters <code>&#x4F60;&#x597D;</code> are represented as two
558            three-byte UTF-8 sequences <code>E4 BD A0</code> and <code>E5 A5 BD</code> while the
559            UTF-16 representation must be compressed down to the two code units <code>4F60</code>
560            and <code>597D</code>. In the bit parallel representation, this corresponds to a
561            reduction from six bit positions representing UTF-8 code units (bytes) down to just two
562            bit positions representing UTF-16 code units (double bytes). This compression may be
563            achieved by arranging to calculate the correct UTF-16 bits at the final position of each
564            sequence and creating a deletion mask to mark the first two bytes of each 3-byte
565            sequence for deletion. In this case, the portion of the mask corresponding to these
566            input bytes is the bit sequence <code>110110</code>. Using this approach, transcoding
567            may then be completed by applying parallel deletion and inverse transposition of the
568            UTF-16 bitstreams<citation linkend="Cameron2008"/>. </para>
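         <para> The deletion mask for this example can be reproduced with a short sketch. It
            assumes valid UTF-8 input and models each bitstream as a Python integer whose bit
            <emphasis role="ital">i</emphasis> corresponds to byte
            <emphasis role="ital">i</emphasis>; the helper names are hypothetical. </para>
<programlisting><![CDATA[
```python
def basis_bit(data, k):
    """Bitstream of bit k of every byte (bit 0 = most significant bit)."""
    return sum(1 << i for i, byte in enumerate(data) if byte & (0x80 >> k))


def utf8_deletion_mask(data):
    """Mark every byte that is not the last byte of its UTF-8 sequence."""
    bit0 = basis_bit(data, 0)
    bit1 = basis_bit(data, 1)
    continuation = bit0 & ~bit1      # bytes of the form 10xxxxxx
    # A byte is non-final exactly when the byte after it is a continuation byte.
    return continuation >> 1


def show(mask, length):
    """Print a bitstream in document order, position 0 first."""
    return "".join("1" if (mask >> i) & 1 else "0" for i in range(length))


data = "\u4f60\u597d".encode("utf-8")     # E4 BD A0 E5 A5 BD
mask = utf8_deletion_mask(data)
mask_text = show(mask, len(data))
```
]]></programlisting>
         <para> Two positions survive the mask, one for each UTF-16 code unit. </para>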
569         <para> Rather than immediately paying the costs of deletion and transposition just for
570            transcoding, however, icXML defers these steps so that the deletion masks for several
571            stages of processing may be combined. In particular, this includes core XML requirements
572            to normalize line breaks and to replace character references and entity references by
573            their corresponding text. In the case of line break normalization, all forms of line
574            breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations,
575            must be normalized to a single LF character in each case. In icXML, this is achieved by
576            first marking CR positions, performing three bit parallel operations to transform the
577            marked CRs into LFs, and then marking for deletion any LF that is found immediately
578            after a marked CR, as shown by the Pablo source code in
579              <xref  linkend="fig-LBnormalization"/>.
580              <figure xml:id="fig-LBnormalization">
581                <caption>Line Break Normalization Logic</caption>
582  <programlisting>
583# XML 1.0 line-break normalization rules.
584if lex.CR:
585# Modify CR (#x0D) to LF (#x0A)
586  u16lo.bit_5 ^= lex.CR
587  u16lo.bit_6 ^= lex.CR
588  u16lo.bit_7 ^= lex.CR
589  CRLF = pablo.Advance(lex.CR) &amp; lex.LF
590  callouts.delmask |= CRLF
591# Adjust LF streams for line/column tracker
592  lex.LF |= lex.CR
593  lex.LF ^= CRLF
594</programlisting>
595</figure>
596         </para>
597         <para> In essence, the deletion masks for transcoding and for line break normalization each
598            represent a bitwise filter; these filters can be combined using bitwise-or so that the
599            parallel deletion algorithm need only be applied once. </para>
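         <para> A small sketch of the combination, assuming bitstreams are Python integers with
            bit <emphasis role="ital">i</emphasis> standing for byte
            <emphasis role="ital">i</emphasis>. The helper names are hypothetical, and the shift
            by one position plays the role of <code>pablo.Advance</code>. </para>
<programlisting><![CDATA[
```python
def match(data, value):
    """Bitstream with a 1 at every position holding the byte `value`."""
    return sum(1 << i for i, byte in enumerate(data) if byte == value)


def transcoding_filter(data):
    """All but the last byte of each UTF-8 sequence (valid input assumed)."""
    continuation = sum(1 << i for i, byte in enumerate(data)
                       if (byte & 0xC0) == 0x80)
    return continuation >> 1


def line_break_filter(data):
    """An LF that immediately follows a CR."""
    cr, lf = match(data, 0x0D), match(data, 0x0A)
    return (cr << 1) & lf


data = "\u00e9\r\nok".encode("utf-8")      # C3 A9 0D 0A 6F 6B
# One bitwise-or merges both filters; deletion then runs a single time.
delmask = transcoding_filter(data) | line_break_filter(data)
kept = [i for i in range(len(data)) if not (delmask >> i) & 1]
```
]]></programlisting>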
600         <para> A further application of combined filtering is the processing of XML character and
601            entity references. Consider, for example, the references <code>&amp;amp;</code> or
602               <code>&amp;#x3C;</code>, which must be replaced in XML processing with the single
603               <code>&amp;</code> and <code>&lt;</code> characters, respectively. The
604            approach in icXML is to mark all but the first character positions of each reference for
605            deletion, leaving a single character position unmodified. Thus, for the references
606               <code>&amp;amp;</code> or <code>&amp;#x3C;</code> the masks <code>01111</code> and
607               <code>011111</code> are formed and combined into the overall deletion mask. After the
608            deletion and inverse transposition operations are finally applied, a post-processing
609            step inserts the proper character at these positions. One note about this process is
610            that it is speculative; references are assumed to generally be replaced by a single
611            UTF-16 code unit. When this is not the case, the reference is handled in
612            post-processing. </para>
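         <para> The masks above can be reproduced with the following sketch. It locates
            references with a regular expression instead of bitstream logic, resolves only the
            five predefined entities, and does not model the speculative case of a replacement
            needing more than one code unit; all names are hypothetical. </para>
<programlisting><![CDATA[
```python
import re

REFERENCE = re.compile(r"&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][\w:.\-]*);")
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def resolve(body):
    """Return the replacement text for the part between '&' and ';'."""
    if body.startswith("#x"):
        return chr(int(body[2:], 16))
    if body.startswith("#"):
        return chr(int(body[1:]))
    return PREDEFINED[body]


def expand_references(text):
    """Return (deletion mask, filtered text) for a run of character data."""
    chars = list(text)
    delete = [False] * len(text)
    for m in REFERENCE.finditer(text):
        # Keep the first position; it receives the replacement character.
        chars[m.start()] = resolve(m.group(1))
        for i in range(m.start() + 1, m.end()):
            delete[i] = True
    mask = "".join("1" if d else "0" for d in delete)
    kept = "".join(c for c, d in zip(chars, delete) if not d)
    return mask, kept


amp_result = expand_references("&amp;")
lt_result = expand_references("&#x3C;")
mixed_result = expand_references("a&amp;b&#x3C;c")
```
]]></programlisting>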
613         <para> The final step of combined filtering occurs during the process of reducing markup
614            data to tag bytes preceding each significant XML transition as described in
615              <xref linkend="contentstream"/>. Overall, icXML avoids separate buffer copying
616            operations for each of these filtering steps, paying the cost of parallel deletion
617            and inverse transposition only once. Currently, icXML employs the parallel-prefix
618            compress algorithm of Steele <citation linkend="HackersDelight"/>. Performance is independent of the
619            number of positions deleted. Future versions of icXML are expected to take advantage of
620            the parallel extract operation <citation linkend="HilewitzLee2006"/> that Intel is now providing in its
621            Haswell architecture. </para>
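         <para> For reference, the compress operation can be written for 32-bit words as below.
            This is a Python transcription of the well-known parallel-prefix formulation, checked
            against a bit-at-a-time version; the word size and function names are choices made
            for the example, and icXML's own SIMD code is not shown. The mask selects the
            positions to keep, so a deletion mask must be complemented first. </para>
<programlisting><![CDATA[
```python
MASK32 = 0xFFFFFFFF


def compress(x, m):
    """Move the bits of x selected by m to the low end, preserving order."""
    x &= m
    mk = (~m << 1) & MASK32                 # marks bits with a 0 just below them
    for i in range(5):                      # log2(32) rounds, whatever m is
        mp = mk ^ ((mk << 1) & MASK32)      # parallel prefix (XOR scan)
        mp ^= (mp << 2) & MASK32
        mp ^= (mp << 4) & MASK32
        mp ^= (mp << 8) & MASK32
        mp ^= (mp << 16) & MASK32
        mv = mp & m                         # bits that move in this round
        m = (m ^ mv) | (mv >> (1 << i))
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))
        mk &= ~mp
    return x


def compress_naive(x, m):
    """Sequential reference: one selected bit at a time."""
    out, k = 0, 0
    for i in range(32):
        if (m >> i) & 1:
            out |= ((x >> i) & 1) << k
            k += 1
    return out


delmask = 0b001001
keep = ~delmask & MASK32
filtered = compress(0b110110, keep)
```
]]></programlisting>
         <para> The loop always runs five rounds, which is why the cost does not depend on how
            many positions are deleted. </para>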
622      </section>
623      <section xml:id="contentstream">
624         <title>Content Stream</title>
625         <para> A relatively unique concept for icXML is the use of a filtered content stream.
626            Rather than parsing an XML document in its original format, the input is transformed
627            into one that is easier for the parser to iterate through and produce the sequential
628            output. <!-- FIGURE REF Figure~\ref{fig:parabix2} --> In the example discussed below, the source data
629            <!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
630            is transformed into <!-- CODE --> the content stream <code><emphasis role="ital">0</emphasis>&gt;fee<emphasis role="ital">0</emphasis>=fie<emphasis role="ital">0</emphasis>=foe<emphasis role="ital">0</emphasis>&gt;<emphasis role="ital">0</emphasis>/fum<emphasis role="ital">0</emphasis>/</code>
631            <!--``<emphasis role="ital">0</emphasis>\verb`>fee`<emphasis role="ital">0</emphasis>\verb`=fie`<emphasis role="ital">0</emphasis>\verb`=foe`<emphasis role="ital">0</emphasis>\verb`>`<emphasis role="ital">0</emphasis>\verb`/fum`<emphasis role="ital">0</emphasis>\verb`/`''-->
632            through the parallel filtering algorithm described in <xref linkend="par-filter"/>. </para>
633         <para> Combined with the symbol stream, the parser traverses the content stream to
634            effectively reconstruct the input document in its output form. The initial <emphasis
635               role="ital">0</emphasis> indicates an empty content string. The following
636               <code>&gt;</code> indicates that a start tag without any attributes is the first
637            element in this text and the first unused symbol, <code>document</code>, is the element
638            name. Succeeding that is the content string <code>fee</code>, which is null-terminated
639            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
640            operations are required to produce these strings, which, as
641              <xref linkend="xerces-profile"/> shows, account for 6.83% of Xerces's execution time.
642            Additionally, it is cheap to locate the terminal character of each string: using the
643            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
644            null character in the content stream in parallel, which in turn means the parser can
645            directly jump to the end of every string without scanning for it. </para>
646         <para> Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the
647            existence of an attribute. Because all of the intra-element validation was performed in the Parabix
648            Subsystem, this must be a legal attribute. Since attributes can only occur within start
649            tags and must be accompanied by a textual value, the next symbol in the symbol stream
650            must be the element name of a start tag, and the following one must be the name of the
651            attribute and the string that follows the <code>=</code> must be its value. However, the
652            subsequent <code>=</code> is not treated as an independent attribute because the parser
653            has yet to read a <code>&gt;</code>, which marks the end of a start tag. Thus only
654            one symbol is taken from the symbol stream and it (along with the string value) is added
655            to the element. Eventually the parser reaches a <code>/</code>, which marks the
656            existence of an end tag. Every end tag requires an element name, which means they
657            require a symbol. Inter-element validation is performed whenever an empty tag is detected to ensure
658            that the appropriate scope-nesting rules have been applied. </para>
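         <para> The traversal described above can be sketched as a small state machine. The
            layout assumed here (a null-terminated string followed by one tag byte, repeated) is
            inferred from the description and may differ from icXML's actual encoding; symbols
            are shown as names rather than GIDs, and the event tuples are hypothetical. </para>
<programlisting><![CDATA[
```python
def walk(content, symbols):
    """Produce SAX-like events from a content stream and a symbol stream."""
    symbols = iter(symbols)
    events = []
    pending = None          # [element name, attributes, current attribute name]
    pos = 0
    while pos < len(content):
        end = content.index("\0", pos)   # icXML would use precomputed offsets
        text, tag = content[pos:end], content[end + 1]
        pos = end + 2
        if pending is None:
            if text:
                events.append(("text", text))
            if tag == ">":               # start tag without attributes
                events.append(("start", next(symbols), {}))
            elif tag == "=":             # start tag with a first attribute
                pending = [next(symbols), {}, next(symbols)]
            elif tag == "/":             # end tag
                events.append(("end", next(symbols)))
            else:
                raise ValueError("unexpected tag byte %r" % tag)
        else:
            pending[1][pending[2]] = text    # string is the attribute value
            if tag == "=":               # another attribute: one more symbol
                pending[2] = next(symbols)
            elif tag == ">":             # start tag is complete
                events.append(("start", pending[0], pending[1]))
                pending = None
            else:
                raise ValueError("unexpected tag byte %r" % tag)
    return events


content = "\0>fee\0=fie\0=foe\0>\0/fum\0/"
symbols = ["document", "element", "a1", "a2", "element", "document"]
events = walk(content, symbols)
```
]]></programlisting>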
659      </section>
660      <section xml:id="namespace-handling">
661         <title>Namespace Handling</title>
662         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
663         <para> In XML, namespaces prevent naming conflicts when multiple vocabularies are used
664            together. This is especially important when a vocabulary has application-dependent meaning,
665            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
666            to uniform resource identifiers (URIs), which are strings used to identify specific
667            names or resources. On line 1 in <xref linkend="namespace-ex"/>, the <code>xmlns</code>
668            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
669               &apos;<code>pub.net</code>&apos; and the default (empty) prefix to
670               <code>book.org</code>. Thus to the XML processor, the <code>title</code> on line 2
671            and <code>price</code> on line 4 both read as
672            <code>&quot;book.org&quot;:title</code> and
673               <code>&quot;book.org&quot;:price</code> respectively, whereas on line 3 and
674            5, <code>p:name</code> and <code>price</code> are seen as
675               <code>&quot;pub.net&quot;:name</code> and
676               <code>&quot;pub.net&quot;:price</code>. Even though both elements share the name
677               <code>price</code>, due to namespace scoping rules they are viewed as two
678            uniquely-named items because the current vocabulary is determined by the namespace(s)
679            that are in-scope. </para>
680<table xml:id="namespace-ex">
681                  <caption>
682                     <para>XML Namespace Example</para>
683                  </caption>
684                  <colgroup>
685                     <col align="centre" valign="top"/>
686                     <col align="left" valign="top"/>
687                  </colgroup>
688                  <tbody>
689 <tr><td>1. </td><td><![CDATA[<book xmlns:p="pub.net" xmlns="book.org">]]> </td></tr>
690 <tr><td>2. </td><td><![CDATA[  <title>BOOK NAME</title>]]> </td></tr>
691 <tr><td>3. </td><td><![CDATA[  <p:name>PUBLISHER NAME</p:name>]]> </td></tr>
692 <tr><td>4. </td><td><![CDATA[  <price>X</price>]]> </td></tr>
693 <tr><td>5. </td><td><![CDATA[  <price xmlns="pub.net">Y</price>]]> </td></tr>
694 <tr><td>6. </td><td><![CDATA[</book>]]> </td></tr>
695                  </tbody>
696               </table>         
697
698         <para> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
699            persist for the lifetime of the application through the use of a global URI pool. Xerces
700            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
701            (end tag) occurs in the document. Because a namespace declaration affects the entire
702            element, it must be processed prior to grammar validation. This is a costly process
703            considering that a typical namespaced XML document only comes in one of two forms: (1)
704            those that declare a set of namespaces upfront and never change them, and (2) those that
705            repeatedly modify the namespaces in predictable patterns. </para>
706         <para> For that reason, icXML contains an independent namespace stack and utilizes bit
707            vectors to cheaply perform scope resolution. <!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
708            <!-- performance advantage figure?? average cycles/byte cost? --> When a prefix is
709            declared (e.g., <code>xmlns:p=&quot;pub.net&quot;</code>), a namespace binding
710            is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution
711            process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and
712            every prefix contains a bit vector marking every NSID that has ever been associated with
713              it within the document. For example, in <xref linkend="namespace-ex"/>, the prefix binding
714            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
715            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
716            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
717            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
718            found using a bit-scan intrinsic. A namespace binding table, similar to
719            <xref linkend="namespace-binding"/>, provides the actual URI ID. </para>
720<table xml:id="namespace-binding">
721                  <caption>
722                     <para>Namespace Binding Table Example</para>
723                  </caption>
724                  <colgroup>
725                     <col align="centre" valign="top"/>
726                     <col align="centre" valign="top"/>
727                     <col align="centre" valign="top"/>
728                     <col align="centre" valign="top"/>
729                     <col align="centre" valign="top"/>
730                   </colgroup>
731                   <thead>
732                     <tr><th>NSID </th><th> Prefix </th><th> URI </th><th> Prefix ID </th><th> URI ID </th>
733                     </tr>
734                   </thead>
735                  <tbody>
736<tr><td>0 </td><td> <code> p</code> </td><td> <code> pub.net</code> </td><td> 0 </td><td> 0 </td></tr> 
737 <tr><td>1 </td><td> <code> xmlns</code> </td><td> <code> book.org</code> </td><td> 1 </td><td> 1 </td></tr> 
738 <tr><td>2 </td><td> <code> xmlns</code> </td><td> <code> pub.net</code> </td><td> 1 </td><td> 0 </td></tr> 
739                  </tbody>
740               </table>         
741         <para>
742            <!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
743            <!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
744            <!-- NSid = bitscan(VisiblePrefixBinding); -->
745            <!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
746         </para>
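         <para> The look-up steps (prefix vector, AND with the visible set, bit scan, table
            look-up) can be sketched as follows. The sketch assumes that bit
            <emphasis role="ital">n</emphasis> of every vector stands for NSID
            <emphasis role="ital">n</emphasis>, which is one possible layout and not necessarily
            the one used by icXML; the table contents follow the namespace example. </para>
<programlisting><![CDATA[
```python
# NSID -> URI ID, following the namespace binding table.
URI_ID_OF_NSID = [0, 1, 0]

# Prefix ID -> bit vector of every NSID ever bound to that prefix.
PREFIX_BINDINGS = {
    0: 0b001,    # p      -> NSID 0
    1: 0b110,    # xmlns  -> NSID 1 and NSID 2
}


def resolve_uri_id(prefix_id, visible):
    """Return the URI ID of the in-scope binding for a prefix."""
    candidates = PREFIX_BINDINGS[prefix_id] & visible
    if candidates == 0:
        raise LookupError("prefix is not bound in this scope")
    nsid = (candidates & -candidates).bit_length() - 1   # bit scan forward
    return URI_ID_OF_NSID[nsid]


outer_scope = 0b011      # lines 1 to 4: NSID 0 and NSID 1 are visible
inner_scope = 0b101      # line 5: the default prefix is rebound to NSID 2
```
]]></programlisting>
         <para> With these vectors the default prefix resolves to URI ID 1 in the outer scope and
            to URI ID 0 on line 5, while <code>p</code> resolves to URI ID 0 in both. </para>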
747         <para> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
748            any modification to the currently visible namespaces is calculated and stored within a
749            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
750            is found, the set of currently visible namespaces is XORed with the vector at the top of the
751            stack. This allows any number of changes to be performed at each scope-level in
752            constant time.
753            <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
754         </para>
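         <para> A sketch of the scope stack, under the same assumption that bit
            <emphasis role="ital">n</emphasis> stands for NSID <emphasis role="ital">n</emphasis>.
            The class and method names are hypothetical. Each start tag pushes the XOR difference
            between the old and new visible sets, so one XOR on the end tag restores the outer
            scope no matter how many bindings changed. </para>
<programlisting><![CDATA[
```python
class NamespaceScopes:
    """Tracks the currently visible namespace bindings as a bit vector."""

    def __init__(self):
        self.visible = 0
        self.stack = []

    def start_tag(self, declarations=()):
        """`declarations` holds (NSID, prefix bit vector) pairs for this tag."""
        new_visible = self.visible
        for nsid, prefix_vector in declarations:
            # Hide any other binding of the same prefix, then show this one.
            new_visible = (new_visible & ~prefix_vector) | (1 << nsid)
        self.stack.append(self.visible ^ new_visible)
        self.visible = new_visible

    def end_tag(self):
        self.visible ^= self.stack.pop()


scopes = NamespaceScopes()
history = []
scopes.start_tag([(0, 0b001), (1, 0b110)])   # <book xmlns:p=... xmlns=...>
history.append(scopes.visible)
scopes.start_tag()                           # <title>
scopes.end_tag()                             # </title>
scopes.start_tag([(2, 0b110)])               # <price xmlns=...>
history.append(scopes.visible)
scopes.end_tag()                             # </price>
history.append(scopes.visible)
scopes.end_tag()                             # </book>
history.append(scopes.visible)
```
]]></programlisting>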
755      </section>
756      <section xml:id="errorhandling">
757         <title>Error Handling</title>
758         <para>
759            <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
760            Xerces outputs error messages in two ways: through the programmer API and as thrown
761            objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to
762            assess whether the next character is legal; if not, the current state determines the
763            type and severity of the error. icXML emits errors in a similar manner, but
764            how it discovers them is substantially different. Recall from
765            <xref linkend="icxml-arch"/> that icXML is divided into two sections: the Parabix Subsystem and
766            Markup Processor, each with its own system for detecting and producing error messages. </para>
767         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
768            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
769            the byte-position of an error within a block, and the type of error is determined by the
770            equation that discovered it. The difficulty of error processing in this section is that
771            in Xerces the line and column number must be given with every error production. Two
772            major issues exist because of this: (1) line position adheres to XML white-space normalization
773            rules; as such, some sequences of characters, e.g., a carriage return followed by a line
774            feed, are counted as a single new line character; and (2) column position is counted in
775            characters, not bytes or code units; thus multi-code-unit code-points and surrogate
776            character pairs are all counted as a single column position. Note that typical XML
777            documents are error-free but the calculation of the line/column position is a constant
778            overhead in Xerces. <!-- that must be maintained in the case that one occurs. --> To
779            reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence
780            of the error and performs the minimal amount of book-keeping necessary to facilitate it.
781            icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates
782            the information within the Line Column Tracker (LCT). One of the CSA's major
783            responsibilities is transcoding an input text.
784            <!-- from some encoding format to near-output-ready UTF-16. --> During this process,
785            white-space normalization rules are applied and multi-code-unit and surrogate characters
786            are detected and validated. A <emphasis role="ital">line-feed bitstream</emphasis>,
787            which marks the positions of the normalized new line characters, is a natural
788            derivative of this process. Using an optimized population count algorithm, the line
789            count can be summarized cheaply for each valid block of text.
790            <!-- The optimization delays the counting process .... --> Column position is more
791            difficult to calculate. It is possible to scan backwards through the bitstream of new
792            line characters to determine the distance (in code-units) between the position at
793            which an error was detected and the last line feed. However, this distance may exceed
794            the actual character position for the reasons discussed in (2). To handle this, the
795            CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together
796            many relevant bitstreams, such as all trailing multi-code-unit and surrogate characters,
797            and any characters that were removed during the normalization process. When an error is
798            detected, the sum of those skipped positions is subtracted from the distance to
799            determine the actual column number. </para>
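         <para> The calculation can be sketched as follows for UTF-8 input. The sketch assumes
            1-based line and column numbers, treats the whole input as a single block, and builds
            the two streams with plain loops; the function names are hypothetical. </para>
<programlisting><![CDATA[
```python
def popcount(x):
    return bin(x).count("1")


def tracker_streams(data):
    """Return (line-feed stream, skip mask) for UTF-8 input."""
    cr = sum(1 << i for i, b in enumerate(data) if b == 0x0D)
    lf = sum(1 << i for i, b in enumerate(data) if b == 0x0A)
    crlf = (cr << 1) & lf                  # LF removed by normalization
    continuation = sum(1 << i for i, b in enumerate(data)
                       if (b & 0xC0) == 0x80)
    line_feeds = (lf | cr) ^ crlf          # one mark per normalized new line
    skip = continuation | crlf             # positions that are not a column
    return line_feeds, skip


def line_column(error_pos, line_feeds, skip):
    """Translate a byte position into a (line, column) pair."""
    before = (1 << error_pos) - 1          # all positions preceding the error
    lf_before = line_feeds & before
    line = 1 + popcount(lf_before)
    line_start = lf_before.bit_length()    # position just after the last LF
    span = before & ~((1 << line_start) - 1)
    column = 1 + popcount(span) - popcount(span & skip)
    return line, column


utf8_case = "a\n\u00e9\u00e9<".encode("utf-8")   # error at the '<', byte 6
crlf_case = b"a\r\nb"                            # error at the 'b', byte 3
utf8_result = line_column(6, *tracker_streams(utf8_case))
crlf_result = line_column(3, *tracker_streams(crlf_case))
```
]]></programlisting>
         <para> Only the population counts run per block during normal parsing; the column
            arithmetic is needed only once an error has actually been found. </para>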
800         <para> The Markup Processor is a state-driven machine. As such, error detection within it
801            is very similar to Xerces. However, reporting the correct line/column is a much more
802            difficult problem. The Markup Processor parses the content stream, which is a series of
803            tagged UTF-16 strings. Each string is normalized in accordance with the XML
804            specification. All symbol data and unnecessary whitespace is eliminated from the stream;
805            thus it is impossible to derive the current location using only the content stream. To
806            calculate the location, the Markup Processor borrows three additional pieces of
807            information from the Parabix Subsystem: the line-feed, skip mask, and a <emphasis
808               role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the
809            (code-unit) position of every datum that was suppressed from the source during the
810            production of the content stream. Armed with these, it is possible to calculate the
811            actual line/column using the same system as the Parabix Subsystem until the sum of the
812            negated deletion mask stream is equal to the current position. </para>
813      </section>
814   </section>
815
816   <section xml:id="multithread">
817      <title>Multithreading with Pipeline Parallelism</title>
818      <para> As discussed in section <xref linkend="background-xerces"/>, Xerces can be considered an FSM
819         application. Such applications are &quot;embarrassingly
820         sequential&quot; <citation linkend="Asanovic-EECS-2006-183"/> and notoriously difficult to
821         parallelize. However, icXML is designed to organize processing into logical layers. In
822         particular, layers within the Parabix Subsystem are designed to operate over significant
823         segments of input data before passing their outputs on for subsequent processing. This fits
824         well into the general model of pipeline parallelism, in which each thread is in charge of a
825         single module or group of modules. </para>
826      <para> The most straightforward division of work in icXML is to separate the Parabix Subsystem
827         and the Markup Processor into two distinct pipeline stages. The
828         resultant application, <emphasis role="ital">icXML-p</emphasis>, is a coarse-grained
829         software-pipeline application. In this case, the Parabix Subsystem thread
830               <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a
831         time and produces the content, symbol and URI streams, then stores them in a pre-allocated
832         shared data structure <code>S</code>. The Markup Processor thread
833            <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness
834         and grammar-based validation, and then provides parsed XML data to the application through
835         the Xerces API. The shared data structure is implemented using a ring buffer, where every
836         entry contains an independent set of data streams. In the examples of
837           <xref linkend="threads_timeline1"/>, the ring buffer has four entries. A
838         lock-free mechanism is applied to ensure that each entry can only be read or written by one
839         thread at a time. In  <xref linkend="threads_timeline1"/> the processing time of
840               <code>T<subscript>1</subscript></code> is longer than
841         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
842         waits for <code>T<subscript>1</subscript></code> to write to the shared memory. 
843         <xref linkend="threads_timeline2"/> illustrates the scenario in which
844         <code>T<subscript>1</subscript></code> is faster and must wait for
845            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
846         reuse the memory space. </para>
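      <para> The hand-off between the two stages can be illustrated with a minimal sketch. It is
         not the icXML implementation: <code>Segment</code>, <code>SegmentRing</code> and
         <code>run_pipeline</code> are hypothetical names, the segment payload is a plain string
         rather than a set of content, symbol and URI streams, and a simple single-producer,
         single-consumer ring built on C++11 atomics is assumed as the lock-free mechanism. </para>
<programlisting language="cpp"><![CDATA[
```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>

// Hypothetical stand-in for one ring-buffer entry. In icXML each entry would
// hold an independent set of content, symbol and URI streams.
struct Segment {
    std::string data;
};

// Single-producer/single-consumer ring buffer with N entries.
// head_ is advanced only by the consumer and tail_ only by the producer,
// so each entry is owned by exactly one thread at a time without locks.
template <std::size_t N>
class SegmentRing {
public:
    // Stage 1: wait until a free entry exists, fill it, then publish it.
    void produce(const Segment& s) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        while (t - head_.load(std::memory_order_acquire) == N) {
            std::this_thread::yield();  // T1 waits for T2 to free an entry
        }
        entries_[t % N] = s;
        tail_.store(t + 1, std::memory_order_release);
    }

    // Stage 2: wait until a published entry exists, read it, then release it.
    Segment consume() {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        while (tail_.load(std::memory_order_acquire) == h) {
            std::this_thread::yield();  // T2 waits for T1 to publish an entry
        }
        Segment s = entries_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return s;
    }

private:
    std::array<Segment, N> entries_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Runs a two-stage pipeline over `input`, cut into `segment_size`-byte
// segments, and returns the bytes as reassembled by the second stage.
std::string run_pipeline(const std::string& input, std::size_t segment_size) {
    assert(segment_size > 0);
    SegmentRing<4> ring;  // four entries, as in the timeline figures
    const std::size_t count = (input.size() + segment_size - 1) / segment_size;
    std::string output;

    std::thread t1([&] {  // stands in for the Parabix Subsystem thread
        for (std::size_t i = 0; i < count; ++i) {
            ring.produce(Segment{input.substr(i * segment_size, segment_size)});
        }
    });
    std::thread t2([&] {  // stands in for the Markup Processor thread
        for (std::size_t i = 0; i < count; ++i) {
            output += ring.consume().data;
        }
    });
    t1.join();
    t2.join();
    return output;
}
```
]]></programlisting>
      <para> Calling <code>run_pipeline(text, 16384)</code> pushes the input through the ring in
         16384-byte segments. Whichever stage is slower determines which of the two wait loops
         is exercised, which corresponds to the two timeline scenarios. </para>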
847      <para>
848        <figure xml:id="threads_timeline1">
849          <title>Thread Balance in Two-Stage Pipelines: Stage 1 Dominant</title>
850          <mediaobject>
851            <imageobject>
852              <imagedata format="png" fileref="threads_timeline1.png" width="500cm"/>
853            </imageobject>
854          </mediaobject>
855         </figure>
856        <figure xml:id="threads_timeline2">
857          <title>Thread Balance in Two-Stage Pipelines: Stage 2 Dominant</title>
858        <mediaobject>
859            <imageobject>
860              <imagedata format="png" fileref="threads_timeline2.png" width="500cm"/>
861            </imageobject>
862          </mediaobject>
863        </figure>
864      </para>
865      <para> Overall, our design is intended to benefit a range of applications. Conceptually, we
866         consider two design points. In the first, the parsing performed by the Parabix Subsystem
867         dominates at 67% of the overall cost, with the cost of application processing (including
868         the driver logic within the Markup Processor) at 33%. The second is almost the opposite
869         scenario: the cost of application processing dominates at 60%, while the cost of XML
870         parsing represents an overhead of 40%. </para>
871      <para> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
872         100% improvement in the parsing engine itself. Consider the best-case scenario of a 100% improvement
873         of the Parabix Subsystem at the design point in which XML parsing dominates at 67% of the
874         total application cost. In this case, the single-threaded icXML should achieve a 1.5x
875         speedup over Xerces so that the total application cost reduces to 67% of the original.
876         However, in icXML-p, our ideal scenario gives us two well-balanced threads each performing
877         about 33% of the original work. In this case, Amdahl's law predicts that we could expect up
878         to a 3x speedup at best. </para>
879      <para> At the other extreme of our design range, we consider an application in which core
880         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
881         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
882         significant aspect of our two-stage multi-threaded design then becomes the ability to hide
883         the entire latency of parsing within the serial time required by the application. In this
884         case, we achieve an overall speedup in processing time of 1.67x (1/0.60). </para>
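      <para> The arithmetic behind both design points can be checked with a small model. This is
         an idealized sketch, not a measurement: the function names are hypothetical, the
         pipelined run time is assumed to equal the time of the slower stage, and
         synchronization overhead is ignored. </para>
<programlisting language="cpp"><![CDATA[
```cpp
#include <algorithm>
#include <cassert>

// Predicted speedup over Xerces for single-threaded icXML.
// parse_fraction: share of the original run time spent in core parsing (0..1).
// parse_speedup:  assumed speedup of the Parabix Subsystem over that core.
double single_thread_speedup(double parse_fraction, double parse_speedup) {
    assert(parse_fraction >= 0.0 && parse_fraction <= 1.0);
    assert(parse_speedup > 0.0);
    const double parse_time = parse_fraction / parse_speedup;
    const double app_time = 1.0 - parse_fraction;
    return 1.0 / (parse_time + app_time);
}

// Predicted speedup for the two-stage pipeline, idealized so that the run
// time equals the time of the slower of the two stages.
double pipelined_speedup(double parse_fraction, double parse_speedup) {
    assert(parse_fraction >= 0.0 && parse_fraction <= 1.0);
    assert(parse_speedup > 0.0);
    const double parse_time = parse_fraction / parse_speedup;
    const double app_time = 1.0 - parse_fraction;
    return 1.0 / std::max(parse_time, app_time);
}
```
]]></programlisting>
      <para> With a parsing share of 0.67 and a 2x parsing speedup, the model gives about 1.5x
         single-threaded and about 3x pipelined. With a parsing share of 0.40 it gives 1.25x and
         about 1.67x, matching the figures quoted above. </para>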
885      <para> Although the structure of the Parabix Subsystem allows division of the work into
886         several pipeline stages and has been demonstrated to be effective for four pipeline stages
887         in a research prototype <citation linkend="HPCA2012"/>, our analysis here suggests that the further
888         pipelining of work within the Parabix Subsystem is not worthwhile if the cost of
889         application logic is as little as 33% of the end-to-end cost using Xerces. To achieve benefits
890         of further parallelization with multi-core technology, there would need to be reductions in
891         the cost of application logic that could match reductions in core parsing cost. </para>
892   </section>
893
894   <section xml:id="performance">
895      <title>Performance</title>
896      <para> We evaluate Xerces-C++ 3.1.1, icXML, and icXML-p using two benchmarking applications: the
897         Xerces C++ SAXCount sample application, and a real-world GML to SVG transformation
898         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
899         Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core)
900         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
901         12.04 (Linux). </para>
902      <para> We analyzed the execution profiles of each XML parser using the performance counters
903         found in the processor. We chose several key hardware events that provide insight into the
904         profile of each application and indicate if the processor is doing useful work. The
905         events included in our study are processor cycles, branch instructions, branch
906         mispredictions, and cache misses. The Performance Application Programming Interface (PAPI)
907         Version 5.5.0 <citation linkend="papi"/> toolkit was installed on the test system to facilitate the
908         collection of hardware performance monitoring statistics. In addition, we used the Linux
909         perf <citation linkend="perf"/> utility to collect per core hardware events. </para>
910      <section>
911         <title>Xerces C++ SAXCount</title>
912         <para> Xerces comes with sample applications that demonstrate salient features of the
913            parser. SAXCount is the simplest such application: it counts the elements, attributes
914            and characters of a given XML file using the (event based) SAX API and prints out the
915            totals. </para>
916
917 <para> <xref linkend="XMLdocs"/> shows the document characteristics of the XML input files
918            selected for the Xerces C++ SAXCount benchmark. The jaw.xml file represents document-oriented
919            XML inputs and contains the three-byte and four-byte UTF-8 sequences required for the
920            encoding of Japanese characters. The remaining data files are data-oriented XML
921            documents and consist entirely of single-byte encoded ASCII characters.
922  <table xml:id="XMLdocs">
923                  <caption>
924                     <para>XML Document Characteristics</para>
925                  </caption>
926                  <colgroup>
927                     <col align="left" valign="top"/>
928                     <col align="centre" valign="top"/>
929                     <col align="centre" valign="top"/>
930                     <col align="centre" valign="top"/>
931                     <col align="centre" valign="top"/>
932                  </colgroup>
933                  <tbody>
934 <tr><td>File Name              </td><td> jaw.xml               </td><td> road.gml      </td><td> po.xml        </td><td> soap.xml </td></tr> 
935<tr><td>File Type               </td><td> document              </td><td> data          </td><td> data          </td><td> data   </td></tr>     
936<tr><td>File Size (kB)          </td><td> 7343                  </td><td> 11584         </td><td> 76450         </td><td> 2717 </td></tr> 
937<tr><td>Markup Item Count       </td><td> 74882                 </td><td> 280724        </td><td> 4634110       </td><td> 18004 </td></tr> 
938  <tr><td>Markup Density                </td><td> 0.13                  </td><td> 0.57          </td><td> 0.76          </td><td> 0.87  </td></tr> 
939                  </tbody>
940               </table>           
941</para>     
942         <para> A key predictor of the overall parsing performance of an XML file is markup
943           density<footnote><para>Markup Density: the ratio of the markup bytes used to define the structure
944             of the document to its file size.</para></footnote>. This metric has substantial influence on the
945            performance of traditional recursive descent XML parsers because it directly corresponds
946            to the number of state transitions that occur when parsing a document. We use a mixture
947            of document-oriented and data-oriented XML files to analyze performance over a spectrum
948            of markup densities. </para>
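         <para> As a rough illustration of the metric, markup density can be approximated by
            counting the bytes that fall inside tags. The sketch below is an assumption-laden
            simplification and not the procedure used to produce the table: the function name
            <code>markup_density</code> is hypothetical, and entity references as well as
            closing angle brackets inside comments, CDATA sections or attribute values are not
            handled. </para>
<programlisting language="cpp"><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Crude approximation of markup density: every byte from '<' through the
// next '>' (inclusive) is counted as markup, everything else as content.
double markup_density(const std::string& xml) {
    if (xml.empty()) {
        return 0.0;  // avoid dividing by zero for an empty document
    }
    std::size_t markup_bytes = 0;
    bool in_tag = false;
    for (char c : xml) {
        if (c == '<') {
            in_tag = true;
        }
        if (in_tag) {
            ++markup_bytes;
        }
        if (c == '>') {
            in_tag = false;
        }
    }
    assert(markup_bytes <= xml.size());
    return static_cast<double>(markup_bytes) / static_cast<double>(xml.size());
}
```
]]></programlisting>
         <para> For a nine-byte document consisting of an <code>a</code> element with the
            two-character content <code>hi</code>, seven bytes are tag bytes, so the function
            returns 7/9. </para>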
949         <para> <xref linkend="perf_SAX"/> compares the performance of Xerces, icXML and pipelined icXML
950            in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over
951            Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p can achieve a
952            speedup of up to 2.7x. Xerces is substantially slowed by dense markup but icXML is less
953            affected through a reduction in branches and the use of parallel-processing techniques.
954            icXML-p performs better as markup-density increases because the work performed by each
955            stage is well balanced in this application. </para>
956         <para>
957        <figure xml:id="perf_SAX">
958          <title>SAXCount Performance Comparison</title>
959          <mediaobject>
960            <imageobject>
961              <imagedata format="png" fileref="perf_SAX.png" width="500cm"/>
962            </imageobject>
963          </mediaobject>
964          <caption>
965          </caption>
966        </figure>
967         </para>
968      </section>
969      <section>
970         <title>GML2SVG</title>
971<para>   As a more substantial application of XML processing, the GML-to-SVG (GML2SVG) application
972was chosen.   This application transforms geospatial data encoded in
973Geography Markup Language (GML) <citation linkend="lake2004geography"/>
974into a different XML format suitable for displaying maps:
975Scalable Vector Graphics (SVG) <citation linkend="lu2007advances"/>. In the GML2SVG benchmark, GML feature element
976and GML geometry element tags are matched. GML coordinate data are then extracted
977and transformed to the corresponding SVG path data encodings.
978Equivalent SVG path elements are generated and output to the destination
979SVG document.  The GML2SVG application is thus considered typical of a broad
980class of XML applications that parse and extract information from
981a known XML format for the purpose of analysis and restructuring to meet
982the requirements of an alternative format.</para>
983
984<para>Our GML to SVG data translations are executed on GML source data
985modelling the city of Vancouver, British Columbia, Canada.
986The GML source document set
987consists of 46 distinct GML feature layers ranging in size from approximately 9 KB to 125.2 MB,
988with an average document size of 18.6 MB. Markup density ranges from approximately 0.045 to 0.719,
989with an average markup density of 0.519. In this performance study,
990213.4 MB of source GML data generates 91.9 MB of target SVG data.</para>
991
992
993        <figure xml:id="perf_GML2SVG">
994          <title>Performance Comparison for GML2SVG</title>
995          <mediaobject>
996            <imageobject>
997              <imagedata format="png" fileref="Throughput.png" width="500cm"/>
998            </imageobject>
999          </mediaobject>
1000          <caption>
1001          </caption>
1002        </figure>
1003       
1004<para><xref linkend="perf_GML2SVG"/> compares the performance of the GML2SVG application linked against
1005Xerces, icXML and icXML-p.
1006On the GML workload with this application, single-threaded icXML
1007achieved about a 50% acceleration over Xerces,
1008increasing throughput on our test machine from 58.3 MB/sec to 87.9 MB/sec.
1009Using icXML-p, a further throughput increase to 111 MB/sec was recorded,
1010approximately a 2x speedup over Xerces.</para>
1011
1012<para>An important aspect of icXML is the replacement of much branch-laden
1013sequential code inside Xerces with straight-line SIMD code using far
1014fewer branches.  <xref linkend="branchmiss_GML2SVG"/> shows the corresponding
1015improvement in branching behaviour, with a dramatic reduction in branch misses per kB.
1016It is also interesting to note that icXML-p goes even further.   
1017In essence, in using pipeline parallelism to split the instruction
1018stream onto separate cores, the branch target buffers on each core are
1019less overloaded and able to increase the successful branch prediction rate.</para>
1020
1021        <figure xml:id="branchmiss_GML2SVG">
1022          <title>Comparative Branch Misprediction Rate</title>
1023          <mediaobject>
1024            <imageobject>
1025              <imagedata format="png" fileref="BM.png" width="500cm"/>
1026            </imageobject>
1027          </mediaobject>
1028          <caption>
1029          </caption>
1030        </figure>
1031
1032<para>The behaviour of the three versions with respect to L1 cache misses per kB is shown
1033in <xref linkend="cachemiss_GML2SVG"/>.   Improvements are shown in both instruction-
1034and data-cache performance with the improvements in instruction-cache
1035behaviour the most dramatic.   Single-threaded icXML shows substantially improved
1036performance over Xerces on both measures.   
1037Although icXML-p is slightly worse with respect to data-cache performance,
1038this is more than offset by a further dramatic reduction in instruction-cache miss rate.
1039Again, partitioning the instruction stream through the pipeline parallelism model has
1040a significant benefit.</para>
1041
1042        <figure xml:id="cachemiss_GML2SVG">
1043          <title>Comparative Cache Miss Rate</title>
1044          <mediaobject>
1045            <imageobject>
1046              <imagedata format="png" fileref="CM.png" width="500cm"/>
1047            </imageobject>
1048          </mediaobject>
1049          <caption>
1050          </caption>
1051        </figure>
1052
1053<para>One caveat with this study is that, in the GML2SVG application, the share of
1054processing time spent in application code rather than Xerces library
1055code did not reach the 33% figure.  This suggests that for this application and
1056possibly others, further separating the logical layers of the
1057icXML engine into different pipeline stages could well offer significant benefit.
1058This remains an area of ongoing work.</para>
1059      </section>
1060   </section>
1061
1062   <section xml:id="conclusion">
1063      <title>Conclusion and Future Work</title>
1064      <para> This paper is the first case study documenting the significant performance benefits
1065         that may be realized through the integration of parallel bitstream technology into existing
1066         widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined
1067         integration of SIMD and multicore parallelism was shown capable of producing
1068         dramatic increases in throughput and reductions in branch mispredictions and cache misses.
1069         The modified parser, going under the name icXML, is designed to provide the full
1070         functionality of the original Xerces library with complete compatibility of APIs. Although
1071         substantial re-engineering was required to realize the performance potential of parallel
1072         technologies, this is an important case study demonstrating the general feasibility of
1073         these techniques. </para>
1074      <para> The further development of icXML to move beyond 2-stage pipeline parallelism is
1075         ongoing, with realistic prospects for four reasonably balanced stages within the library.
1076         For applications such as GML2SVG which are dominated by time spent on XML parsing, such a
1077         multistage pipelined parsing library should offer substantial benefits. </para>
1078      <para> The example of XML parsing may be considered prototypical of finite-state machine
1079         applications, which have sometimes been considered &quot;embarrassingly
1080         sequential&quot; and so difficult to parallelize that &quot;nothing
1081         works.&quot; The case study presented here should therefore be considered an important data
1082         point in making the case that parallelization can indeed be helpful across a broad array of
1083         application types. </para>
1084      <para> To overcome the software engineering challenges in applying parallel bitstream
1085         technology to existing software systems, it is clear that better library and tool support
1086         is needed. The techniques used in the implementation of icXML and documented in this paper
1087         could well be generalized for applications in other contexts and automated through the
1088         creation of compiler technology specifically supporting parallel bitstream programming.
1089      </para>
1090   </section>
1091
1092   <!-- 
1093   <section>
1094      <title>Acknowledgments</title>
1095      <para></para>
1096   </section>
1097-->
1098<!--
1099   <bibliography>
1100      <title>Bibliography</title>
1101      <bibliomixed xml:id="XMLChip09" xreflabel="Leventhal and Lemoine 2009">Leventhal, Michael and
1102         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
1103         Processing XML Efficiently 2009, Montréal.</bibliomixed>
1104      <bibliomixed xml:id="Datapower09" xreflabel="Salz, Achilles and Maze 2009">Salz, Richard,
1105         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
1106         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
1107         Efficiently 2009, Montréal.</bibliomixed>
1108      <bibliomixed xml:id="PPoPP08" xreflabel="Cameron 2007">Cameron, Robert D. 2007. A Case Study
1109         in SIMD Text Processing with Parallel Bit Streams UTF-8 to UTF-16 Transcoding. Proceedings
1110         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
1111         Lake City, Utah. On the Web at <link>http://research.ihost.com/ppopp08/</link>.</bibliomixed>
1112      <bibliomixed xml:id="CASCON08" xreflabel="Cameron, Herdy and Lin 2008">Cameron, Robert D.,
1113         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
1114         Technology. Proceedings of CASCON 2008. 13th ACM SIGPLAN Symposium on Principles and
1115         Practice of Parallel Programming 2008, Toronto.</bibliomixed>
1116      <bibliomixed xml:id="SVGOpen08" xreflabel="Herdy, Burggraf and Cameron 2008">Herdy, Kenneth
1117         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
1118         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
1119         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
1120         Nuremburg. On the Web at
1121            <link>http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/</link>.</bibliomixed>
1122      <bibliomixed xml:id="Ross06" xreflabel="Ross 2006">Ross, Kenneth A. 2006. Efficient hash
1123         probes on modern processors. Proceedings of ICDE, 2006. ICDE 2006, Atlanta. On the Web at
1124            <link>www.cs.columbia.edu/~kar/pubsk/icde2007.pdf</link>.</bibliomixed>
1125      <bibliomixed xml:id="ASPLOS09" xreflabel="Cameron and Lin 2009">Cameron, Robert D. and Dan
1126         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
1127         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</bibliomixed>
1128      <bibliomixed xml:id="Wu08" xreflabel="Wu et al. 2008">Wu, Yu, Qi Zhang, Zhiqiang Yu and
1129         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
1130         Proceedings of Balisage 2008, Montréal. On the Web at
1131            <link>http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html</link>.</bibliomixed>
1132      <bibliomixed xml:id="u8u16" xreflabel="Cameron 2008">u8u16 - A High-Speed UTF-8 to UTF-16
1133         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
1134         Science Simon Fraser University, June 21 2007.</bibliomixed>
1135      <bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth
1136         Edition) W3C Recommendation 26 November 2008. On the Web at
1137            <link>http://www.w3.org/TR/REC-xml/</link>.</bibliomixed>
1138      <bibliomixed xml:id="Unicode" xreflabel="Unicode">The Unicode Consortium. 2009. On the Web at
1139            <link>http://unicode.org/</link>.</bibliomixed>
1140      <bibliomixed xml:id="Pex06" xreflabel="Hilewitz and Lee 2006"> Hilewitz, Y. and Ruby B. Lee.
1141         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
1142         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
1143         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</bibliomixed>
1144      <bibliomixed xml:id="InfoSet" xreflabel="XML Infoset">XML Information Set (Second Edition) W3C
1145         Recommendation 4 February 2004. On the Web at
1146         <link>http://www.w3.org/TR/xml-infoset/</link>.</bibliomixed>
1147      <bibliomixed xml:id="Saxon" xreflabel="Saxon">SAXON The XSLT and XQuery Processor. On the Web
1148         at <link>http://saxon.sourceforge.net/</link>.</bibliomixed>
1149      <bibliomixed xml:id="Kay08" xreflabel="Kay 2008"> Kay, Michael Y. 2008. Ten Reasons Why Saxon
1150         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</bibliomixed>
1151      <bibliomixed xml:id="AElfred" xreflabel="Ælfred"> The Ælfred XML Parser. On the Web at
1152            <link>http://saxon.sourceforge.net/aelfred.html</link>.</bibliomixed>
1153      <bibliomixed xml:id="JNI" xreflabel="Hitchens 2002">Hitchens, Ron. Java NIO. O'Reilly, 2002.</bibliomixed>
1154      <bibliomixed xml:id="Expat" xreflabel="Expat">The Expat XML Parser.
1155            <link>http://expat.sourceforge.net/</link>.</bibliomixed>
1156      <bibliomixed xml:id="GRID2006">   </bibliomixed>
1157
1158      <bibliomixed xml:id="IPDPS2008">  </bibliomixed>
1159
1160      <bibliomixed xml:id="HPCC2011">   </bibliomixed>
1161
1162      <bibliomixed xml:id="ParaDOM2009">        </bibliomixed>
1163
1164      <bibliomixed xml:id="ICWS2008">   </bibliomixed>
1165
1166      <bibliomixed xml:id="HPCA2012">
1167        </bibliomixed>
1168      <bibliomixed xml:id="E-SCIENCE2007">
1169        </bibliomixed>
1170      <bibliomixed xml:id="XMLSSE42">
1171        </bibliomixed>
1172      <bibliomixed xml:id="Cameron2009">
1173        </bibliomixed>
1174      <bibliomixed xml:id="cameron-EuroPar2011">
1175        </bibliomixed>
1176      <bibliomixed xml:id="Cameron2008">
1177        </bibliomixed>
1178      <bibliomixed xml:id="CameronHerdyLin2008">
1179        </bibliomixed>
1180      <bibliomixed xml:id="HackersDelight">
1181        </bibliomixed>
1182      <bibliomixed xml:id="HilewitzLee2006">
1183        </bibliomixed>
1184      <bibliomixed xml:id="Asanovic-EECS-2006-183">
1185        </bibliomixed>
1186      <bibliomixed xml:id="papi">
1187        </bibliomixed>
1188      <bibliomixed xml:id="perf">
1189        </bibliomixed>
1190      <bibliomixed xml:id="lake2004geography">
1191        </bibliomixed>
1192      <bibliomixed xml:id="lu2007advances">
1193        </bibliomixed>
1194   </bibliography>
1195-->
1196<bibliography>
1197  <title>Bibliography</title>
1198  <bibliomixed xml:id="CameronHerdyLin2008" xreflabel="Cameron and Herdy 2008">Cameron, Robert D., Herdy, Kenneth S. and Lin, Dan. High performance XML parsing using parallel bit stream technology. CASCON'08: Proc. 2008 conference of the center for advanced studies on collaborative research. 2008 New York, NY, USA</bibliomixed>
1199  <bibliomixed xml:id="papi" xreflabel="Innovative Computing Laboratory">Innovative Computing Laboratory, University of Tennessee. Performance Application Programming Interface. <link>http://icl.cs.utk.edu/papi/</link></bibliomixed>
1200  <bibliomixed xml:id="perf" xreflabel="Eranian and Gouriou">Eranian, Stephane, Gouriou, Eric, Moseley, Tipp and Bruijn, Willem de. Linux kernel profiling with perf.<link>https://perf.wiki.kernel.org/index.php/Tutorial</link></bibliomixed>
1201  <bibliomixed xml:id="Cameron2008" xreflabel="Cameron 2008">Cameron, Robert D. A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding. Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2008 New York, NY, USA</bibliomixed>
1202  <bibliomixed xml:id="ParaDOM2009" xreflabel="Shah and Rao 2009">Shah, Bhavik, Rao, Praveen, Moon, Bongki and Rajagopalan, Mohan. A Data Parallel Algorithm for XML DOM Parsing. Database and XML Technologies. 2009</bibliomixed>
1203  <bibliomixed xml:id="XMLSSE42" xreflabel="Lei 2008">Lei, Zhai. XML Parsing Accelerator with Intel Streaming SIMD Extensions 4 (Intel SSE4). 2008<link>Intel Software Network</link></bibliomixed>
1204  <bibliomixed xml:id="Cameron2009" xreflabel="Cameron and Herdy 2009">Cameron, Rob, Herdy, Ken and Amiri, Ehsan. Parallel Bit Stream Technology as a Foundation for XML Parsing Performance. Int'l Symposium on Processing XML Efficiently: Overcoming Limits on Space, Time, or Bandwidth. 2009</bibliomixed>
1205  <bibliomixed xml:id="HilewitzLee2006" xreflabel="Hilewitz and Lee 2006">Hilewitz, Yedidya and Lee, Ruby B. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions. ASAP '06: Proc. IEEE 17th Int'l Conference on Application-specific Systems, Architectures and Processors. 2006 Washington, DC, USA</bibliomixed>
1206  <bibliomixed xml:id="Asanovic-EECS-2006-183" xreflabel="Asanovic and others 2006">Asanovic, Krste and others. The Landscape of Parallel Computing Research: A View from Berkeley. 2006</bibliomixed>
1207  <bibliomixed xml:id="GRID2006" xreflabel="Lu and Chiu 2006">Lu, Wei, Chiu, Kenneth and Pan, Yinfei. A Parallel Approach to XML Parsing. Proceedings of the 7th IEEE/ACM International Conference on Grid Computing. 2006 Washington, DC, USA</bibliomixed>
1208  <bibliomixed xml:id="cameron-EuroPar2011" xreflabel="Cameron and Amiri 2011">Cameron, Robert D., Amiri, Ehsan, Herdy, Kenneth S., Lin, Dan, Shermer, Thomas C. and Popowich, Fred P. Parallel Scanning with Bitstream Addition: An XML Case Study. Euro-Par 2011, LNCS 6853, Part II. 2011 Berlin, Heidelberg</bibliomixed>
1209  <bibliomixed xml:id="HPCA2012" xreflabel="Lin and Medforth 2012">Lin, Dan, Medforth, Nigel, Herdy, Kenneth S., Shriraman, Arrvindh and Cameron, Rob. Parabix: Boosting the efficiency of text processing on commodity processors. International Symposium on High-Performance Computer Architecture. 2012 Los Alamitos, CA, USA</bibliomixed>
1210  <bibliomixed xml:id="HPCC2011" xreflabel="You and Wang 2011">You, Cheng-Han and Wang, Sheng-De. A Data Parallel Approach to XML Parsing and Query. 10th IEEE International Conference on High Performance Computing and Communications. 2011 Los Alamitos, CA, USA</bibliomixed>
1211  <bibliomixed xml:id="E-SCIENCE2007" xreflabel="Pan and Zhang 2007">Pan, Yinfei, Zhang, Ying, Chiu, Kenneth and Lu, Wei. Parallel XML Parsing Using Meta-DFAs. International Conference on e-Science and Grid Computing. 2007 Los Alamitos, CA, USA</bibliomixed>
1212  <bibliomixed xml:id="ICWS2008" xreflabel="Pan and Zhang 2008">Pan, Yinfei, Zhang, Ying and Chiu, Kenneth. Hybrid Parallelism for XML SAX Parsing. IEEE International Conference on Web Services. 2008 Los Alamitos, CA, USA</bibliomixed>
1213  <bibliomixed xml:id="IPDPS2008" xreflabel="Pan and Zhang 2008">Pan, Yinfei, Zhang, Ying and Chiu, Kenneth. Simultaneous transducers for data-parallel XML parsing. International Parallel and Distributed Processing Symposium. 2008 Los Alamitos, CA, USA</bibliomixed>
1214  <bibliomixed xml:id="HackersDelight" xreflabel="Warren 2002">Warren, Henry S. Hacker's Delight. 2002</bibliomixed>
1215  <bibliomixed xml:id="lu2007advances" xreflabel="Lu and Dos Santos 2007">Lu, C.T., Dos Santos, R.F., Sripada, L.N. and Kou, Y. Advances in GML for geospatial applications. 2007</bibliomixed>
1216  <bibliomixed xml:id="lake2004geography" xreflabel="Lake and Burggraf 2004">Lake, R., Burggraf, D.S., Trninic, M. and Rae, L. Geography mark-up language (GML) [foundation for the geo-web]. 2004</bibliomixed>
1217</bibliography>
1218
1219</article>