Changeset 3039


Ignore:
Timestamp:
Apr 18, 2013, 7:02:21 PM (6 years ago)
Author:
ksherdy
Message:

Initial translation. Special characters, figures, tables, bib, to go.

Location:
docs/Balisage13/Bal2013came0601
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/Balisage13/Bal2013came0601/Bal2013came0601.html

    r3038 r3039  
    22<head>
    33<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    4 <title>Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</title>
     4<title></title>
    55<link rel="stylesheet" href="balisage-plain.css" type="text/css">
    6 <meta name="keywords" content=", , ">
     6<meta name="keywords" content="">
    77</head>
    88<body>
     
    1212<div lang="en" class="article">
    1313<div class="titlepage">
    14 <h2 class="article-title" id="idp346280">Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</h2>
     14<h2 class="article-title" id="idp197760"></h2>
     15<div class="author">
     16<h3 class="author">Nigel Medforth</h3>
     17<div class="affiliation">
     18<p class="jobtitle"></p>
     19<p class="orgname"></p>
     20</div>
     21<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
     22</div>
     23<div class="author">
     24<h3 class="author">Dan Lin</h3>
     25<div class="affiliation">
     26<p class="jobtitle"></p>
     27<p class="orgname"></p>
     28</div>
     29<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
     30</div>
     31<div class="author">
     32<h3 class="author">Kenneth Herdy</h3>
     33<div class="affiliation">
     34<p class="jobtitle">Graduate Student, School of Computing Science</p>
     35<p class="orgname">Simon Fraser University </p>
     36</div>
     37<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:ksherdy@sfu.ca">ksherdy@sfu.ca</a>&gt;</code></h5>
     38</div>
    1539<div class="author">
    1640<h3 class="author">Rob Cameron</h3>
     
    2246</div>
    2347<div class="author">
    24 <h3 class="author">Ken Herdy</h3>
     48<h3 class="author">Arrvindh Shriraman</h3>
    2549<div class="affiliation">
    26 <p class="jobtitle">Graduate Student, School of Computing Science</p>
    27 <p class="orgname">Simon Fraser University </p>
    28 </div>
    29 <h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:ksherdy@cs.sfu.ca">ksherdy@cs.sfu.ca</a>&gt;</code></h5>
    30 </div>
    31 <div class="author">
    32 <h3 class="author">Ehsan Amiri</h3>
    33 <div class="affiliation">
    34 <p class="jobtitle">Graduate Student, School of Computing Science</p>
    35 <p class="orgname">Simon Fraser University</p>
    36 </div>
    37 <h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:eamiri@cs.sfu.ca">eamiri@cs.sfu.ca</a>&gt;</code></h5>
    38 </div>
    39 <div class="legalnotice" id="idp498304"><p id="idp498432">Copyright © 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
    40             This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
    41             Works 2.5 Canada License.</p></div>
     50<p class="jobtitle"></p>
     51<p class="orgname"></p>
     52</div>
     53<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:"></a>&gt;</code></h5>
     54</div>
    4255<div class="abstract">
    4356<p class="title"><b>Abstract</b></p>
    44 <p id="idp514976">By first transforming the octets (bytes) of XML texts into eight parallel bit
    45             streams, the SIMD features of commodity processors can be exploited for parallel
    46             processing of blocks of 128 input bytes at a time. Established transcoding and parsing
    47             techniques are reviewed followed by new techniques including parsing with bitstream
    48             addition. Further opportunities are discussed in light of expected advances in CPU
    49             architecture and compiler technology. Implications for various APIs and information
    50             models are presented as well opportunities for collaborative open-source
    51          development.</p>
     57<p id="idp286432">Prior research on the acceleration of XML processing
     58using SIMD and multi-core parallelism has lead to
     59a number of interesting research prototypes.  This work
     60investigates the extent to which the techniques underlying
     61these prototypes could result in systematic performance
     62benefits when fully integrated into a commercial XML parser.
     63The widely used Xerces-C++ parser of the Apache Software
     64Foundation was chosen as the foundation for the study.
     65A systematic restructuring of the parser was undertaken,
     66while maintaining the existing API for application programmers.
     67Using SIMD techniques alone, an increase in parsing speed
     68of at least 50% was observed in a range of applications.
     69When coupled with pipeline parallelism on dual core processors,
     70improvements of 2x and beyond were realized.
     71</p>
    5272</div>
    5373<hr>
     
    5676<p><b>Table of Contents</b></p>
    5777<dl>
    58 <dt><span class="section"><a href="#idp499912" class="toc">Introduction</a></span></dt>
    59 <dt><span class="section"><a href="#idp644272" class="toc">A Catalog of Parallel Bit Streams for XML</a></span></dt>
     78<dt><span class="section"><a href="#idp273984" class="toc">Introduction</a></span></dt>
     79<dt><span class="section"><a href="#idp274904" class="toc">Background</a></span></dt>
    6080<dd><dl>
    61 <dt><span class="section"><a href="#idp644640" class="toc">Introduction</a></span></dt>
    62 <dt><span class="section"><a href="#idp646032" class="toc">Basis Bit Streams</a></span></dt>
    63 <dt><span class="section"><a href="#idp696728" class="toc">General Streams</a></span></dt>
     81<dt><span class="section"><a href="#idp275240" class="toc">Xerces C++ Structure</a></span></dt>
     82<dt><span class="section"><a href="#idp280344" class="toc">The Parabix Framework</a></span></dt>
     83<dt><span class="section"><a href="#idp416432" class="toc">Sequential vs. Parallel Paradigm</a></span></dt>
     84</dl></dd>
     85<dt><span class="section"><a href="#idp419384" class="toc">Architecture</a></span></dt>
    6486<dd><dl>
    65 <dt><span class="section"><a href="#idp697328" class="toc">Deletion Mask Streams</a></span></dt>
    66 <dt><span class="section"><a href="#idp707448" class="toc">Error Flag Streams </a></span></dt>
     87<dt><span class="section"><a href="#idp419704" class="toc">Overview</a></span></dt>
     88<dt><span class="section"><a href="#idp429312" class="toc">Character Set Adapters</a></span></dt>
     89<dt><span class="section"><a href="#idp435504" class="toc">Combined Parallel Filtering</a></span></dt>
     90<dt><span class="section"><a href="#idp249920" class="toc">Content Stream</a></span></dt>
     91<dt><span class="section"><a href="#idp254672" class="toc">Namespace Handling</a></span></dt>
     92<dt><span class="section"><a href="#idp469568" class="toc">Error Handling</a></span></dt>
    6793</dl></dd>
    68 <dt><span class="section"><a href="#idp716704" class="toc">Lexical Item Streams</a></span></dt>
    69 <dt><span class="section"><a href="#idp755744" class="toc">UTF-8 Byte Classification, Scope and Validation Streams</a></span></dt>
     94<dt><span class="section"><a href="#idp476680" class="toc">Multithreading with Pipeline Parallelism</a></span></dt>
     95<dt><span class="section"><a href="#idp483472" class="toc">Performance</a></span></dt>
    7096<dd><dl>
    71 <dt><span class="section"><a href="#idp757104" class="toc">UTF-8 Byte Classification Streams</a></span></dt>
    72 <dt><span class="section"><a href="#idp758312" class="toc">UTF-8 Scope Streams</a></span></dt>
    73 <dt><span class="section"><a href="#idp778672" class="toc">UTF-8 Validation Streams</a></span></dt>
    74 <dt><span class="section"><a href="#idp781056" class="toc">XML Character Validation Streams</a></span></dt>
    75 <dt><span class="section"><a href="#idp782512" class="toc">UTF-8 to UTF-16 Transcoding</a></span></dt>
    76 <dt><span class="section"><a href="#idp785384" class="toc">UTF-8 Indexed UTF-16 Streams</a></span></dt>
     97<dt><span class="section"><a href="#idp485352" class="toc">Xerces C++ SAXCount</a></span></dt>
     98<dt><span class="section"><a href="#idp490336" class="toc">GML2SVG</a></span></dt>
    7799</dl></dd>
    78 <dt><span class="section"><a href="#idp786760" class="toc">Control Character Streams</a></span></dt>
    79 <dd><dl>
    80 <dt><span class="section"><a href="#idp787768" class="toc">XML Character Validation</a></span></dt>
    81 <dt><span class="section"><a href="#idp789136" class="toc">XML 1.0 End-of-line Handling</a></span></dt>
    82 </dl></dd>
    83 <dt><span class="section"><a href="#idp797864" class="toc">Call Out Streams</a></span></dt>
    84 <dd><dl>
    85 <dt><span class="section"><a href="#idp798776" class="toc">Comment, Processing Instruction and CDATA Section Call Out Streams</a></span></dt>
    86 <dt><span class="section"><a href="#idp807344" class="toc">Reference Call Out Streams</a></span></dt>
    87 <dt><span class="section"><a href="#idp809320" class="toc">Tag Call Out Streams</a></span></dt>
    88 </dl></dd>
    89 </dl></dd>
    90 <dt><span class="section"><a href="#idp822632" class="toc">SIMD Beyond Bitstreams: Names and Numbers</a></span></dt>
    91 <dd><dl>
    92 <dt><span class="section"><a href="#idp823448" class="toc">Name Lookup</a></span></dt>
    93 <dt><span class="section"><a href="#idp830280" class="toc">Numeric Processing</a></span></dt>
    94 </dl></dd>
    95 <dt><span class="section"><a href="#idp833080" class="toc">APIs and Parallel Bit Streams</a></span></dt>
    96 <dd><dl>
    97 <dt><span class="section"><a href="#idp833440" class="toc">The ILAX Streaming API</a></span></dt>
    98 <dt><span class="section"><a href="#idp836736" class="toc">Efficient XML in Java Using Array Set Models</a></span></dt>
    99 <dd><dl><dt><span class="section"><a href="#idp845360" class="toc">Saxon-B TinyTree Example</a></span></dt></dl></dd>
    100 </dl></dd>
    101 <dt><span class="section"><a href="#idp853000" class="toc">Compiler Technology</a></span></dt>
    102 <dd><dl>
    103 <dt><span class="section"><a href="#idp855592" class="toc">Character Class Compiler</a></span></dt>
    104 <dt><span class="section"><a href="#idp857544" class="toc">Regular Expression Compilation</a></span></dt>
    105 <dt><span class="section"><a href="#idp858856" class="toc">Unbounded Bit Stream Compilation</a></span></dt>
    106 </dl></dd>
    107 <dt><span class="section"><a href="#idp860384" class="toc">Conclusion</a></span></dt>
    108 <dt><span class="section"><a href="#idp864448" class="toc">Acknowledgments</a></span></dt>
     100<dt><span class="section"><a href="#idp490912" class="toc">Conclusion and Future Work</a></span></dt>
    109101</dl>
    110102</div>
    111 <div class="section" id="idp499912">
     103<div class="section" id="idp273984">
    112104<h2 class="title" style="clear: both">Introduction</h2>
    113 <p id="idp500256"> While particular XML applications may benefit from special-purpose hardware such as XML
    114          chips [<a class="xref" href="#XMLChip09" id="idp500496">Leventhal and Lemoine 2009</a>] or appliances [<a class="xref" href="#Datapower09" id="idp506976">Salz, Achilles and Maze 2009</a>], the bulk
    115          of the world's XML processing workload will continue to be handled by XML software stacks
    116          on commodity processors. Exploiting the SIMD capabilities of such processors such as the
    117          SSE instructions of x86 chips, parallel bit stream technology offers the potential of
    118          dramatic improvement over byte-at-a-time processing for a variety of XML processing tasks.
    119          Character set issues such as Unicode validation and transcoding [<a class="xref" href="#PPoPP08" id="idp507840">Cameron 2007</a>], normalization of line breaks and white space and XML character validation can be
    120          handled fully in parallel using this representation. Lexical item streams, such as the bit
    121          stream marking the positions of opening angle brackets, can also be formed in parallel.
    122          Bit-scan instructions of commodity processors may then be used on lexical item streams to
    123          implement rapid single-instruction scanning across variable-length multi-byte text blocks
    124          as in the Parabix XML parser [<a class="xref" href="#CASCON08" id="idp508744">Cameron, Herdy and Lin 2008</a>]. Overall, these techniques may be
    125          combined to yield end-to-end performance that may be 1.5X to 15X faster than alternatives
    126             [<a class="xref" href="#SVGOpen08" id="idp509256">Herdy, Burggraf and Cameron 2008</a>].</p>
    127 <p id="idp509720">Continued research in parallel bit stream techniques as well as more conventional
    128          application of SIMD techniques in XML processing offers further prospects for improvement
    129          of core XML components as well as for tackling performance-critical tasks further up the
    130          stack. A newly prototyped technique for parallel tag parsing using bitstream addition is
    131          expected to improve parsing performance even beyond that achieved using sequential bit
    132          scans. Several techniques for improved symbol table performance are being investigated,
    133          including parallel hash value calculation and length-based sorting using the cheap length
    134          determination afforded by bit scans. To deliver the benefits of parallel bit stream
    135          technology to the Java world, we are developing Array Set Model (ASM) representations of
    136          XML Infoset and other XML information models for efficient transmission across the JNI
    137          boundary.</p>
    138 <p id="idp510920">Amplifying these software advances, continuing hardware advances in commodity processors
    139          increase the relative advantage of parallel bit stream techniques over traditional
    140          byte-at-a-time processors. For example, the Intel Core architecture improved SSE processing
    141          to give superscalar execution of bitwise logic operations (3 instructions per cycle vs. 1
    142          in Pentium 4). Upcoming 256-bit AVX technology extends the register set and replaces
    143          destructive two-operand instructions with a nondestructive three-operand form. General
    144          purpose programming on graphic processing units (GPGPU) such as the upcoming 512-bit
    145          Larrabee processor may also be useful for XML applications using parallel bit streams. New
    146          instruction set architectures may also offer dramatic improvements in core algorithms.
    147          Using the relatively simple extensions to support the principle of inductive doubling, a 3X
    148          improvement in several core parallel bit stream algorithms may be achieved [<a class="xref" href="#ASPLOS09" id="idp638256">Cameron and Lin 2009</a>]. Other possibilities include direct implementation of parallel
    149          extract and parallel deposit (pex/pdep) instructions [<a class="xref" href="#Pex06" id="idp638640">Hilewitz and Lee 2006</a>], and
    150          bit-level interleave operations as in Larrabee, each of which would have important
    151          application to parallel bit stream processing.</p>
    152 <p id="idp639128">Further prospects for XML performance improvement arise from leveraging the
    153          intraregister parallelism of parallel bit stream technology to exploit the interchip
    154          parallelism of multicore computing. Parallel bit stream techniques can support multicore
    155          parallelism in both data partitioning and task partitioning models. For example, the
    156          datasection partitioning approach of Wu, Zhang, Yu and Li may be used to partition blocks
    157          for speculative parallel parsing on separate cores followed by a postprocessing step to
    158          join partial S-trees [<a class="xref" href="#Wu08" id="idp639856">Wu et al. 2008</a>].</p>
    159 <p id="idp640296">In our view, the established and expected performance advantages of parallel bit stream
    160          technology over traditional byte-at-a-time processing are so compelling that parallel bit
    161          stream technology should ultimately form the foundation of every high-performance XML
    162          software stack. We envision a common high-performance XML kernel that may be customized to
    163          a variety of processor architectures and that supports a wide range of existing and new XML
    164          APIs. Widespread deployment of this technology should greatly benefit the XML community in
    165          addressing both the deserved and undeserved criticism of XML on performance grounds. A
    166          further benefit of improved performance is a substantial greening of XML technologies.</p>
    167 <p id="idp641272">To complement our research program investigating fundamental algorithms and issues in
    168          high-performance XML processing, our work also involves development of open source software
    169          implementing these algorithms, with a goal of full conformance to relevant specifications.
    170          From the research perspective, this approach is valuable in ensuring that the full
    171          complexity of required XML processing is addressed in reporting and assessing processing
    172          results. However, our goal is also to use this open source software as a basis of
    173          technology transfer. A Simon Fraser University spin-off company, called International
    174          Characters, Inc., has been created to commercialize the results of this work using a
    175          patent-based open source model.</p>
    176 <p id="idp642272">To date, we have not yet been successful in establishing a broader community of
    177          participation with our open source code base. Within open-source communities, there is
    178          often a general antipathy towards software patents; this may limit engagement with our
    179          technology, even though it has been dedicated for free use in open source. </p>
    180 <p id="idp642832">A further complication is the inherent difficulty of SIMD programming in general, and
    181          parallel bit stream programming in particular. Considerable work is required with each new
    182          algorithmic technique being investigated as well as in retargetting our techniques for each
    183          new development in SIMD and multicore processor technologies. To address these concerns, we
    184          have increasingly shifted the emphasis of our research program towards compiler technology
    185          capable of generating parallel bit stream code from higher-level specifications.</p>
    186 </div>
    187 <div class="section" id="idp644272">
    188 <h2 class="title" style="clear: both">A Catalog of Parallel Bit Streams for XML</h2>
    189 <div class="section" id="idp644640">
    190 <h3 class="title" style="clear: both">Introduction</h3>
    191 <p id="idp644984">In this section, we introduce the fundamental concepts of parallel bit stream
    192             technology and present a comprehensive catalog of parallel bit streams for use in XML
    193             processing. In presenting this catalog, the focus is on the specification of the bit
    194             streams as data streams in one-to-one correspondence with the character code units of an
    195             input XML stream. The goal is to define these bit streams in the abstract without
    196             initially considering memory layouts, register widths or other issues related to
    197             particular target architectures. In cataloging these techniques, we also hope to convey
    198             a sense of the breadth of applications of parallel bit stream technology to XML
    199             processing tasks. </p>
    200 </div>
    201 <div class="section" id="idp646032">
    202 <h3 class="title" style="clear: both">Basis Bit Streams</h3>
    203 <p id="idp646376">Given a byte-oriented text stream represented in UTF-8, for example, we define a
    204             transform representation of this text consisting of a set of eight parallel bit streams
    205             for the individual bits of each byte. Thus, the <code class="code">Bit0</code> stream is the stream
    206             of bits consisting of bit 0 of each byte in the input byte stream, <code class="code">Bit1</code> is
    207             the bit stream consisting of bit 1 of each byte in the input stream and so on. The set
    208             of streams <code class="code">Bit0</code> through <code class="code">Bit7</code> are known as the <span class="ital">basis
    209                streams</span> of the parallel bit stream representation. The following table
    210             shows an example XML character stream together with its representation as a set of 8
    211             basis streams. <div class="table-wrapper" id="idp648288">
    212 <p class="title">Table I</p>
    213 <div class="caption"><p id="idp451472">XML Character Stream Transposition.</p></div>
    214 <table class="table">
    215 <colgroup span="1">
    216 <col align="left" valign="top" span="1">
    217 <col align="left" valign="top" span="1">
    218 <col align="left" valign="top" span="1">
    219 <col align="left" valign="top" span="1">
    220 <col align="left" valign="top" span="1">
    221 <col align="left" valign="top" span="1">
    222 </colgroup>
    223 <tbody>
    224 <tr valign="top">
    225 <td>Input Data</td>
    226 <td>
    227                         <code class="code">&lt;</code>
    228                      </td>
    229 <td>
    230                         <code class="code">t</code>
    231                      </td>
    232 <td>
    233                         <code class="code">a</code>
    234                      </td>
    235 <td>
    236                         <code class="code">g</code>
    237                      </td>
    238 <td>
    239                         <code class="code">/</code>
    240                      </td>
    241 <td>
    242                         <code class="code">&gt;</code>
    243                      </td>
    244 </tr>
    245 <tr valign="top">
    246 <td>ASCII</td>
    247 <td>
    248                         <code class="code">00111100</code>
    249                      </td>
    250 <td>
    251                         <code class="code">01110100</code>
    252                      </td>
    253 <td>
    254                         <code class="code">01100001</code>
    255                      </td>
    256 <td>
    257                         <code class="code">01100111</code>
    258                      </td>
    259 <td>
    260                         <code class="code">00101111</code>
    261                      </td>
    262 <td>
    263                         <code class="code">00111110</code>
    264                      </td>
    265 </tr>
    266 <tr valign="top">
    267 <td>Bit0</td>
    268 <td>
    269                         <code class="code">0</code>
    270                      </td>
    271 <td>
    272                         <code class="code">0</code>
    273                      </td>
    274 <td>
    275                         <code class="code">0</code>
    276                      </td>
    277 <td>
    278                         <code class="code">0</code>
    279                      </td>
    280 <td>
    281                         <code class="code">0</code>
    282                      </td>
    283 <td>
    284                         <code class="code">0</code>
    285                      </td>
    286 </tr>
    287 <tr valign="top">
    288 <td>Bit1</td>
    289 <td>
    290                         <code class="code">0</code>
    291                      </td>
    292 <td>
    293                         <code class="code">1</code>
    294                      </td>
    295 <td>
    296                         <code class="code">1</code>
    297                      </td>
    298 <td>
    299                         <code class="code">1</code>
    300                      </td>
    301 <td>
    302                         <code class="code">0</code>
    303                      </td>
    304 <td>
    305                         <code class="code">0</code>
    306                      </td>
    307 </tr>
    308 <tr valign="top">
    309 <td>Bit2</td>
    310 <td>
    311                         <code class="code">1</code>
    312                      </td>
    313 <td>
    314                         <code class="code">1</code>
    315                      </td>
    316 <td>
    317                         <code class="code">1</code>
    318                      </td>
    319 <td>
    320                         <code class="code">1</code>
    321                      </td>
    322 <td>
    323                         <code class="code">1</code>
    324                      </td>
    325 <td>
    326                         <code class="code">1</code>
    327                      </td>
    328 </tr>
    329 <tr valign="top">
    330 <td>Bit3</td>
    331 <td>
    332                         <code class="code">1</code>
    333                      </td>
    334 <td>
    335                         <code class="code">1</code>
    336                      </td>
    337 <td>
    338                         <code class="code">0</code>
    339                      </td>
    340 <td>
    341                         <code class="code">0</code>
    342                      </td>
    343 <td>
    344                         <code class="code">0</code>
    345                      </td>
    346 <td>
    347                         <code class="code">1</code>
    348                      </td>
    349 </tr>
    350 <tr valign="top">
    351 <td>Bit4</td>
    352 <td>
    353                         <code class="code">1</code>
    354                      </td>
    355 <td>
    356                         <code class="code">0</code>
    357                      </td>
    358 <td>
    359                         <code class="code">0</code>
    360                      </td>
    361 <td>
    362                         <code class="code">0</code>
    363                      </td>
    364 <td>
    365                         <code class="code">1</code>
    366                      </td>
    367 <td>
    368                         <code class="code">1</code>
    369                      </td>
    370 </tr>
    371 <tr valign="top">
    372 <td>Bit5</td>
    373 <td>
    374                         <code class="code">1</code>
    375                      </td>
    376 <td>
    377                         <code class="code">1</code>
    378                      </td>
    379 <td>
    380                         <code class="code">0</code>
    381                      </td>
    382 <td>
    383                         <code class="code">1</code>
    384                      </td>
    385 <td>
    386                         <code class="code">1</code>
    387                      </td>
    388 <td>
    389                         <code class="code">1</code>
    390                      </td>
    391 </tr>
    392 <tr valign="top">
    393 <td>Bit6</td>
    394 <td>
    395                         <code class="code">0</code>
    396                      </td>
    397 <td>
    398                         <code class="code">0</code>
    399                      </td>
    400 <td>
    401                         <code class="code">0</code>
    402                      </td>
    403 <td>
    404                         <code class="code">1</code>
    405                      </td>
    406 <td>
    407                         <code class="code">1</code>
    408                      </td>
    409 <td>
    410                         <code class="code">1</code>
    411                      </td>
    412 </tr>
    413 <tr valign="top">
    414 <td>Bit7</td>
    415 <td>
    416                         <code class="code">0</code>
    417                      </td>
    418 <td>
    419                         <code class="code">0</code>
    420                      </td>
    421 <td>
    422                         <code class="code">1</code>
    423                      </td>
    424 <td>
    425                         <code class="code">1</code>
    426                      </td>
    427 <td>
    428                         <code class="code">1</code>
    429                      </td>
    430 <td>
    431                         <code class="code">0</code>
    432                      </td>
    433 </tr>
    434 </tbody>
    435 </table>
    436 </div>
    437          </p>
    438 <p id="idp694048"> Depending on the features of a particular processor architecture, there are a number
    439             of algorithms for transposition to parallel bit stream form. Several of these algorithms
    440             employ a three-stage structure. In the first stage, the input byte stream is divided
    441             into a pair of half-length streams consisting of four bits for each byte, for example,
    442             one stream for the high nybble of each byte and another for the low nybble of each byte.
    443             In the second stage, these streams of four bits per byte are each divided into streams
    444             consisting of two bits per original byte, for example streams for the
    445             <code class="code">Bit0/Bit1</code>, <code class="code">Bit2/Bit3</code>, <code class="code">Bit4/Bit5</code>, and
    446                <code class="code">Bit6/Bit7</code> pairs. In the final stage, the streams are further subdivided
    447             in the individual bit streams. </p>
    448 <p id="idp695744"> Using SIMD capabilities, this process is quite efficient, with an amortized cost of
    449             1.1 CPU cycles per input byte on Intel Core 2 with SSE, or 0.6 CPU cycles per input byte
    450             on Power PC G4 with Altivec. With future advances in processor technology, this
    451             transposition overhead is expected to reduce, possibly taking advantage of upcoming
    452             parallel extract (pex) instructions on Intel technology. In the ideal, only 24
    453             instructions are needed to transform a block of 128 input bytes using 128-bit SSE
    454             registers using the inductive doubling instruction set architecture, representing an
    455             overhead of less than 0.2 instructions per input byte. </p>
    456 </div>
    457 <div class="section" id="idp696728">
    458 <h3 class="title" style="clear: both">General Streams</h3>
    459 <p id="idp697048">This section describes bit streams which support basic processing operations.</p>
    460 <div class="section" id="idp697328">
    461 <h4 class="title" style="clear: both">Deletion Mask Streams</h4>
    462 <p id="idp697648">DelMask (deletion mask) streams marks character code unit positions for deletion.
    463                Since the deletion operation is dependency free across many stages of XML processing,
    464                it is possible to simply mark and record deletion positions as deletion mask streams for future processing. A single
    465                invocation of a SIMD based parallel deletion algorithm can then perform the deletion of
    466                positions accumulated across a number of stages through a bitwise ORing of deletion
    467                masks. For example, deletion arises in the replacement of predefined entities with a
    468                single character, such as in the replacement of the &amp;amp; entity, with the
    469                &amp; character. Deletion also arises in XML
    470                end-of-line handling, and CDATA section delimeter processing. Several algorithms to
    471                delete bits at positions marked by DelMask are possible [<a class="xref" href="#u8u16" id="idp699152">Cameron 2008</a>]. </p>
    472 <p id="idp699464">The following table provides an example of generating a DelMask in the context of
    473                bit stream based parsing of well-formed character references and predefined entities.
    474                The result is the generation of a DelMask stream. <div class="table-wrapper" id="idp699848">
    475 <p class="title">Table II</p>
    476 <div class="caption"><p id="idp700104">DelMask Stream Generation</p></div>
    477 <table class="table">
    478 <colgroup span="1">
    479 <col align="left" valign="top" span="1">
    480 <col align="left" valign="top" span="1">
    481 </colgroup>
    482 <tbody>
    483 <tr valign="top">
    484 <td>Input Data</td>
    485 <td>
    486                            <code class="code">&amp;gt; &amp;#13; &amp;#x0a;</code>
    487                         </td>
    488 </tr>
    489 <tr valign="top">
    490 <td>GenRefs</td>
    491 <td>
    492                            <code class="code">_11______________</code>
    493                         </td>
    494 </tr>
    495 <tr valign="top">
    496 <td>DecRefs</td>
    497 <td>
    498                            <code class="code">_______11________</code>
    499                         </td>
    500 </tr>
    501 <tr valign="top">
    502 <td>HexRefs</td>
    503 <td>
    504                            <code class="code">______________11_</code>
    505                         </td>
    506 </tr>
    507 <tr valign="top">
    508 <td>DelMask</td>
    509 <td>
    510                            <code class="code">111__1111__11111_</code>
    511                         </td>
    512 </tr>
    513 <tr valign="top">
    514 <td>ErrorFlag</td>
    515 <td>
    516                            <code class="code">_________________</code>
    517                         </td>
    518 </tr>
    519 </tbody>
    520 </table>
    521 </div>
    522             </p>
    523 </div>
    524 <div class="section" id="idp707448">
    525 <h4 class="title" style="clear: both">Error Flag Streams </h4>
    526 <p id="idp707768">Error flag streams indicates the character code unit positions of syntactical
    527                errors. XML processing examples which benefit from the marking of error positions
    528                include UTF-8 character sequence validation and XML parsing [<a class="xref" href="#u8u16" id="idp708152">Cameron 2008</a>].</p>
    529 <p id="idp708464">The following table provides an example of using bit streams to parse character
    530                references and predefined entities which fail to meet the XML 1.0 well-formedness
    531                constraints. The result is the generation of an error flag stream that marks the
    532                positions of mal-formed decimal and hexical character references respectively. <div class="table-wrapper" id="idp708968">
    533 <p class="title">Table III</p>
    534 <div class="caption"><p id="idp709224">Error Flag Stream Generation</p></div>
    535 <table class="table">
    536 <colgroup span="1">
    537 <col align="left" valign="top" span="1">
    538 <col align="left" valign="top" span="1">
    539 </colgroup>
    540 <tbody>
    541 <tr valign="top">
    542 <td>Input Data</td>
    543 <td>
    544                            <code class="code">&amp;gt; &amp;#, &amp;#x; </code>
    545                         </td>
    546 </tr>
    547 <tr valign="top">
    548 <td>GenRefs</td>
    549 <td>
    550                            <code class="code">_11___________</code>
    551                         </td>
    552 </tr>
    553 <tr valign="top">
    554 <td>DecRefs</td>
    555 <td>
    556                            <code class="code">______________</code>
    557                         </td>
    558 </tr>
    559 <tr valign="top">
    560 <td>HexRefs</td>
    561 <td>
    562                            <code class="code">______________</code>
    563                         </td>
    564 </tr>
    565 <tr valign="top">
    566 <td>DelMask</td>
    567 <td>
    568                            <code class="code">111__11__111__</code>
    569                         </td>
    570 </tr>
    571 <tr valign="top">
    572 <td>ErrorFlag</td>
    573 <td>
    574                            <code class="code">_______1____1_</code>
    575                         </td>
    576 </tr>
    577 </tbody>
    578 </table>
    579 </div>
    580             </p>
    581 </div>
    582 </div>
    583 <div class="section" id="idp716704">
    584 <h3 class="title" style="clear: both">Lexical Item Streams</h3>
    585 <p id="idp717056">Lexical item streams differ from traditional streams of tokens in that they are bit
    586             streams that mark the positions of tokens, whitespace or delimiters. Additional bit
    587             streams, such as the reference streams and callout streams, are subsequently constructed
    588             based on the information held within the set of lexical items streams. Differentiation
    589             between the actual tokens that may occur at a particular point (e.g., the different XML
    590             tokens that begin “&lt;”) may be performed using multicharacter recognizers on the
    591             bytestream representation [<a class="xref" href="#CASCON08" id="idp718216">Cameron, Herdy and Lin 2008</a>].</p>
    592 <p id="idp718568">A key role of lexical item streams in XML parsing is to facilitate fast scanning
    593             operations. For example, a left angle bracket lexical item stream may be formed to
    594             identify those character code unit positions at which a “&lt;” character occurs.
    595             Hardware register bit scan operations may then be used by the XML parser on the left
    596             angle bracket stream to efficiently identify the position of the next “&lt;”. Based
    597             on the capabilities of current commodity processors, a single register bit scan
    598             operation may effectively scan up to 64 byte positions with a single instruction.</p>
    599 <p id="idp720152">Overall, the construction of the full set of lexical item stream computations
    600             requires approximately 1.0 CPU cycles per byte when implemented for 128 positions at a
    601             time using 128-bit SSE registers on Intel Core2 processors [<a class="xref" href="#CASCON08" id="idp720536">Cameron, Herdy and Lin 2008</a>].
    602             The following table defines the core lexical item streams defined by the Parabix XML
    603             parser.</p>
    604 <p id="idp721048">
    605             <div class="table-wrapper" id="idp721176">
    606 <p class="title">Table IV</p>
    607 <div class="caption"><p id="idp721432">Lexical item stream descriptions.</p></div>
    608 <table class="table"><tbody>
    609 <tr>
    610 <td align="left"> LAngle </td>
    611 <td align="left"> Marks the position of any left angle bracket character.</td>
    612 </tr>
    613 <tr>
    614 <td align="left"> RAngle </td>
    615 <td align="left"> Marks the position of any right angle bracket character.</td>
    616 </tr>
    617 <tr>
    618 <td align="left"> LBracket </td>
    619 <td align="left"> Marks the position of any left square bracker character.</td>
    620 </tr>
    621 <tr>
    622 <td align="left"> RBracket </td>
    623 <td align="left"> Marks the position of any right square bracket
    624                      character.</td>
    625 </tr>
    626 <tr>
    627 <td align="left"> Exclam </td>
    628 <td align="left"> Marks the position of any exclamation mark character.</td>
    629 </tr>
    630 <tr>
    631 <td align="left"> QMark </td>
    632 <td align="left"> Marks the position of any question mark character.</td>
    633 </tr>
    634 <tr>
    635 <td align="left"> Hyphen </td>
    636 <td align="left"> Marks the position of any hyphen character.</td>
    637 </tr>
    638 <tr>
    639 <td align="left"> Equals </td>
    640 <td align="left"> Marks the position of any equal sign character.</td>
    641 </tr>
    642 <tr>
    643 <td align="left"> SQuote </td>
    644 <td align="left"> Marks the position of any single quote character.</td>
    645 </tr>
    646 <tr>
    647 <td align="left"> DQuote </td>
    648 <td align="left"> Marks the position of any double quote character.</td>
    649 </tr>
    650 <tr>
    651 <td align="left"> Slash </td>
    652 <td align="left"> Marks the position of any forward slash character</td>
    653 </tr>
    654 <tr>
    655 <td align="left"> NameScan </td>
    656 <td align="left"> Marks the position of any XML name character.</td>
    657 </tr>
    658 <tr>
    659 <td align="left"> WS </td>
    660 <td align="left"> Marks the position of any XML 1.0 whitespace character.</td>
    661 </tr>
    662 <tr>
    663 <td align="left"> PI_start </td>
    664 <td align="left"> Marks the position of the start of any processing instruction
    665                         at the '?' character position.</td>
    666 </tr>
    667 <tr>
    668 <td align="left"> PI_end </td>
    669 <td align="left"> Marks the position of any end of any processing instruction
    670                         at the '&gt;' character position.</td>
    671 </tr>
    672 <tr>
    673 <td align="left"> CtCD_start </td>
    674 <td align="left"> Marks the position of the start of any comment or CDATA
    675                         section at the '!' character position.</td>
    676 </tr>
    677 <tr>
    678 <td align="left"> EndTag_start </td>
    679 <td align="left"> Marks the position of any end tag at the '/' character
    680                         position.</td>
    681 </tr>
    682 <tr>
    683 <td align="left"> CD_end </td>
    684 <td align="left"> Marks the position of the end of any CDATA section at the '&gt;'
    685                         character position. </td>
    686 </tr>
    687 <tr>
    688 <td align="left"> DoubleHyphen </td>
    689 <td align="left"> Marks the position of any double hyphen character.</td>
    690 </tr>
    691 <tr>
    692 <td align="left"> RefStart </td>
    693 <td align="left"> Marks the position of any ampersand character.</td>
    694 </tr>
    695 <tr>
    696 <td align="left"> Hash </td>
    697 <td align="left"> Marks the position of any hash character.</td>
    698 </tr>
    699 <tr>
    700 <td align="left"> x </td>
    701 <td align="left"> Marks the position of any 'x' character.</td>
    702 </tr>
    703 <tr>
    704 <td align="left"> Digit </td>
    705 <td align="left"> Marks the position of any digit.</td>
    706 </tr>
    707 <tr>
    708 <td align="left"> Hex </td>
    709 <td align="left"> Marks the position of any hexidecimal character.</td>
    710 </tr>
    711 <tr>
    712 <td align="left"> Semicolon </td>
    713 <td align="left"> Marks the position of any semicolon character.</td>
    714 </tr>
    715 </tbody></table>
    716 </div>
    717          </p>
    718 <p id="idp745408"> The following illustrates a number of the lexical item streams. </p>
    719 <p id="idp745672">
    720             <div class="table-wrapper" id="idp745800">
    721 <p class="title">Table V</p>
    722 <div class="caption"><p id="idp746056">Lexical Item Streams</p></div>
    723 <table class="table">
    724 <colgroup span="1">
    725 <col align="left" valign="top" span="1">
    726 <col align="left" valign="top" span="1">
    727 </colgroup>
    728 <tbody>
    729 <tr valign="top">
    730 <td>Input Data</td>
    731 <td>
    732                         <code class="code">&lt;tag&gt;&lt;tag&gt; text &amp;lt;
    733                            &amp;#x3e; &lt;/tag&gt;&lt;/tag&gt;</code>
    734                      </td>
    735 </tr>
    736 <tr valign="top">
    737 <td>LAngle</td>
    738 <td>
    739                         <code class="code">1____1______________________1_____1_____</code>
    740                      </td>
    741 </tr>
    742 <tr valign="top">
    743 <td>RAngle</td>
    744 <td>
    745                         <code class="code">____1____1_______________________1_____1</code>
    746                      </td>
    747 </tr>
    748 <tr valign="top">
    749 <td>WS</td>
    750 <td>
    751                         <code class="code">__________1____1____1______1____________</code>
    752                      </td>
    753 </tr>
    754 <tr valign="top">
    755 <td>RefStart</td>
    756 <td>
    757                         <code class="code">________________1____1__________________</code>
    758                      </td>
    759 </tr>
    760 <tr valign="top">
    761 <td>Hex</td>
    762 <td>
    763                         <code class="code">__1____1____1___________11_____1_____1__</code>
    764                      </td>
    765 </tr>
    766 <tr valign="top">
    767 <td>Semicolon</td>
    768 <td>
    769                         <code class="code">___________________1______1_____________</code>
    770                      </td>
    771 </tr>
    772 <tr valign="top">
    773 <td>Slash</td>
    774 <td>
    775                         <code class="code">_____________________________1_____1____</code>
    776                      </td>
    777 </tr>
    778 </tbody>
    779 </table>
    780 </div>
    781          </p>
    782 </div>
    783 <div class="section" id="idp755744">
    784 <h3 class="title" style="clear: both">UTF-8 Byte Classification, Scope and Validation Streams</h3>
    785 <p id="idp756128"> An XML parser must accept the UTF-8 encoding of Unicode [<a class="xref" href="#XML10" id="idp756320">XML 1.0</a>].
    786             It is a fatal error if an XML document determined to be in UTF-8 contains byte sequences
    787             that are not legal in that encoding. UTF-8 byte classification, scope, XML character
    788             validation and error flag bit streams are defined to validate UTF-8 byte sequences and
    789             support transcoding to UTF-16.</p>
    790 <div class="section" id="idp757104">
    791 <h4 class="title" style="clear: both">UTF-8 Byte Classification Streams</h4>
    792 <p id="idp757464">UTF-8 byte classification bit streams classify UTF-8 bytes based on their role in
    793                forming single and multibyte sequences. The u8Prefix and u8Suffix bit streams
    794                identify bytes that represent, respectively, prefix or suffix bytes of multibyte
    795                UTF-8 sequences. The u8UniByte bit stream identifies those bytes that may be
    796                considered single-byte sequences. The u8Prefix2, u8Prefix3, and u8Prefix4 refine the
    797                u8Prefix respectively indicating prefixes of two, three or four byte
    798             sequences respectively.</p>
    799 </div>
    800 <div class="section" id="idp758312">
    801 <h4 class="title" style="clear: both">UTF-8 Scope Streams</h4>
    802 <p id="idp758656"> Scope streams represent expectations established by UTF-8 prefix bytes. For
    803                example, the u8Scope22 bit stream represents the positions at which the second byte of a
    804                two-byte sequence is expected based on the occurrence of a two-byte prefix in the
    805                immediately preceding position. The u8scope32, u8Scope33, u8Scope42, u8scope43, and
    806                u8Scope44 complete the set of UTF-8 scope streams.</p>
    807 <p id="idp477976"> The following example demonstrates the UTF-8 character encoding validation
    808                process using parallel bit stream techniques. The result of this validation process
    809                is an error flag stream identifying the positions at which errors occur.</p>
    810 <p id="idp478440">
    811                <div class="table-wrapper" id="idp478568">
    812 <p class="title">Table VI</p>
    813 <div class="caption"><p id="idp478824">UTF-8 Scope Streams</p></div>
    814 <table class="table">
    815 <colgroup span="1">
    816 <col align="left" valign="top" span="1">
    817 <col align="left" valign="top" span="1">
    818 </colgroup>
    819 <tbody>
    820 <tr valign="top">
    821 <td>Input Data</td>
    822 <td><code class="code">A Text in Farsi: ى ك  Ù
    823 Â ØªÂ Ù†Â Â ÙÂ Ø§Â Ø±Â Ø³Â Ù‰</code></td>
    824 </tr>
    825 <tr valign="top">
    826 <td>High Nybbles</td>
    827 <td>
    828                            <code class="code">42567726624677632D8DBDBDAD82D8DAD82D8D8</code>
    829                         </td>
    830 </tr>
    831 <tr valign="top">
    832 <td>Low Nybbles</td>
    833 <td>
    834                            <code class="code">10458409E061239A099838187910968A9509399</code>
    835                         </td>
    836 </tr>
    837 <tr valign="top">
    838 <td>u8Unibyte</td>
    839 <td>
    840                            <code class="code">11111111111111111__________1______1____</code>
    841                            
    842                         </td>
    843 </tr>
    844 <tr valign="top">
    845 <td>u8Prefix</td>
    846 <td>
    847                            <code class="code">_________________1_1_1_1_1__1_1_1__1_1_</code>
    848                         </td>
    849 </tr>
    850 <tr valign="top">
    851 <td>u8Suffix</td>
    852 <td>
    853                            <code class="code">__________________1_1_1_1_1__1_1_1__1_1</code>
    854                         </td>
    855 </tr>
    856 <tr valign="top">
    857 <td>u8Prefix2</td>
    858 <td>
    859                            <code class="code">_________________1_1_1_1_1__1_1_1__1_1_</code>
    860                         </td>
    861 </tr>
    862 <tr valign="top">
    863 <td>u8Scope22</td>
    864 <td>
    865                            <code class="code">__________________1_1_1_1_1__1_1_1__1_1</code>
    866                            
    867                         </td>
    868 </tr>
    869 <tr valign="top">
    870 <td>ErrorFlag</td>
    871 <td>
    872                            <code class="code">_______________________________________</code>
    873                         </td>
    874 </tr>
    875 </tbody>
    876 </table>
    877 </div>
    878                
    879 
    880             </p>
    881 </div>
    882 <div class="section" id="idp778672">
    883 <h4 class="title" style="clear: both">UTF-8 Validation Streams</h4>
    884 <p id="idp779024"> Proper formation of UTF-8 byte sequences requires that the correct number of
    885                suffix bytes always follow a UTF-8 prefix byte, and that certain illegal byte
    886                combinations are ruled out. For example, sequences beginning with the prefix bytes
    887                0xF5 through 0xFF are illegal as they would represent code point values above 10FFFF.
    888                In addition, there are constraints on the first suffix byte following certain special
    889                prefixes, namely that a suffix following the prefix 0xE0 must fall in the range
    890                0xA0–0xBF, a suffix following the prefix 0xED must fall in the range 0x80–0x9F, a
    891                suffix following the prefix 0xF0 must fall in the range 0x90–0xBF and a suffix
    892                following the prefix 0xF4 must fall in the range 0x80–0x8F. The task of ensuring that
    893                each of these constraints hold is known as UTF-8 validation. The bit streams xE0,
    894                xED, xF0, xF4, xA0_xBF, x80_x9F, x90_xBF, and x80_x8F are constructed to flag the
    895                aforementioned UTF-8 validation errors. The result of UTF-8 validation is a UTF-8
    896                error flag bit stream contructed as the ORing of a series of UTF-8 validation tests.
    897             </p>
    898 </div>
    899 <div class="section" id="idp781056">
    900 <h4 class="title" style="clear: both">XML Character Validation Streams</h4>
    901 <p id="idp781416">The UTF-8 character sequences <span class="ital">0xEF 0xBF 0xBF</span> and
    902                   <span class="ital">0xEF 0xBF 0xBE</span> correspond to the Unicode code points 0xFFFE
    903                and 0xFFFF respectively. In XML 1.0, 0xFFFE and 0xFFFF represent characters outside
    904                the legal XML character ranges. As such, bit streams which mark 0xEF, 0xBF, and 0xBE
    905                character are constructed to flag illegal UTF-8 character sequences. </p>
    906 </div>
    907 <div class="section" id="idp782512">
    908 <h4 class="title" style="clear: both">UTF-8 to UTF-16 Transcoding</h4>
    909 <p id="idp782864">UTF-8 is often preferred for storage and data exchange, it is suitable for
    910                processing, but it is significantly more complex to process than UTF-16 [<a class="xref" href="#Unicode" id="idp783160">Unicode</a>]. As such, XML documents are typically encoded in UTF-8 for
    911                serialization and transport, and subsequently transcoded to UTF-16 for processing
    912                with programming languages such as Java and C#. Following the parallel bit stream
    913                methods developed for the u8u16 transcoder, a high-performance standalone UTF-8 to
    914                UTF-16 transcoder [<a class="xref" href="#u8u16" id="idp783920">Cameron 2008</a>], transcoding to UTF-16 may be achieved by
    915                computing a series of 16 bit streams. One stream for each of the individual bits of a
    916                UTF-16 code unit. </p>
    917 <p id="idp784488">The bit streams for UTF-16 are conveniently divided into groups: the eight streams
    918                u16Hi0, u16Hi1, ..., u16Hi7 for the high byte of each UTF-16 code unit and the eight
    919                streams u16Lo1, ..., u16Lo7 for the low byte. Upon conversion of the parallel bit
    920                stream data back to byte streams, eight sequential byte streams U16h0, U16h1, ...,
    921                U16Hi7 are used for the high byte of each UTF-16 code unit, while U16Lo0, U16Lo1,...,
    922                U16Lo7 are used for the corresponding low byte. Interleaving these streams then
    923                produces the full UTF-16 doublebyte stream.</p>
    924 </div>
    925 <div class="section" id="idp785384">
    926 <h4 class="title" style="clear: both">UTF-8 Indexed UTF-16 Streams</h4>
    927 <p id="idp785744">UTF-16 bit streams are initially defined in UTF-8 indexed form. That is, with sets
    928                of bits in one-to-one correspondence with UTF-8 bytes. However, only one set of
    929                UTF-16 bits is required for encoding two or three-byte UTF-8 sequences and only two
    930                sets are required for surrogate pairs corresponding to four-byte UTF-8 sequences. The
    931                u8LastByte (u8UniByte , u8Scope22 , u8Scope33 , and u8Scope44 ) and u8Scope42 streams
    932                mark the positions at which the correct UTF-16 bits are computed. The bit sets at
    933                other positions must be deleted to compress the streams to the UTF-16 indexed form.
    934             </p>
    935 </div>
    936 </div>
    937 <div class="section" id="idp786760">
    938 <h3 class="title" style="clear: both">Control Character Streams</h3>
    939 <p id="idp787112">The control character bit streams marks ASCII control characters in the range
    940             0x00-0x1F. Additional control character bit streams mark the tab, carriage return, line
    941             feed, and space character. In addition, a bit stream to mark carriage return line
    942             combinations is also constructed. Presently, control character bit streams support the
    943             operations of XML 1.0 character validation and XML end-of-line handling.</p>
    944 <div class="section" id="idp787768">
    945 <h4 class="title" style="clear: both">XML Character Validation</h4>
    946 <p id="idp788120">Legal characters in XML are the tab, carriage return, and line feed characters,
    947                together with all Unicode characters and excluding the surrogate blocks, as well as hexadecimal OxFFFE and
    948                OxFFFF [<a class="xref" href="#XML10" id="idp788480">XML 1.0</a>]. The x00_x1F bit stream is constructed and used in
    949                combination with the additional control character bit streams to flags the positions
    950                of illegal control characters.</p>
    951 </div>
    952 <div class="section" id="idp789136">
    953 <h4 class="title" style="clear: both">XML 1.0 End-of-line Handling</h4>
    954 <p id="idp789496">In XML 1.0 the two-character sequence CR LF (carriage return, line feed) as well as
    955                any CR character not followed by a LF character must be converted to a single LF
    956                character [<a class="xref" href="#XML10" id="idp789840">XML 1.0</a>].</p>
    957 <p id="idp790224">By defining carriage return, line feed, and carriage return line feed bit streams,
    958                dentoted CR, LF and CRLF respectively, end-of-line normalization processing can be
    959                performed in parallel using only a small number of logical and shift operations.</p>
    960 <p id="idp790704"></p>
    961 <p id="idp790832">The following example demonstrates the generation of the CRLF deletion mask. In
    962                this example, the position of all CR characters followed by LF characters are marked
    963                for deletion. Isolated carriage returns are then replaced with LF characters.
    964                Completion of this process satisfies the XML 1.0 end-of-line handling requirements.
    965                For clarity, this example encodes input data carriage returns as
    966                <span class="ital">C</span> characters, whereas line feed characters are shown as
    967                   <span class="ital">L</span> characters.</p>
    968 <p id="idp792008">
    969                <div class="table-wrapper" id="idp792136">
    970 <p class="title">Table VII</p>
    971 <div class="caption"><p id="idp792392">XML 1.0 End-of-line Handling</p></div>
    972 <table class="table">
    973 <colgroup span="1"><col align="left" valign="top" span="1"></colgroup>
    974 <tbody>
    975 <tr valign="top">
    976 <td>Input Data</td>
    977 <td>
    978                            <code class="code">first line C second line CL third line L one more C nothing
    979                            left</code>
    980                         </td>
    981 </tr>
    982 <tr valign="top">
    983 <td>CR</td>
    984 <td>
    985                            <code class="code">-----------1-------------1------------------------1-------------</code>
    986                         </td>
    987 </tr>
    988 <tr valign="top">
    989 <td>LF</td>
    990 <td>
    991                            <code class="code">--------------------------1------------1------------------------</code>
    992                         </td>
    993 </tr>
    994 <tr valign="top">
    995 <td>DelMask</td>
    996 <td>
    997                            <code class="code">--------------------------1-------------------------------------</code>
    998                         </td>
    999 </tr>
    1000 </tbody>
    1001 </table>
    1002 </div>
    1003 
    1004             </p>
    1005 </div>
    1006 </div>
    1007 <div class="section" id="idp797864">
    1008 <h3 class="title" style="clear: both">Call Out Streams</h3>
    1009 <p id="idp798208"> Call out bit streams mark the extents of XML markup structures such as comments,
    1010             processing instruction and CDATA sections as well as physical structures such as character and
    1011             entity references and general references.  Call out streams are also formed for logical markup structures such
    1012             start tags, end tags and empty element tags. </p>
    1013 <div class="section" id="idp798776">
    1014 <h4 class="title" style="clear: both">Comment, Processing Instruction and CDATA Section Call Out Streams</h4>
    1015 <p id="idp799168">Comments, processing instructions and CDATA sections call out streams, Ct_Span,
    1016                PI_Span and CD_Span respectively, define sections of an XML document which
    1017                contain markup that is not interpreted by an XML processor. As such, the union of
    1018                Ct_Span, PI_Span and CD_Span streams defines the regions of non-interpreteable markup.
    1019                The stream formed by this union is termed the CtCDPI_Mask.</p>
    1020 <p id="idp799816">The following tables provides an example of constructing the CtCDPI_Mask. </p>
    1021 <div class="table-wrapper" id="idp800088">
    1022 <p class="title">Table VIII</p>
    1023 <div class="caption"><p id="idp800344">CtCDPI Mask Generation</p></div>
    1024 <table class="table">
    1025 <colgroup span="1">
    1026 <col align="left" valign="top" span="1">
    1027 <col align="left" valign="top" span="1">
    1028 </colgroup>
    1029 <tbody>
    1030 <tr valign="top">
    1031 <td>Input Data</td>
    1032 <td><code class="code">&lt;?php?&gt; &lt;!-- example --&gt; &lt;![CDATA[ shift: a&lt;&lt;1 ]]&gt;</code></td>
    1033 </tr>
    1034 <tr valign="top">
    1035 <td>CD_Span</td>
    1036 <td><code class="code">___________________________1111111111111111111111_</code></td>
    1037 </tr>
    1038 <tr valign="top">
    1039 <td>Ct_Span</td>
    1040 <td><code class="code">___________111111111111___________________________</code></td>
    1041 </tr>
    1042 <tr valign="top">
    1043 <td>PI_Span</td>
    1044 <td><code class="code">_11111____________________________________________</code></td>
    1045 </tr>
    1046 <tr valign="top">
    1047 <td>CtCDPI_Mask</td>
    1048 <td><code class="code">_111111__111111111111111__111111111111111111111111</code></td>
    1049 </tr>
    1050 <tr valign="top">
    1051 <td>ErrorFlag</td>
    1052 <td><code class="code">__________________________________________________</code></td>
    1053 </tr>
    1054 </tbody>
    1055 </table>
    1056 </div>
    1057 <p id="idp805984"> With the removal of all non-interpreteable markup, several phases of parallel bit
    1058                stream based SIMD operations may follow operating on up to 128 byte positions on
    1059                current commondity processors and assured of XML markup relevancy. For
    1060                example, with the extents identification of comments, processing instructions and
    1061                CDATA secions, XML names may be identified and length sorted for efficient symbol
    1062                table construction. </p>
    1063 <p id="idp806680"> As an aside, comments and CDATA sections must first be validated to ensure
    1064                that comments do not contain "--" sequences and that CDATA sections do not contain illegal
    1065                "]]&gt;" sequences prior to ignorable markup stream generation.</p>
    1066 </div>
    1067 <div class="section" id="idp807344">
    1068 <h4 class="title" style="clear: both">Reference Call Out Streams</h4>
    1069 <p id="idp807696">The reference call out streams are the GenRefs, DecRefs, and HexRefs streams. This
    1070                subset of the call out streams marks the extents of all but the closing semicolon of
    1071                general and character references.</p>
    1072 <p id="idp808128">Predefined character
    1073                (&amp;lt;,&amp;gt;,&amp;amp;,&amp;apos;,&amp;quot;) and numeric character
    1074                references (&amp;#nnnn;, &amp;#xhhhh;) must be replaced by a single character
    1075                   [<a class="xref" href="#XML10" id="idp808304">XML 1.0</a>]. As previously shown, this subset of call out streams enables the construction of a DelMask for
    1076                references.</p>
    1077 </div>
    1078 <div class="section" id="idp809320">
    1079 <h4 class="title" style="clear: both">Tag Call Out Streams</h4>
    1080 <p id="idp809672">Whereas sequential bit scans over lexical item streams form the basis of XML
    1081                parsing, in the current Parabix parser a new method of parallel parsing has been
    1082                developed and prototyped using the concept of bitstream addition. Fundamental to this
    1083                method is the concept of a <span class="ital">cursor</span> stream, a bit stream marking
    1084                the positions of multiple parallel parses currently in process. </p>
    1085 <p id="idp810520">The results of parallel parsing using the bit stream addition technique produces a
    1086                set of tag call out bit streams. These streams mark the extents of each start tag,
    1087                end tag and empty element tag. Within tags, additional streams mark start
    1088                and end positions for tag names, as well as attribute names and values. An error flag
    1089                stream marks the positions of any syntactic errors encountered during parsing.</p>
    1090 <p id="idp811648"> The set of tag call out streams consists of the ElemNames, AttNames, AttVals, Tags,
    1091                EmptyTagEnds and EndTags bit streams. The following example demonstrates the bit
    1092                stream output produced which from parallel parsing using bit stream addition. </p>
    1093 <div class="table-wrapper" id="idp812120">
    1094 <p class="title">Table IX</p>
    1095 <div class="caption"><p id="idp812376">Tag Call Out Streams</p></div>
    1096 <table class="table">
    1097 <colgroup span="1">
    1098 <col align="left" valign="top" span="1">
    1099 <col align="left" valign="top" span="1">
    1100 </colgroup>
    1101 <tbody>
    1102 <tr valign="top">
    1103 <td>Input Data</td>
    1104 <td>
    1105                         <code class="code">&lt;root&gt;&lt;t1&gt;text&lt;/t1&gt;&lt;t2
    1106                            a1='foo' a2 =
    1107                            'fie'&gt;more&lt;/t2&gt;&lt;tag3
    1108                            att3='b'/&gt;&lt;/root&gt;</code>
    1109                      </td>
    1110 </tr>
    1111 <tr valign="top">
    1112 <td>ElemNames</td>
    1113 <td>
    1114                         <code class="code">_1111__11___________11_______________________________1111__________________</code>
    1115                      </td>
    1116 </tr>
    1117 <tr valign="top">
    1118 <td>AttNames</td>
    1119 <td>
    1120                         <code class="code">_______________________11_______11________________________1111_____________</code>
    1121                      </td>
    1122 </tr>
    1123 <tr valign="top">
    1124 <td>AttrVals</td>
    1125 <td>
    1126                         <code class="code">__________________________11111______11111_____________________111_________</code>
    1127                      </td>
    1128 </tr>
    1129 <tr valign="top">
    1130 <td>EmptyTagEnds</td>
    1131 <td>
    1132                         <code class="code">___________________________________________________________________1_______</code>
    1133                      </td>
    1134 </tr>
    1135 <tr valign="top">
    1136 <td>EndTags</td>
    1137 <td>
    1138                         <code class="code">_______________111______________________________111__________________11111_</code>
    1139                      </td>
    1140 </tr>
    1141 <tr valign="top">
    1142 <td>Start/EmptyTags</td>
    1143 <td>
    1144                         <code class="code">_1111__11___________1111111111111111111111___________11111111111111________</code>
    1145                      </td>
    1146 </tr>
    1147 <tr valign="top">
    1148 <td>ErrorFlag</td>
    1149 <td>
    1150                         <code class="code">___________________________________________________________________________</code>
    1151                      </td>
    1152 </tr>
    1153 </tbody>
    1154 </table>
    1155 </div>
    1156 </div>
    1157 </div>
    1158 </div>
    1159 <div class="section" id="idp822632">
    1160 <h2 class="title" style="clear: both">SIMD Beyond Bitstreams: Names and Numbers</h2>
    1161 <p id="idp822952">Whereas the fundamental innovation of our work is the use of SIMD technology in
    1162          implementing parallel bit streams for XML, there are also important ways in which more
    1163          traditional byte-oriented SIMD operations can be useful in accelerating other aspects of
    1164          XML processing.</p>
    1165 <div class="section" id="idp823448">
    1166 <h3 class="title" style="clear: both">Name Lookup</h3>
    1167 <p id="idp823784">Efficient symbol table mechanisms for looking up element and attribute names is
    1168             important for almost all XML processing applications. It is also an important technique
    1169             merely for assessing well-formedness of an XML document; rather than validating the
    1170             character-by-character composition of each occurrence of an XML name as it is
    1171             encountered, it is more efficient to validate all but the first occurrence by first
    1172             determining whether the name already exists in a table of prevalidated names.</p>
    1173 <p id="idp825088">The first symbol table mechanism deployed in the Parabix parser simply used the
    1174             hashmaps of the C++ standard template library, without deploying any SIMD technology.
    1175             However, with the overhead of character validation, transcoding and parsing dramatically
    1176             reduced by parallel bit stream technology, we found that symbol lookups then accounted
    1177             for about half of the remaining execution time in a statistics gathering application
    1178                [<a class="xref" href="#CASCON08" id="idp825712">Cameron, Herdy and Lin 2008</a>]. Thus, symbol table processing was identified as a major
    1179             target for further performance improvement. </p>
    1180 <p id="idp826232"> Our first effort to improve symbol table performance was to employ the splash tables
    1181             with cuckoo hashing as described by Ross [<a class="xref" href="#Ross06" id="idp826512">Ross 2006</a>], using SIMD
    1182             technology for parallel bucket processing. Although this technique did turn out to have
    1183             the advantage of virtually constant-time performance even for very large vocabularies,
    1184             it was not particularly helpful for the relatively small vocabularies typically found in
    1185             XML document processing. </p>
    1186 <p id="idp827288"> However, a second approach has been found to be quite useful, taking advantage of
    1187             parallel bit streams for cheap determination of symbol length. In essence, the length of
    1188             a name can be determined very cheaply using a single bit scan operation. This then makes
    1189             it possible to use length-sorted symbol table processing, as follows. First, the
    1190             occurrences of all names are stored in arrays indexed by length. Then the length-sorted
    1191             arrays may each be inserted into the symbol table in turn. The advantage of this is that
    1192             a separate loop may be written for each length. Length sorting makes for very efficient
    1193             name processing. For example hash value computations and name comparisons can be made by
    1194             loading multibyte values and performing appropriate shifting and masking operations,
    1195             without the need for a byte-at-a-time loop. In initial experiments, this length-sorting
    1196             approach was found to reduce symbol lookup cost by a factor of two. </p>
    1197 <p id="idp828544"> Current research includes the application of SIMD technology to further enhance the
    1198             performance of length-sorted lookup. We have identified a promising technique for
    1199             parallel processing of multiple name occurrences using a parallel trie lookup technique.
    1200             Given an array of occurrences of names of a particular length, the first one, two or
    1201             four bytes of each name are gathered and stored in a linear array. SIMD techniques are
    1202             then used to compare these prefixes with the possible prefixes for the current position
    1203             within the trie. In general, a very small number of possibilities exist for each trie
    1204             node, allowing for fast linear search through all possibilities. Typically, the
    1205             parallelism is expected to exceed the number of possibilities to search through at each
    1206             node. With length-sorting to separate the top-level trie into many small subtries, we
    1207             expect only a single step of symbol lookup to be needed in most practical instances. </p>
    1208 <p id="idp829808">The gather step of this algorithm is actually a common technique in SIMD processing.
    1209             Instruction set support for gather operations is a likely future direction for SIMD
    1210             technology.</p>
    1211 </div>
    1212 <div class="section" id="idp830280">
    1213 <h3 class="title" style="clear: both">Numeric Processing</h3>
    1214 <p id="idp830624"> Many XML applications involve numeric data fields as attribute values or element
    1215             content. Although most current XML APIs uniformly return information to applications in
    1216             the form of character strings, it is reasonable to consider direct API support for
    1217             numeric conversions within a high-performance XML engine. With string to numeric
    1218             conversion such a common need, why leave it to application programmers? </p>
    1219 <p id="idp831280"> High-performance string to numeric conversion using SIMD operations also can
    1220             considerably outperform the byte-at-a-time loops that most application programmers or
    1221             libraries might employ. A first step is reduction of ASCII bytes to corresponding
    1222             decimal nybbles using a SIMD packing operation. Then an inductive doubling algorithm
    1223             using SIMD operations may be employed. First, 16 sets of adjacent nybble values in the
    1224             range 0-9 can be combined in just a few SIMD operations to 16 byte values in the range
    1225             0-99. Then 8 sets of byte values may similarly be combined with further SIMD processing
    1226             to produce doublebyte values in the range 0-9999. Further combination of doublebyte
    1227             values into 32-bit integers and so on can also be performed using SIMD operations. </p>
    1228 <p id="idp832336"> Using appropriate gather operations to bring numeric strings into appropriate array
    1229             structures, an XML engine could offer high-performance numeric conversion services to
    1230             XML application programmers. We expect this to be an important direction for our future
    1231             work, particularly in support of APIs that focus on direct conversion of XML data into
    1232             business objects. </p>
    1233 </div>
    1234 </div>
    1235 <div class="section" id="idp833080">
    1236 <h2 class="title" style="clear: both">APIs and Parallel Bit Streams</h2>
    1237 <div class="section" id="idp833440">
    1238 <h3 class="title" style="clear: both">The ILAX Streaming API</h3>
    1239 <p id="idp833792">The In-Line API for XML (ILAX) is the base API provided with the Parabix parser. It
    1240             is intended for low-level extensions compiled right into the engine, with minimum
    1241             possible overhead. It is similar to streaming event-based APIs such as SAX, but
    1242             implemented by inline substitution rather than using callbacks. In essence, an extension
    1243             programmer provides method bodies for event-processing methods declared internal to the
    1244             Parabix parsing engine, compiling the event processing code directly with the core code
    1245             of the engine. </p>
    1246 <p id="idp834592"> Although ILAX can be used directly for application programming, its primary use is
    1247             for implementing engine extensions that support higher-level APIs. For example, the
    1248             implementation of C or C++ based streaming APIs based on the Expat [<a class="xref" href="#Expat" id="idp834992">Expat</a>] or general SAX models can be quite directly implemented. C/C++ DOM
    1249             or other tree-based APIs can also be fairly directly implemented. However, delivering
    1250             Parabix performance to Java-based XML applications is challenging due to the
    1251             considerable overhead of crossing the Java Native Interface (JNI) boundary. This issue
    1252             is addressed with the Array Set Model (ASM) concept discussed in the following section. </p>
    1253 <p id="idp835872"> With the recent development of parallel parsing using bitstream addition, it is
    1254             likely that the underlying ILAX interface of Parabix will change. In essence, ILAX
    1255             suffers the drawback of all event-based interfaces: they are fundamentally sequential in
    1256             number. As research continues, we expect efficient parallel methods building on parallel
    1257             bit stream foundations to move up the stack of XML processing requirements. Artificially
    1258             imposing sequential processing is thus expected to constrain further advances in XML
    1259             performance. </p>
    1260 </div>
    1261 <div class="section" id="idp836736">
    1262 <h3 class="title" style="clear: both">Efficient XML in Java Using Array Set Models</h3>
    1263 <p id="idp837112"> In our GML-to-SVG case study, we identified the lack of high-performance XML
    1264             processing solutions for Java to be of particular interest. Java byte code does not
    1265             provide access to the SIMD capabilities of the underlying machine architecture. Java
    1266             just-in-time compilers might be capable of using some SIMD facilities, but there is no
    1267             real prospect of conventional compiler technology translating byte-at-a-time algorithms
    1268             into parallel bit stream code. So the primary vehicle for delivering high-performance
    1269             XML processing is to call native parallel bit stream code written in C through JNI
    1270             capabilities. </p>
    1271 <p id="idp838000">However, each JNI call is expensive, so it is desirable to minimize the number of
    1272             calls and get as much work done during each call as possible. This mitigates against
    1273             direct implementation of streaming APIs in Java through one-to-one mappings to an
    1274             underlying streaming API in C. Instead, we have concentrated on gathering information on
    1275             the C side into data structures that can then be passed to the Java side. However, using
    1276             either C pointer-based structures or C++ objects is problematic because these are
    1277             difficult to interpret on the Java side and are not amenable to Java's automatic storage
    1278             management system. Similarly, Java objects cannot be conveniently created on the C side.
    1279             However, it is possible to transfer arrays of simple data values (bytes or integers)
    1280             between C and Java, so that makes a reasonable focus for bulk data communication between
    1281             C and Java. </p>
    1282 <p id="idp839192"><span class="ital">Array Set Models</span> are array-based representations of information
    1283             representing an XML document in accord with XML InfoSet [<a class="xref" href="#InfoSet" id="idp839600">XML Infoset</a>] or
    1284             other XML data models relevant to particular APIs. As well as providing a mechanism for
    1285             efficient bulk data communication across the JNI boundary, ASMs potentially have a
    1286             number of other benefits in high-performance XML processing. <div class="itemizedlist" id="idp840256"><ul>
    1287 <li id="idp840384"><p id="idp840512">Prefetching. Commodity processors commonly support hardware and/or software
    1288                      prefetching to ensure that data is available in a processor cache when it is
    1289                      needed. In general, prefetching is most effective in conjunction with the
    1290                      continuous sequential memory access patterns associated with array
    1291                   processing.</p></li>
    1292 <li id="idp841160"><p id="idp841288">DMA. Some processing environments provide Direct Memory Access (DMA)
    1293                      controllers for block data movement in parallel with computation. For example,
    1294                      the Cell Broadband Engine uses DMA controllers to move the data to and from the
    1295                      local stores of the synergistic processing units. Arrays of contiguous data
    1296                      elements are well suited to bulk data movement using DMA.</p></li>
    1297 <li id="idp842000"><p id="idp842128">SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern
    1298                      processor instruction sets allow simultaneous application of particular
    1299                      instructions to sets of elements from parallel arrays. For effective use of
    1300                      SIMD capabilities, an SoA (Structure of Arrays) model is preferrable to an AoS
    1301                      (Array of Structures) model. </p></li>
    1302 <li id="idp842800"><p id="idp842928">Multicore processors. Array-oriented processing can enable the effective
    1303                      distribution of work to the individual cores of a multicore system in two
    1304                      distinct ways. First, provided that sequential dependencies can be minimized or
    1305                      eliminated, large arrays can be divided into separate segments to be processed
    1306                      in parallel on each core. Second, pipeline parallelism can be used to implement
    1307                      efficient multipass processing with each pass consisting of a processing kernel
    1308                      with array-based input and array-based output. </p></li>
    1309 <li id="idp844472"><p id="idp844600">Streaming buffers for large XML documents. In the event that an XML document
    1310                      is larger than can be reasonably represented entirely within processor memory,
    1311                      a buffer-based streaming model can be applied to work through a document using
    1312                      sliding windows over arrays of elements stored in document order. </p></li>
    1313 </ul></div>
    1314          </p>
    1315 <div class="section" id="idp845360">
    1316 <h4 class="title" style="clear: both">Saxon-B TinyTree Example</h4>
    1317 <p id="idp845712">As a first example of the ASM concept, current work includes a proof-of-concept to
    1318                deliver a high-performance replacement for building the TinyTree data structure used
    1319                in Saxon-B 6.5.5, an open-source XSLT 2.0 processor written in Java [<a class="xref" href="#Saxon" id="idp846112">Saxon</a>]. Although XSLT stylesheets may be cached for performance, the
    1320                caching of source XML documents is typically not possible. A new TinyTree object to
    1321                represent the XML source document is thus commonly constructed with each new query so
    1322                that the overall performance of simple queries on large source XML documents is
    1323                highly dependent on TinyTree build time. Indeed, in a study of Saxon-SA, the
    1324                commercial version of Saxon, query time was shown to be dominated by TinyTree build
    1325                time [<a class="xref" href="#Kay08" id="idp847040">Kay 2008</a>]. Similar performance results are demonstrable for the
    1326                Saxon-B XSLT processor as well. </p>
    1327 <p id="idp847568"> The Saxon-B processor studied is a pure Java solution, converting a SAX (Simple
    1328                API for XML) event stream into the TinyTree Java object using the efficient Aelfred
    1329                XML parser [<a class="xref" href="#AElfred" id="idp847912">Ælfred</a>]. The TinyTree structure is itself an
    1330                array-based structure mapping well suited to the ASM concept. It consists of six
    1331                parallel arrays of integers indexed on node number and containing one entry for each
    1332                node in the source document, with the exception of attribute and namespace nodes
    1333                   [<a class="xref" href="#Saxon" id="idp848624">Saxon</a>]. Four of the arrays respectively provide node kind, name
    1334                code, depth, and next sibling information for each node, while the two others are
    1335                overloaded for different purposes based on node kind value. For example, in the
    1336                context of a text node , one of the overloaded arrays holds the text buffer offset
    1337                value whereas the other holds the text buffer length value. Attributes and namespaces
    1338                are represented using similiar parallel array of values. The stored TinyTree values
    1339                are primarily primitive Java types, however, object types such as Java Strings and
    1340                Java StringBuffers are also used to hold attribute values and comment values
    1341                respectively. </p>
    1342 <p id="idp849784"> In addition to the TinyTree object, Saxon-B maintains a NamePool object which
    1343                represents a collection of XML name triplets. Each triplet is composed of a Namespace
    1344                URI, a Namespace prefix and a local name and encoded as an integer value known as a
    1345                namecode. Namecodes permit efficient name search and look-up using integer
    1346                comparison. Namecodes may also be subsequently decoded to recover namespace and local
    1347                name information. </p>
    1348 <p id="idp850488"> Using the Parabix ILAX interface, a high-performance reimplementation of TinyTree
    1349                and NamePool data structures was built to compare with the Saxon-B implementation. In
    1350                fact, two functionally equivalent versions of the ASM java class were constructed. An
    1351                initial version was constructed based on a set of primitive Java arrays constructed
    1352                and allocated in the Java heap space via JNI New&lt;PrimitiveType&gt;Array
    1353                method call. In this version, the JVM garbage collector is aware of all memory
    1354                allocated in the native code. However, in this approach, large array copy operations
    1355                limited overall performance to approximately a 2X gain over the Saxon-B build time. </p>
    1356 <p id="idp851584">To further address the performance penalty imposed by copying large array values,
    1357                a second version of the ASM Java object was constructed based on natively backed
    1358                Direct Memory Byte Buffers [<a class="xref" href="#JNI" id="idp851944">Hitchens 2002</a>]. In this version the JVM garbage
    1359                collector is unaware any native memory resources backing the Direct Memory Byte
    1360                Buffers. Large JNI-based copy operations are avoided; however, system memory must be
    1361                explicitly deallocated via a Java native method call. Using this approach, our
    1362                preliminary results show an approximate total 2.5X gain over Saxon-B build time.
    1363             </p>
    1364 </div>
    1365 </div>
    1366 </div>
    1367 <div class="section" id="idp853000">
    1368 <h2 class="title" style="clear: both">Compiler Technology</h2>
    1369 <p id="idp853344"> An important focus of our recent work is on the development of compiler technology to
    1370          automatically generate the low-level SIMD code necessary to implement bit stream processing
    1371          given suitable high-level specifications. This has several potential benefits. First, it
    1372          can eliminate the tedious and error-prone programming of bit stream operations in terms of
    1373          register-at-a-time SIMD operations. Second, compilation technology can automatically employ
    1374          a variety of performance improvement techniques that are difficult to apply manually. These
    1375          include algorithms for instruction scheduling and register allocation as well as
    1376          optimization techniques for common subexpression expression elimination and register
    1377          rematerialization among others. Third, compiler technology makes it easier to make changes
    1378          to the low-level code for reasons of perfective or adaptive maintenance.</p>
    1379 <p id="idp854496">Beyond these reasons, compiler technology also offers the opportunity for retargetting
    1380          the generation of code to accommodate different processor architectures and API
    1381          requirements. Strategies for efficient parallel bit stream code can vary considerably
    1382          depending on processor resources such as the number of registers available, the particular
    1383          instruction set architecture supported, the size of L1 and L2 data caches, the number of
    1384          available cores and so on. Separate implementation of custom code for each processor
    1385          architecture would thus be likely to be prohibitively expensive, prone to errors and
    1386          inconsistencies and difficult to maintain. Using compilation technology, however, the idea
    1387          would be to implement a variety of processor-specific back-ends all using a common front
    1388          end based on parallel bit streams. </p>
    1389 <div class="section" id="idp855592">
    1390 <h3 class="title" style="clear: both">Character Class Compiler</h3>
    1391 <p id="idp855944">The first compiler component that we have implemented is a character class compiler,
    1392             capable of generation all the bit stream logic necessary to produce a set of lexical
    1393             item streams each corresponding to some particular set of characters to be recognized.
    1394             By taking advantage of common patterns between characters within classes, and special
    1395             optimization logic for recognizing character-class ranges, our existing compiler is able
    1396             to generate well-optimized code for complex sets of character classes involving numbers
    1397             of special characters as well as characters within specific sets of ranges. </p>
    1398 </div>
    1399 <div class="section" id="idp857544">
    1400 <h3 class="title" style="clear: both">Regular Expression Compilation</h3>
    1401 <p id="idp857904">Based on the character class compiler, we are currently investigating the
    1402             construction of a regular expression compiler that can implement bit-stream based
    1403             parallel regular-expression matching similar to that describe previously for parallel
    1404             parsing by bistream addition. This compiler works with the assumption that bitstream
    1405             regular-expression definitions are deterministic; no backtracking is permitted with the
    1406             parallel bit stream representation. In XML applications, this compiler is primarily
    1407             intended to enforce regular-expression constraints on string datatype specifications
    1408             found in XML schema. </p>
    1409 </div>
    1410 <div class="section" id="idp858856">
    1411 <h3 class="title" style="clear: both">Unbounded Bit Stream Compilation</h3>
    1412 <p id="idp859216">The Catalog of XML Bit Streams presented earlier consist of a set of abstract,
    1413             unbounded bit streams, each in one-to-one correspondence with input bytes of a text
    1414             file. Determining how these bit streams are implemented using fixed-width SIMD
    1415             registers, and possibly processed in fixed-length buffers that represent some multiple
    1416             of the register width is a source of considerable programming complexity. The general
    1417             goal of our compilation strategy in this case is to allow operations to be programmed in
    1418             terms of unbounded bit streams and then automatically reduced to efficient low-level
    1419             code with the application of a systematic code generation strategy for handling block
    1420             and buffer boundary crossing. This is work currently in progress. </p>
    1421 </div>
    1422 </div>
    1423 <div class="section" id="idp860384">
    1424 <h2 class="title" style="clear: both">Conclusion</h2>
    1425 <p id="idp860704">Parallel bit stream technology offers the opportunity to dramatically speed up the core
    1426          XML processing components used to implement virtually any XML API. Character validation and
    1427          transcoding, whitespace processing, and parsing up to including the full validation of tag
    1428          syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the
    1429          positions of all element names, attribute names and attribute values can also be produced,
    1430          followed by fast bit scan operations to generate position and length values. Beyond bit
    1431          streams, byte-oriented SIMD processing of names and numerals can also accelerate
    1432          performance beyond sequential byte-at-a-time methods. </p>
    1433 <p id="idp861640">Advances in processor architecture are likely to further amplify the performance of
    1434          parallel bit stream technology over traditional byte-at-a-time processing over the next
    1435          decade. Improvements to SIMD register width, register complement and operation format can
    1436          all result in further gains. New SIMD instruction set features such as inductive doubling
    1437          support, parallel extract and deposit instructions, bit interleaving and scatter/gather
    1438          capabilities should also result in significant speed-ups. Leveraging the intraregister
    1439          parallelism of parallel bit stream technology within SIMD registers to take of intrachip
    1440          parallelism on multicore processors should accelerate processing further. </p>
    1441 <p id="idp862592">Technology transfer using a patent-based open-source business model is a further goal of
    1442          our work with a view to widespread deployment of parallel bit stream technology in XML
    1443          processing stacks implementing a variety of APIs. The feasibility of substantial
    1444          performance improvement in replacement of technology implementing existing APIs has been
    1445          demonstrated even in complex software architectures involving delivery of performance
    1446          benefits across the JNI boundary. We are seeking to accelerate these deployment efforts
    1447          both through the development of compiler technology to reliably apply these methods to a
    1448          variety of architectures as well as to identify interested collaborators using open-source
    1449          or commercial models. </p>
    1450 </div>
    1451 <div class="section" id="idp864448">
    1452 <h2 class="title" style="clear: both">Acknowledgments</h2>
    1453 <p id="idp864792">This work is supported in part by research grants and scholarships from the Natural
    1454          Sciences and Engineering Research Council of Canada, the Mathematics of Information
    1455          Technology and Complex Systems Network and the British Columbia Innovation Council. </p>
    1456 <p id="idp865264">We thank our colleague Dan Lin (Linda) for her work in high-performance symbol table
    1457          processing. </p>
    1458 </div>
    1459 <div class="bibliography" id="idp865632">
     105<p id="idp274328"></p>
     106<p id="idp274456"></p>
     107<p id="idp274584"></p>
     108<p id="idp274712"></p>
     109</div>
     110<div class="section" id="idp274904">
     111<h2 class="title" style="clear: both">Background</h2>
     112<div class="section" id="idp275240">
     113<h3 class="title" style="clear: both">Xerces C++ Structure</h3>
     114<p id="idp275584">
     115The Xerces C++ parser
     116
     117
     118
     119
     120features comprehensive support for a variety of character encodings
     121both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
     122multiple XML vocabularies through the XML namespace
     123mechanism, as well as complete implementations
     124of structure and data validation through multiple grammars
     125declared using either legacy DTDs (document type
     126definitions) or modern XML Schema facilities.
     127Xerces also supports several APIs for accessing
     128parser services, including event-based parsing
     129using either pull parsing or SAX/SAX2 push-style
     130parsing as well as a DOM tree-based parsing interface.
     131</p>
     132<p id="idp277232">
     133
     134
     135
     136Xerces, like all traditional parsers, processes XML documents sequentially a byte-at-a-time from the
     137first to the last byte of input data. Each byte passes through several processing layers and is
     138classified and eventually validated within the context of the document state.
     139This introduces implicit dependencies between the various tasks within the application that make it
     140difficult to optimize for performance.
     141As a complex software system, no one feature dominates the overall parsing performance.
     142Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
     143Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
     144parallelization in isolation would only produce a minute improvement in performance.
     145Unfortunately, early investigation into these functions found
     146that incorporating speculation-free thread-level parallelization was impossible
     147and they were already performing well in their given tasks;
     148thus only trivial enhancements were attainable.
     149In order to obtain a systematic acceleration of Xerces,
     150it should be expected that a comprehensive restructuring
     151is required, involving all aspects of the parser.
     152</p>
     153<p id="idp279408">
     154
     155
     156
     157
     158</p>
     159</div>
     160<div class="section" id="idp280344">
     161<h3 class="title" style="clear: both">The Parabix Framework</h3>
     162<p id="idp280664">
     163The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
     164(and other forms of text processing.) The key idea is to exploit the availability of wide
     165SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
     166of input data by using one register bit per single input byte.
     167To facilitate this, the input data is first transposed into a set of basis bit streams.
     168In
     169
     170Boolean-logic operations\footnote{$∧$, $\√$ and $¬$ denote the boolean AND, OR and NOT operators.}
     171are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
     172characters (or groups of characters) with a $1$.
     173For example, one of the fundamental characters in XML is a left-angle bracket.
     174A character is a<code class="code">&lt; if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
     175b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1</code>.
     176Similarly, a character is numeric
     177<code class="code">[0-9] if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))</code>.
     178An important observation here is that ranges of characters may
     179require fewer operations than individual characters and
     180
     181multiple classes can share the classification cost.
     182</p>
     183<p id="idp409632">
     184
     185</p>
     186<p id="idp411072">
     187Consider, for example, the XML source data stream shown in the first line of .
     188The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
     189parsing, with each bit of each stream in one-to-one correspondence to the source character code units
     190of the input stream.
     191For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
     192The first bit stream shown is that for the opening
     193angle brackets that represent tag openers in XML.
     194The second and third streams show a partition of the
     195tag openers into start tag marks and end tag marks
     196depending on the character immediately following the
     197opener (i.e., ``\verb:/:'') or not.  The remaining three
     198lines show streams that can be computed in subsequent
     199parsing (using the technique
     200of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
     201attribute names and attribute values of tags. 
     202</p>
     203<p id="idp413944">
     204Two intuitions may help explain how the Parabix approach can lead
     205to improved XML parsing performance. The first is that
     206the use of the full register width offers a considerable
     207information advantage over sequential byte-at-a-time
     208parsing.  That is, sequential processing of bytes
     209uses just 8 bits of each register, greatly limiting the
     210processor resources that are effectively being used at any one time.
     211The second is that byte-at-a-time loop scanning loops are actually
     212often just computing a single bit of information per iteration:
     213is the scan complete yet?
     214Rather than computing these individual decision-bits, an approach that computes
     215many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
     216should provide substantial benefit.
     217</p>
     218<p id="idp414904">
     219Previous studies have shown that the Parabix approach improves many aspects of XML processing,
     220including transcoding \cite{Cameron2008}, character classification and validation,
     221tag parsing and well-formedness checking. 
     222The first Parabix parser used processor bit scan instructions to considerably accelerate
     223sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
     224Recent work has incorporated a method of parallel
     225scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as
     226well as combining SIMD methods with 4-stage pipeline parallelism to further improve
     227throughput \cite{HPCA2012}.
     228Although these research prototypes handled the full syntax of schema-less XML documents,
     229they lacked the functionality required by full XML parsers.
     230</p>
     231<p id="idp415880">
     232Commercial XML processors support transcoding of multiple character sets and can parse and
     233validate against multiple document vocabularies.
     234Additionally, they provide API facilities beyond those found in research prototypes,
     235including the widely used SAX, SAX2 and DOM interfaces.
     236</p>
     237</div>
     238<div class="section" id="idp416432">
     239<h3 class="title" style="clear: both">Sequential vs. Parallel Paradigm</h3>
     240<p id="idp416752">
     241Xerces—like all traditional XML parsers—processes XML documents sequentially.
     242Each character is examined to distinguish between the
     243XML-specific markup, such as a left angle bracket <code class="code">&lt;</code>, and the
     244content held within the document. 
     245As the parser progresses through the document, it alternates between markup scanning,
     246validation and content processing modes.
     247</p>
     248<p id="idp417824">
     249In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
     250  Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
     251  current state and the events associated with transitions of states.}.
     252Each state transition indicates the processing context of subsequent characters.
     253Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
     254</p>
     255<p id="idp418488">
     256Parabix-style XML parsers utilize a concept of layered processing.
     257A block of source text is transformed into a set of lexical \bitstream{}s,
     258which undergo a series of operations that can be grouped into logical layers,
     259e.g., transposition, character classification, and lexical analysis.
     260Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
     261To meet the API requirements of the document-ordered Xerces output,
     262the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
     263</p>
     264</div>
     265</div>
     266<div class="section" id="idp419384">
     267<h2 class="title" style="clear: both">Architecture</h2>
     268<div class="section" id="idp419704">
     269<h3 class="title" style="clear: both">Overview</h3>
     270<p id="idp420152">
     271\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
     272rearchitected with pipeline parallelism in mind.
     273In this section, we highlight the core differences between the two systems.
     274As shown in Figure \ref{fig:xerces-arch}, Xerces
     275is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
     276The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
     277the majority of the character set encoding validation is performed as a byproduct of this process.
     278The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
     279It tracks the current line/column position,
     280
     281performs line-break normalization and validates context-specific character set issues,
     282such as tokenization of qualified-names.
     283The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
     284of the document; it deals with all issues related to entity expansion, validates
     285the XML well-formedness constraints and any character set encoding issues that cannot
     286be completely handled by the reader or transcoder (e.g., surrogate characters, validation
     287and normalization of character references, etc.)
     288The {\it Namespace Binder} is a core piece of the element stack.
     289It handles namespace scoping issues between different XML vocabularies.
     290This allows the scanner to properly select the correct schema grammar structures.
     291The {\it Validator} takes the IR produced by the Scanner (and
     292potentially annotated by the Namespace Binder) and assesses whether the final output matches
     293the user-defined DTD and schema grammar(s) before passing it to the end-user.
     294</p>
     295<p id="idp422304">
     296
     297</p>
     298<p id="idp422560">
     299In \icXML{} functions are grouped into logical components.
     300As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
     301All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
     302The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
     303mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
     304set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
     305These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
     306after additional processing is performed.
     307The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
     308It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
     309significant positions within the input data. One \bitstream{} for each of the critical piece of information is created, such as
     310the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
     311Intra-element well-formedness validation is performed as an artifact of this process.
     312Like Xerces, \icXML{} must provide the Line and Column position of each error.
     313The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
     314optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
     315From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
     316</p>
     317<p id="idp426184">
     318A typical XML file contains few unique element and attribute names—but each of them will occur frequently.
     319\icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID).
     320Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
     321the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
     322</p>
     323<p id="idp427352">
     324The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
     325The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
     326information provided by the Parallel Markup Parser, and the latter transforms the
     327filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
     328</p>
     329<p id="idp427944">
     330Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document.
     331The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
     332The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
     333to perform in bit space, such as ensuring every start tag has a matching end tag.
     334Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
     335it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are
     336associated with each symbol occurrence.
     337This is discussed in Section \ref{section:arch:namespacehandling}.
     338Finally, the {\it Validation} layer implements the Xerces's validator.
     339However, preprocessing associated with each symbol greatly reduces the work of this stage.
     340</p>
     341<p id="idp428992">
     342
     343</p>
     344</div>
     345<div class="section" id="idp429312">
     346<h3 class="title" style="clear: both">Character Set Adapters</h3>
     347<p id="idp429936">
     348In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and
     349provide the end-consumer with a single encoding format.
     350In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
     351because of the need to decode and classify each byte of input, mapping variable-length UTF-8
     352byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   
     353In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
     354transcoding imposes at least a cost of buffer copying.
     355</p>
     356<p id="idp430720">
     357In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
     358Given a specified input encoding, a CSA is responsible for checking that
     359input code units represent valid characters, mapping the characters of the encoding into
     360the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item
     361streams), as well as supporting ultimate transcoding requirements.   All of this work
     362is performed using the parallel \bitstream{} representation of the source input.
     363</p>
     364<p id="idp431448">
     365An important observation is that many character sets are an
     366extension to the legacy 7-bit ASCII character set.  This includes the
     367various ISO Latin character sets, UTF-8, UTF-16 and many others.
     368Furthermore, all significant characters for parsing XML are confined to the
     369ASCII repertoire.   Thus, a single common set of lexical item calculations
     370serves to compute lexical item streams for all such ASCII-based character sets.
     371</p>
     372<p id="idp432080">
     373A second observation is that—regardless of which character set is used—quite
     374often all of the characters in a particular block of input will be within the ASCII range.
     375This is a very simple test to perform using the \bitstream{} representation, simply confirming that the
     376bit 0 stream is zero for the entire block.   For blocks satisfying this test,
     377all logic dealing with non-ASCII characters can simply be skipped.
     378Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the
     379UTF-16 form are each set to zero in this case.
     380</p>
     381<p id="idp433648">
     382A third observation is that repeated transcoding of the names of XML
     383elements, attributes and so on can be avoided by using a look-up mechanism.
     384That is, the first occurrence of each symbol is stored in a look-up
     385table mapping the input encoding to a numeric symbol ID.   Transcoding
     386of the symbol is applied at this time.  Subsequent look-up operations
     387can avoid transcoding by simply retrieving the stored representation.
     388As symbol look up is required to apply various XML validation rules,
     389there is achieves the effect of transcoding each occurrence without
     390additional cost.
     391</p>
     392<p id="idp434432">
     393The cost of individual character transcoding is avoided whenever a block of input is
     394confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
     395Furthermore, when transcoding is required, the parallel \bitstream{} representation
     396supports efficient transcoding operations.   
     397In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s
     398can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
     399and all but the final bytes of multi-byte sequences can be marked for deletion as
     400discussed in the following subsection.
     401In other cases, transcoding within a block only need be applied for non-ASCII
     402bytes, which are conveniently identified by iterating through the bit 0 stream
     403using bit scan operations.
     404</p>
     405</div>
     406<div class="section" id="idp435504">
     407<h3 class="title" style="clear: both">Combined Parallel Filtering</h3>
     408<p id="idp435824">
     409As just mentioned, UTF-8 to UTF-16 transcoding involves marking
     410all but the last bytes of multi-byte UTF-8 sequences as
     411positions for deletion.   For example, the two
     412Chinese characters \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}
     413are represented as two three-byte UTF-8 sequences \verb'E4 BD A0'
     414and \verb'E5 A5 BD' while the UTF-16 representation must be
     415compressed down to the two code units \verb'4F60' and \verb'597D'.
     416In the bit parallel representation, this corresponds to a reduction
     417from six bit positions representing UTF-8 code units (bytes)
     418down to just two bit positions representing UTF-16 code units
     419(double bytes).   This compression may be achieved by
     420arranging to calculate the correct UTF-16 bits at the
     421final position of each sequence and creating a deletion
     422mask to mark the first two bytes of each 3-byte sequence
     423for deletion.   In this case, the portion of the mask
     424corresponding to these input bytes is the bit sequence
     425\verb'110110'.  Using this approach, transcoding may then be
     426completed by applying parallel deletion and inverse transposition of the
     427UTF-16 \bitstream{}s\cite{Cameron2008}.
     428</p>
     429<p id="idp438672">
     430
     431
     432
     433
     434
     435
     436
     437
     438
     439</p>
     440<p id="idp441152">
     441Rather than immediately paying the
     442costs of deletion and transposition just for transcoding,
     443however, \icXML{} defers these steps so that the deletion
     444masks for several stages of processing may be combined.
     445In particular, this includes core XML requirements
     446to normalize line breaks and to replace character
     447reference and entity references by their corresponding
     448text.   In the case of line break normalization,
     449all forms of line breaks, including bare carriage
     450returns (CR), line feeds (LF) and CR-LF combinations
     451must be normalized to a single LF character in
     452each case.   In \icXML{}, this is achieved by
     453first marking CR positions, performing two
     454bit parallel operations to transform the marked
     455CRs into LFs, and then marking for deletion any
     456LF that is found immediately after the marked CR
     457as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
     458
     459</p>
     460<p id="idp441472">
     461In essence, the deletion masks for transcoding and
     462for line break normalization each represent a bitwise
     463filter; these filters can be combined using bitwise-or
     464so that the parallel deletion algorithm need only be
     465applied once.
     466</p>
     467<p id="idp443472">
     468A further application of combined filtering
     469is the processing of XML character and entity
     470references.   Consider, for example, the references <code class="code">&amp;</code> or <code class="code">&lt;</code>.
     471which must be replaced in XML processing with 
     472the single <code class="code">&amp;</code> and <code class="code">&lt;</code> characters, respectively.
     473The approach in \icXML{} is to mark all but the first character
     474positions of each reference for deletion, leaving a
     475single character position unmodified.  Thus, for the
     476references <code class="code">&amp;</code> or <code class="code">&lt;</code> the
     477masks <code class="code">01111</code> and <code class="code">011111</code> are formed and
     478combined into the overall deletion mask.   After the
     479deletion and inverse transposition operations are finally
     480applied, a post-processing step inserts the proper character
     481at these positions.   One note about this process is
     482that it is speculative; references are assumed to generally
     483be replaced by a single UTF-16 code unit.   In the case,
     484that this is not true, it is addressed in post-processing.
     485</p>
     486<p id="idp446576">
     487The final step of combined filtering occurs during
     488the process of reducing markup data to tag bytes
     489preceding each significant XML transition as described
     490in section~\ref{section:arch:contentstream}.  Overall, \icXML{}
     491avoids separate buffer copying operations for each of the
     492these filtering steps, paying the cost of parallel
     493deletion and inverse transposition only once. 
     494Currently, \icXML{} employs the parallel-prefix compress algorithm
     495of Steele~\cite{HackersDelight}  Performance
     496is independent of the number of positions deleted.
     497Future versions of \icXML{} are expected to
     498take advantage of the parallel extract operation~\cite{HilewitzLee2006}
     499that Intel is now providing in its Haswell architecture.
     500</p>
     501</div>
     502<div class="section" id="idp249920">
     503<h3 class="title" style="clear: both">Content Stream</h3>
     504<p id="idp250240">
     505A relatively-unique concept for \icXML{} is the use of a filtered content stream.
     506Rather that parsing an XML document in its original format, the input is transformed
     507into one that is easier for the parser to iterate through and produce the sequential
     508output.
     509In , the source data
     510
     511is transformed into
     512
     513
     514through the parallel filtering algorithm, described in section \ref{sec:parfilter}.
     515</p>
     516<p id="idp251840">
     517Combined with the symbol stream, the parser traverses the content stream to effectively
     518reconstructs the input document in its output form.
     519The initial {\tt\it 0} indicates an empty content string. The following \verb|&gt;|
     520indicates that a start tag without any attributes is the first element in this text and
     521the first unused symbol, <code class="code">document</code>, is the element name.
     522Succeeding that is the content string <code class="code">fee</code>, which is null-terminated in accordance
     523with the Xerces API specification. Unlike Xerces, no memory-copy operations
     524are required to produce these strings, which as Figure~\ref{fig:xerces-profile} shows
     525accounts for 6.83% of Xerces's execution time.
     526Additionally, it is cheap to locate the terminal character of each string:
     527using the String End \bitstream{}, the \PS{} can effectively calculate the offset of each
     528null character in the content stream in parallel, which in turn means the parser can
     529directly jump to the end of every string without scanning for it.
     530</p>
     531<p id="idp253376">
     532Following ``\verb`fee`'' is a \verb`=`, which marks the existence of an attribute.
     533Because all of the intra-element was performed in the \PS{}, this must be a legal attribute.
     534Since attributes can only occur within start tags and must be accompanied by a textual value,
     535the next symbol in the symbol stream must be the element name of a start tag,
     536and the following one must be the name of the attribute and the string that follows the \verb`=` must be its value.
     537However, the subsequent \verb`=` is not treated as an independent attribute because the parser has yet to
     538read a \verb`&gt;`, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and
     539it (along with the string value) is added to the element.
     540Eventually the parser reaches a \verb`/`, which marks the existence of an end tag. Every end tag requires an
     541element name, which means they require a symbol. Inter-element validation whenever an empty tag is detected to
     542ensure that the appropriate scope-nesting rules have been applied.
     543</p>
     544</div>
     545<div class="section" id="idp254672">
     546<h3 class="title" style="clear: both">Namespace Handling</h3>
     547<p id="idp255320">
     548In XML, namespaces prevents naming conflicts when multiple vocabularies are used together.
     549It is especially important when a vocabulary application-dependant meaning, such as when
     550XML or SVG documents are embedded within XHTML files.
     551Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify
     552specific names or resources.
     553On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML
     554processor to bind the prefix <code class="code">p</code> to the URI '<code class="code">pub.net</code>' and the default (empty)
     555prefix to <code class="code">book.org</code>. Thus to the XML processor, the \verb|title| on line 2 and
     556\verb|price| on line 4 both read as \verb|"book.org":title| and \verb|"book.org":price|
     557respectively, whereas on line 3 and 5, \verb|p:name| and \verb|price| are seen as
     558\verb|"pub.net":name| and \verb|"pub.net":price|. Even though the actual element name
     559\verb|price|, due to namespace scoping rules they are viewed as two uniquely-named items
     560because the current vocabulary is determined by the namespace(s) that are in-scope.
     561</p>
     562<p id="idp257760">
     563
     564</p>
     565<p id="idp463944">
     566In both Xerces and \icXML{}, every URI has a one-to-one mapping to a URI ID.
     567These persist for the lifetime of the application through the use of a global URI pool.
     568Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
     569in the document. Because a namespace declaration affects the entire element, it must be processed prior to
     570grammar validation. This is a costly process considering that a typical namespaced XML document only comes
     571in one of two forms:
     572(1) those that declare a set of namespaces upfront and never change them, and
     573(2) those that repeatedly modify the namespaces in predictable patterns.
     574</p>
     575<p id="idp464136">
     576For that reason, \icXML{} contains an independent namespace stack and utilizes bit vectors to cheaply perform
     577
     578
     579When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps
     580the prefix (which are assigned Prefix IDs in the symbol resolution process) to the URI.
     581Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
     582NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
     583prefix binding set of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively.
     584To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
     585maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
     586NSID can be found using a bit-scan intrinsic.
     587A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
     588</p>
     589<p id="idp466872">
     590
     591</p>
     592<p id="idp467128">
     593
     594
     595
     596
     597</p>
     598<p id="idp468584">
     599To ensure that scoping rules are adhered to,
     600whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored
     601within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
     602currently visible namespaces is XORed with the vector at the top of the stack.
     603This allows any number of changes to be performed at each scope-level with a constant time.
     604
     605</p>
     606</div>
     607<div class="section" id="idp469568">
     608<h3 class="title" style="clear: both">Error Handling</h3>
     609<p id="idp469912">
     610
     611Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
     612As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
     613if not, the current state determines the type and severity of the error.
     614\icXML{} emits errors in the similar manner—but how it discovers them is substantially different.
     615Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and \MP{},
     616each with its own system for detecting and producing error messages.
     617</p>
     618<p id="idp471056">
     619Within the \PS{}, all computations are performed in parallel, a block at a time.
     620Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte-position of an error within a block,
     621and the type of error is determined by the equation that discovered it.
     622The difficulty of error processing in this section is that in Xerces the line and column number must be given
     623with every error production. Two major issues exist because of this:
     624(1) line position adheres to XML white-normalization rules; as such, some sequences of characters, e.g., a carriage return
     625followed by a line feed, are counted as a single new line character.
     626(2) column position is counted in characters, not bytes or code units;
     627thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
     628Note that typical XML documents are error-free but the calculation of the
     629line/column position is a constant overhead in Xerces.
     630To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and
     631performs the minimal amount of book-keeping necessary to facilitate it.
     632\icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
     633within the Line Column Tracker (LCT).
     634One of the CSA's major responsibilities is transcoding an input text.
     635During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
     636and validated.
     637A {\it line-feed \bitstream{}}, which marks the positions of the normalized new lines characters, is a natural derivative of
     638this process.
     639Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
     640
     641Column position is more difficult to calculate.
     642It is possible to scan backwards through the \bitstream{} of new line characters to determine the distance (in code-units)
     643between the position between which an error was detected and the last line feed. However, this distance may exceed
     644than the actual character position for the reasons discussed in (2).
     645To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s,
     646such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
     647normalization process.
     648When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
     649column number.
     650</p>
     651<p id="idp474368">
     652The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces.
     653However, reporting the correct line/column is a much more difficult problem.
     654The \MP{} parses the content stream, which is a series of tagged UTF-16 strings.
     655Each string is normalized in accordance with the XML specification.
     656All symbol data and unnecessary whitespace is eliminated from the stream;
     657thus its impossible to derive the current location using only the content stream.
     658To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
     659the line-feed, skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every
     660datum that was suppressed from the source during the production of the content stream.
     661Armed with these, it is possible to calculate the actual line/column using
     662the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
     663</p>
     664</div>
     665</div>
     666<div class="section" id="idp476680">
     667<h2 class="title" style="clear: both">Multithreading with Pipeline Parallelism</h2>
     668<p id="idp477000">
     669As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application.
     670These are ``embarrassingly sequential.''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
     671However, \icXML{} is designed to organize processing into logical layers.   
     672In particular, layers within the \PS{} are designed to operate
     673over significant segments of input data before passing their outputs on for
     674subsequent processing.  This fits well into the general model of pipeline
     675parallelism, in which each thread is in charge of a single module or group
     676of modules.
     677</p>
     678<p id="idp477784">
     679The most straightforward division of work in \icXML{} is to separate
     680the \PS{} and the \MP{} into distinct logical layers into two separate stages.
     681The resultant application, {\it\icXMLp{}}, is a course-grained software-pipeline application.
     682In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the
     683content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$.
     684The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation,
     685and the provides parsed XML data to the application through the Xerces API. 
     686The shared data structure is implemented using a ring buffer,
     687where every entry contains an independent set of data streams.
     688In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
     689A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
     690In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than $T_2$;
     691thus $T_2$ always waits for $T_1$ to write to the shared memory.
     692Figure \ref{threads_timeline2} illustrates the scenario in which $T_1$ is faster
     693and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
     694</p>
     695<p id="idp479272">
     696
     697</p>
     698<p id="idp479528">
     699Overall, our design is intended to benefit a range of applications.
     700Conceptually, we consider two design points.
     701The first, the parsing performed by the \PS{} dominates at 67% of the overall cost,
     702with the cost of application processing (including the driver logic within the \MP{}) at 33%.   
     703The second is almost the opposite scenario, the cost of application processing dominates at 60%,
     704while the cost of XML parsing represents an overhead of 40%.
     705</p>
     706<p id="idp479720">
     707Our design is predicated on a goal of using the Parabix
     708framework to achieve a 50% to 100% improvement in the parsing engine itself.   
     709In a best case scenario,
     710a 100% improvement of the \PS{} for the design point in which
     711XML parsing dominates at 67% of the total application cost.
     712In this case, the single-threaded \icXML{} should achieve a 1.5x speedup over Xerces
     713so that the total application cost reduces to 67% of the original. 
     714However, in \icXMLp{}, our ideal scenario gives us two well-balanced threads
     715each performing about 33% of the original work.   
     716In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.
     717</p>
     718<p id="idp481424">
     719At the other extreme of our design range, we consider an application
     720in which core parsing cost is 40%.   Assuming the 2x speedup of
     721the \PS{} over the corresponding Xerces core, single-threaded
     722\icXML{} delivers a 25% speedup.   However, the most significant
     723aspect of our two-stage multi-threaded design then becomes the
     724ability to hide the entire latency of parsing within the serial time
     725required by the application.   In this case, we achieve
     726an overall speedup in processing time by 1.67x.
     727</p>
     728<p id="idp482624">
     729Although the structure of the \PS{} allows division of the work into
     730several pipeline stages and has been demonstrated to be effective
     731for four pipeline stages in a research prototype \cite{HPCA2012},
     732our analysis here suggests that the further pipelining of work within
     733the \PS{} is not worthwhile if the cost of application logic is little as
     73433% of the end-to-end cost using Xerces.  To achieve benefits of
     735further parallelization with multi-core technology, there would
     736need to be reductions in the cost of application logic that
     737could match reductions in core parsing cost.
     738</p>
     739</div>
     740<div class="section" id="idp483472">
     741<h2 class="title" style="clear: both">Performance</h2>
     742<p id="idp483808">
     743We evaluate \xerces{}, \icXML{}, \icXMLp{} against two benchmarking applications:
     744the Xerces C++ SAXCount sample application,
     745and a real world GML to SVG transformation application.
     746We investigated XML parser performance using an Intel Core i7 quad-core
     747(Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core),
     74832+32 kB (per core) L1 cache,
     749256 kB (per core) L2 cache,
     7508 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
     751</p>
     752<p id="idp484472">
     753We analyzed the execution profiles of each XML parser
     754using the performance counters found in the processor.
     755We chose several key hardware events that provide insight into the profile of each
     756application and indicate if the processor is doing useful work. 
     757The set of events included in our study are:
     758processor cycles, branch instructions, branch mispredictions,
     759and cache misses. The Performance Application Programming Interface
     760(PAPI) Version 5.5.0 \cite{papi} toolkit
     761was installed on the test system to facilitate the
     762collection of hardware performance monitoring
     763statistics. In addition, we used the Linux perf \cite{perf} utility
     764to collect per core hardware events.
     765</p>
     766<div class="section" id="idp485352">
     767<h3 class="title" style="clear: both">Xerces C++ SAXCount</h3>
     768<p id="idp485696">
     769Xerces comes with sample applications that demonstrate salient features of the parser.
     770SAXCount is the simplest such application:
     771it counts the elements, attributes and characters of a given XML file using the (event based) SAX API
     772and prints out the totals.
     773</p>
     774<p id="idp486160">
     775
     776</p>
     777<p id="idp486432">
     778Table \ref{XMLDocChars} shows the document characteristics of the XML input
     779files selected for the Xerces C++ SAXCount benchmark. The jaw.xml
     780represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequence
     781required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented
     782XML documents and consist entirely of single byte encoded ASCII characters.
     783</p>
     784<p id="idp486624">
     785A key predictor of the overall parsing performance of an XML file is markup density\footnote{
     786  Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}.
     787This metric has substantial influence on the performance of traditional recursive descent XML parsers
     788because it directly corresponds to the number of state transitions that occur when parsing a document.
     789We use a mixture of document-oriented and
     790data-oriented XML files to analyze performance over a spectrum
     791of markup densities.
     792</p>
     793<p id="idp488376">
     794Figure \ref{perf_SAX} compares the performance of Xerces, \icXML{} and pipelined \icXML{} in terms of
     795CPU cycles per byte for the SAXCount application.
     796The speedup for \icXML{} over Xerces is 1.3x to 1.8x.
     797With two threads on the multicore machine, \icXMLp{} can achieve speedup up to 2.7x.
     798Xerces is substantially slowed by dense markup
     799but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques.
     800\icXMLp{} performs better as markup-density increases because the work performed by each stage is
     801well balanced in this application.
     802</p>
     803<p id="idp489744">
     804
     805</p>
     806</div>
     807<div class="section" id="idp490336">
     808<h3 class="title" style="clear: both">GML2SVG</h3>
     809<p id="idp490656"></p>
     810</div>
     811</div>
     812<div class="section" id="idp490912">
     813<h2 class="title" style="clear: both">Conclusion and Future Work</h2>
     814<p id="idp491264">
     815This paper is the first case study documenting the significant
     816performance benefits that may be realized through the integration
     817of parallel \bitstream{} technology into existing widely-used software libraries.
     818In the case of the Xerces-C++ XML parser, the
     819combined integration of SIMD and multicore parallelism was
     820shown capable of dramatic producing dramatic increases in
     821throughput and reductions in branch mispredictions and cache misses.
     822The modified parser, going under the name \icXML{} is designed
     823to provide the full functionality of the original Xerces library
     824with complete compatibility of APIs.  Although substantial
     825re-engineering was required to realize the
     826performance potential of parallel technologies, this
     827is an important case study demonstrating the general
     828feasibility of these techniques.
     829</p>
     830<p id="idp492280">
     831The further development of \icXML{} to move beyond 2-stage
     832pipeline parallelism is ongoing, with realistic prospects for
     833four reasonably balanced stages within the library.  For
     834applications such as GML2SVG which are dominated by time
     835spent on XML parsing, such a multistage pipelined parsing
     836library should offer substantial benefits. 
     837</p>
     838<p id="idp492824">
     839The example of XML parsing may be considered prototypical
     840of finite-state machines applications which have sometimes
     841been considered ``embarassingly sequential'' and so
     842difficult to parallelize that ``nothing works.''  So the
     843case study presented here should be considered an important
     844data point in making the case that parallelization can
     845indeed be helpful across a broad array of application types.
     846</p>
     847<p id="idp493424">
     848To overcome the software engineering challenges in applying
     849parallel \bitstream{} technology to existing software systems,
     850it is clear that better library and tool support is needed.
     851The techniques used in the implementation of \icXML{} and
     852documented in this paper could well be generalized for
     853applications in other contexts and automated through
     854the creation of compiler technology specifically supporting
     855parallel \bitstream{} programming.
     856</p>
     857</div>
     858<div class="bibliography" id="idp494392">
    1460859<h2 class="title" style="clear:both">Bibliography</h2>
    1461 <p class="bibliomixed" id="XMLChip09"><a href="#idp500496">[Leventhal and Lemoine 2009] </a>Leventhal, Michael and
     860<p class="bibliomixed" id="XMLChip09">[Leventhal and Lemoine 2009] Leventhal, Michael and
    1462861         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
    1463862         Processing XML Efficiently 2009, Montréal.</p>
    1464 <p class="bibliomixed" id="Datapower09"><a href="#idp506976">[Salz, Achilles and Maze 2009] </a>Salz, Richard,
     863<p class="bibliomixed" id="Datapower09">[Salz, Achilles and Maze 2009] Salz, Richard,
    1465864         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
    1466865         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
    1467866         Efficiently 2009, Montréal.</p>
    1468 <p class="bibliomixed" id="PPoPP08"><a href="#idp507840">[Cameron 2007] </a>Cameron, Robert D. 2007. A Case Study
     867<p class="bibliomixed" id="PPoPP08">[Cameron 2007] Cameron, Robert D. 2007. A Case Study
    1469868         in SIMD Text Processing with Parallel Bit Streams UTF-8 to UTF-16 Transcoding. Proceedings
    1470869         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
    1471870         Lake City, Utah. On the Web at <a href="http://research.ihost.com/ppopp08/" class="link" target="_new">http://research.ihost.com/ppopp08/</a>.</p>
    1472 <p class="bibliomixed" id="CASCON08"><a href="#idp508744">[Cameron, Herdy and Lin 2008] </a>Cameron, Robert D.,
     871<p class="bibliomixed" id="CASCON08">[Cameron, Herdy and Lin 2008] Cameron, Robert D.,
    1473872         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
    1474873         Technology. Proceedings of CASCON 2008. 13th ACM SIGPLAN Symposium on Principles and
    1475874         Practice of Parallel Programming 2008, Toronto.</p>
    1476 <p class="bibliomixed" id="SVGOpen08"><a href="#idp509256">[Herdy, Burggraf and Cameron 2008] </a>Herdy, Kenneth
     875<p class="bibliomixed" id="SVGOpen08">[Herdy, Burggraf and Cameron 2008] Herdy, Kenneth
    1477876         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
    1478877         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
     
    1480879         Nuremburg. On the Web at
    1481880            <a href="http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/" class="link" target="_new">http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/</a>.</p>
    1482 <p class="bibliomixed" id="Ross06"><a href="#idp826512">[Ross 2006] </a>Ross, Kenneth A. 2006. Efficient hash
     881<p class="bibliomixed" id="Ross06">[Ross 2006] Ross, Kenneth A. 2006. Efficient hash
    1483882         probes on modern processors. Proceedings of ICDE, 2006. ICDE 2006, Atlanta. On the Web at
    1484883            <a href="www.cs.columbia.edu/~kar/pubsk/icde2007.pdf" class="link" target="_new">www.cs.columbia.edu/~kar/pubsk/icde2007.pdf</a>.</p>
    1485 <p class="bibliomixed" id="ASPLOS09"><a href="#idp638256">[Cameron and Lin 2009] </a>Cameron, Robert D. and Dan
     884<p class="bibliomixed" id="ASPLOS09">[Cameron and Lin 2009] Cameron, Robert D. and Dan
    1486885         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
    1487886         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</p>
    1488 <p class="bibliomixed" id="Wu08"><a href="#idp639856">[Wu et al. 2008] </a>Wu, Yu, Qi Zhang, Zhiqiang Yu and
     887<p class="bibliomixed" id="Wu08">[Wu et al. 2008] Wu, Yu, Qi Zhang, Zhiqiang Yu and
    1489888         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
    1490889         Proceedings of Balisage 2008, Montréal. On the Web at
    1491890            <a href="http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html" class="link" target="_new">http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html</a>.</p>
    1492 <p class="bibliomixed" id="u8u16"><a href="#idp699152">[Cameron 2008] </a>u8u16 - A High-Speed UTF-8 to UTF-16
     891<p class="bibliomixed" id="u8u16">[Cameron 2008] u8u16 - A High-Speed UTF-8 to UTF-16
    1493892         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
    1494893         Science Simon Fraser University, June 21 2007.</p>
    1495 <p class="bibliomixed" id="XML10"><a href="#idp756320">[XML 1.0] </a>Extensible Markup Language (XML) 1.0 (Fifth
     894<p class="bibliomixed" id="XML10">[XML 1.0] Extensible Markup Language (XML) 1.0 (Fifth
    1496895         Edition) W3C Recommendation 26 November 2008. On the Web at
    1497896            <a href="http://www.w3.org/TR/REC-xml/" class="link" target="_new">http://www.w3.org/TR/REC-xml/</a>.</p>
    1498 <p class="bibliomixed" id="Unicode"><a href="#idp783160">[Unicode] </a>The Unicode Consortium. 2009. On the Web at
     897<p class="bibliomixed" id="Unicode">[Unicode] The Unicode Consortium. 2009. On the Web at
    1499898            <a href="http://unicode.org/" class="link" target="_new">http://unicode.org/</a>.</p>
    1500 <p class="bibliomixed" id="Pex06"><a href="#idp638640">[Hilewitz and Lee 2006] </a> Hilewitz, Y. and Ruby B. Lee.
     899<p class="bibliomixed" id="Pex06">[Hilewitz and Lee 2006] Hilewitz, Y. and Ruby B. Lee.
    1501900         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
    1502901         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
    1503902         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</p>
    1504 <p class="bibliomixed" id="InfoSet"><a href="#idp839600">[XML Infoset] </a>XML Information Set (Second Edition) W3C
     903<p class="bibliomixed" id="InfoSet">[XML Infoset] XML Information Set (Second Edition) W3C
    1505904         Recommendation 4 February 2004. On the Web at
    1506905         <a href="http://www.w3.org/TR/xml-infoset/" class="link" target="_new">http://www.w3.org/TR/xml-infoset/</a>.</p>
    1507 <p class="bibliomixed" id="Saxon"><a href="#idp846112">[Saxon] </a>SAXON The XSLT and XQuery Processor. On the Web
     906<p class="bibliomixed" id="Saxon">[Saxon] SAXON The XSLT and XQuery Processor. On the Web
    1508907         at <a href="http://saxon.sourceforge.net/" class="link" target="_new">http://saxon.sourceforge.net/</a>.</p>
    1509 <p class="bibliomixed" id="Kay08"><a href="#idp847040">[Kay 2008] </a> Kay, Michael Y. 2008. Ten Reasons Why Saxon
     908<p class="bibliomixed" id="Kay08">[Kay 2008] Kay, Michael Y. 2008. Ten Reasons Why Saxon
    1510909         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</p>
    1511 <p class="bibliomixed" id="AElfred"><a href="#idp847912">[Ælfred] </a> The Ælfred XML Parser. On the Web at
     910<p class="bibliomixed" id="AElfred">[Ælfred] The Ælfred XML Parser. On the Web at
    1512911            <a href="http://saxon.sourceforge.net/aelfred.html" class="link" target="_new">http://saxon.sourceforge.net/aelfred.html</a>.</p>
    1513 <p class="bibliomixed" id="JNI"><a href="#idp851944">[Hitchens 2002] </a>Hitchens, Ron. Java NIO. O'Reilly, 2002.</p>
    1514 <p class="bibliomixed" id="Expat"><a href="#idp834992">[Expat] </a>The Expat XML Parser.
     912<p class="bibliomixed" id="JNI">[Hitchens 2002] Hitchens, Ron. Java NIO. O'Reilly, 2002.</p>
     913<p class="bibliomixed" id="Expat">[Expat] The Expat XML Parser.
    1515914            <a href="http://expat.sourceforge.net/" class="link" target="_new">http://expat.sourceforge.net/</a>.</p>
    1516915</div>
  • docs/Balisage13/Bal2013came0601/Bal2013came0601.xml

    r3038 r3039  
    44<article xmlns="http://docbook.org/ns/docbook" version="5.0-subset Balisage-1.3"
    55  xml:id="HR-23632987-8973">
    6    <title>Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</title>
     6   <title></title>
    77   <info>
    88<!--
     
    1414-->
    1515      <abstract>
    16          <para>By first transforming the octets (bytes) of XML texts into eight parallel bit
    17             streams, the SIMD features of commodity processors can be exploited for parallel
    18             processing of blocks of 128 input bytes at a time. Established transcoding and parsing
    19             techniques are reviewed followed by new techniques including parsing with bitstream
    20             addition. Further opportunities are discussed in light of expected advances in CPU
    21             architecture and compiler technology. Implications for various APIs and information
    22             models are presented as well opportunities for collaborative open-source
    23          development.</para>
     16         <para>Prior research on the acceleration of XML processing
     17using SIMD and multi-core parallelism has lead to
     18a number of interesting research prototypes.  This work
     19investigates the extent to which the techniques underlying
     20these prototypes could result in systematic performance
     21benefits when fully integrated into a commercial XML parser.
     22The widely used Xerces-C++ parser of the Apache Software
     23Foundation was chosen as the foundation for the study.
     24A systematic restructuring of the parser was undertaken,
     25while maintaining the existing API for application programmers.
     26Using SIMD techniques alone, an increase in parsing speed
     27of at least 50% was observed in a range of applications.
     28When coupled with pipeline parallelism on dual core processors,
     29improvements of 2x and beyond were realized.
     30</para>
    2431      </abstract>
     32      <author>
     33         <personname>
     34            <firstname>Nigel</firstname>
     35            <surname>Medforth</surname>
     36         </personname>
     37         <personblurb>
     38            <para></para>
     39         </personblurb>
     40         <affiliation>
     41            <jobtitle></jobtitle>
     42            <orgname></orgname>
     43         </affiliation>
     44         <email></email>
     45      </author>
     46      <author>
     47         <personname>
     48            <firstname>Dan</firstname>
     49            <surname>Lin</surname>
     50         </personname>
     51         <personblurb>
     52            <para></para>
     53         </personblurb>
     54         <affiliation>
     55            <jobtitle></jobtitle>
     56            <orgname></orgname>
     57         </affiliation>
     58         <email></email>
     59      </author>
     60      <author>
     61         <personname>
     62            <firstname>Kenneth</firstname>
     63            <surname>Herdy</surname>
     64         </personname>
     65         <personblurb>
     66            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
     67               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
     68               of Science in Computing Science with a Certificate in Spatial Information Systems at
     69               Simon Fraser University in 2005.
     70                                                </para>
     71            <para> Ken is currently pursuing graduate studies in Computing Science at Simon Fraser
     72               University with industrial scholarship support from the Natural Sciences and
     73               Engineering Research Council of Canada, the Mathematics of Information Technology and
     74               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
     75               of the principal techniques that may be used to improve XML processing performance in
     76               the context of the Geography Markup Language (GML).
     77                                                </para>
     78         </personblurb>
     79         <affiliation>
     80            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
     81            <orgname>Simon Fraser University </orgname>
     82         </affiliation>
     83         <email>ksherdy@sfu.ca</email>
     84      </author>
    2585      <author>
    2686         <personname>
     
    3696               patentleft evangelist, advocating university-based technology transfer models
    3797               dedicated to free use in open source. </para>
    38 
    3998         </personblurb>
    4099         <affiliation>
     
    46105      <author>
    47106         <personname>
    48             <firstname>Ken</firstname>
    49             <surname>Herdy</surname>
     107            <firstname>Arrvindh</firstname>
     108            <surname>Shriraman</surname>
    50109         </personname>
    51110         <personblurb>
    52             <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
    53                Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
    54                of Science in Computing Science with a Certificate in Spatial Information Systems at
    55                Simon Fraser University in 2005. </para>
    56             <para> Ken is currently pursuing graduate studies in Computing Science at Simon Fraser
    57                University with industrial scholarship support from the Natural Sciences and
    58                Engineering Research Council of Canada, the Mathematics of Information Technology and
    59                Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
    60                of the principal techniques that may be used to improve XML processing performance in
    61                the context of the Geography Markup Language (GML). </para>
    62 
     111            <para></para>
    63112         </personblurb>
    64113         <affiliation>
    65             <jobtitle>Graduate Student, School of Computing Science</jobtitle>
    66             <orgname>Simon Fraser University </orgname>
     114            <jobtitle></jobtitle>
     115            <orgname></orgname>
    67116         </affiliation>
    68          <email>ksherdy@cs.sfu.ca</email>
    69       </author>
    70       <author>
    71          <personname>
    72             <firstname>Ehsan</firstname>
    73             <surname>Amiri</surname>
    74          </personname>
    75          <personblurb>
    76             <para>Ehsan Amiri is a PhD student of Computer Science at Simon Fraser University.
    77                Before that he studied at Sharif University of Technology, Tehran, Iran. While his
    78                graduate research has been focused on theoretical problems like fingerprinting, Ehsan
    79                has worked on some software projects like development of a multi-node firewall as
    80                well. More recently he has been developing compiler technology for automatic
    81                generation of bit stream processing code. </para>
    82 
    83          </personblurb>
    84          <affiliation>
    85             <jobtitle>Graduate Student, School of Computing Science</jobtitle>
    86             <orgname>Simon Fraser University</orgname>
    87          </affiliation>
    88          <email>eamiri@cs.sfu.ca</email>
     117         <email></email>
    89118      </author>
    90119<!--
     
    97126      <keywordset role="author">
    98127         <keyword/>
    99          <keyword/>
    100          <keyword/>
    101128      </keywordset>
     129
    102130   </info>
    103131   <section>
    104132      <title>Introduction</title>
    105       <para> While particular XML applications may benefit from special-purpose hardware such as XML
    106          chips [<xref linkend="XMLChip09"/>] or appliances [<xref linkend="Datapower09"/>], the bulk
    107          of the world's XML processing workload will continue to be handled by XML software stacks
    108          on commodity processors. Exploiting the SIMD capabilities of such processors such as the
    109          SSE instructions of x86 chips, parallel bit stream technology offers the potential of
    110          dramatic improvement over byte-at-a-time processing for a variety of XML processing tasks.
    111          Character set issues such as Unicode validation and transcoding [<xref linkend="PPoPP08"
    112          />], normalization of line breaks and white space and XML character validation can be
    113          handled fully in parallel using this representation. Lexical item streams, such as the bit
    114          stream marking the positions of opening angle brackets, can also be formed in parallel.
    115          Bit-scan instructions of commodity processors may then be used on lexical item streams to
    116          implement rapid single-instruction scanning across variable-length multi-byte text blocks
    117          as in the Parabix XML parser [<xref linkend="CASCON08"/>]. Overall, these techniques may be
    118          combined to yield end-to-end performance that may be 1.5X to 15X faster than alternatives
    119             [<xref linkend="SVGOpen08"/>].</para>
    120       <para>Continued research in parallel bit stream techniques as well as more conventional
    121          application of SIMD techniques in XML processing offers further prospects for improvement
    122          of core XML components as well as for tackling performance-critical tasks further up the
    123          stack. A newly prototyped technique for parallel tag parsing using bitstream addition is
    124          expected to improve parsing performance even beyond that achieved using sequential bit
    125          scans. Several techniques for improved symbol table performance are being investigated,
    126          including parallel hash value calculation and length-based sorting using the cheap length
    127          determination afforded by bit scans. To deliver the benefits of parallel bit stream
    128          technology to the Java world, we are developing Array Set Model (ASM) representations of
    129          XML Infoset and other XML information models for efficient transmission across the JNI
    130          boundary.</para>
    131 
    132       <para>Amplifying these software advances, continuing hardware advances in commodity processors
    133          increase the relative advantage of parallel bit stream techniques over traditional
    134          byte-at-a-time processors. For example, the Intel Core architecture improved SSE processing
    135          to give superscalar execution of bitwise logic operations (3 instructions per cycle vs. 1
    136          in Pentium 4). Upcoming 256-bit AVX technology extends the register set and replaces
    137          destructive two-operand instructions with a nondestructive three-operand form. General
    138          purpose programming on graphic processing units (GPGPU) such as the upcoming 512-bit
    139          Larrabee processor may also be useful for XML applications using parallel bit streams. New
    140          instruction set architectures may also offer dramatic improvements in core algorithms.
    141          Using the relatively simple extensions to support the principle of inductive doubling, a 3X
    142          improvement in several core parallel bit stream algorithms may be achieved [<xref
    143             linkend="ASPLOS09"/>]. Other possibilities include direct implementation of parallel
    144          extract and parallel deposit (pex/pdep) instructions [<xref linkend="Pex06"/>], and
    145          bit-level interleave operations as in Larrabee, each of which would have important
    146          application to parallel bit stream processing.</para>
    147 
    148       <para>Further prospects for XML performance improvement arise from leveraging the
    149          intraregister parallelism of parallel bit stream technology to exploit the interchip
    150          parallelism of multicore computing. Parallel bit stream techniques can support multicore
    151          parallelism in both data partitioning and task partitioning models. For example, the
    152          datasection partitioning approach of Wu, Zhang, Yu and Li may be used to partition blocks
    153          for speculative parallel parsing on separate cores followed by a postprocessing step to
    154          join partial S-trees [<xref linkend="Wu08"/>].</para>
    155 
    156       <para>In our view, the established and expected performance advantages of parallel bit stream
    157          technology over traditional byte-at-a-time processing are so compelling that parallel bit
    158          stream technology should ultimately form the foundation of every high-performance XML
    159          software stack. We envision a common high-performance XML kernel that may be customized to
    160          a variety of processor architectures and that supports a wide range of existing and new XML
    161          APIs. Widespread deployment of this technology should greatly benefit the XML community in
    162          addressing both the deserved and undeserved criticism of XML on performance grounds. A
    163          further benefit of improved performance is a substantial greening of XML technologies.</para>
    164 
    165       <para>To complement our research program investigating fundamental algorithms and issues in
    166          high-performance XML processing, our work also involves development of open source software
    167          implementing these algorithms, with a goal of full conformance to relevant specifications.
    168          From the research perspective, this approach is valuable in ensuring that the full
    169          complexity of required XML processing is addressed in reporting and assessing processing
    170          results. However, our goal is also to use this open source software as a basis of
    171          technology transfer. A Simon Fraser University spin-off company, called International
    172          Characters, Inc., has been created to commercialize the results of this work using a
    173          patent-based open source model.</para>
    174 
    175       <para>To date, we have not yet been successful in establishing a broader community of
    176          participation with our open source code base. Within open-source communities, there is
    177          often a general antipathy towards software patents; this may limit engagement with our
    178          technology, even though it has been dedicated for free use in open source. </para>
    179 
    180       <para>A further complication is the inherent difficulty of SIMD programming in general, and
    181          parallel bit stream programming in particular. Considerable work is required with each new
    182          algorithmic technique being investigated as well as in retargetting our techniques for each
    183          new development in SIMD and multicore processor technologies. To address these concerns, we
    184          have increasingly shifted the emphasis of our research program towards compiler technology
    185          capable of generating parallel bit stream code from higher-level specifications.</para>
     133      <para></para>
     134      <para></para>
     135      <para></para>
     136      <para></para>
    186137   </section>
    187138
    188139   <section>
    189       <title>A Catalog of Parallel Bit Streams for XML</title>
     140      <title>Background</title>
    190141      <section>
    191          <title>Introduction</title>
    192          <para>In this section, we introduce the fundamental concepts of parallel bit stream
    193             technology and present a comprehensive catalog of parallel bit streams for use in XML
    194             processing. In presenting this catalog, the focus is on the specification of the bit
    195             streams as data streams in one-to-one correspondence with the character code units of an
    196             input XML stream. The goal is to define these bit streams in the abstract without
    197             initially considering memory layouts, register widths or other issues related to
    198             particular target architectures. In cataloging these techniques, we also hope to convey
    199             a sense of the breadth of applications of parallel bit stream technology to XML
    200             processing tasks. </para>
    201       </section>
    202 
    203       <section>
    204          <title>Basis Bit Streams</title>
    205          <para>Given a byte-oriented text stream represented in UTF-8, for example, we define a
    206             transform representation of this text consisting of a set of eight parallel bit streams
    207             for the individual bits of each byte. Thus, the <code>Bit0</code> stream is the stream
    208             of bits consisting of bit 0 of each byte in the input byte stream, <code>Bit1</code> is
    209             the bit stream consisting of bit 1 of each byte in the input stream and so on. The set
    210             of streams <code>Bit0</code> through <code>Bit7</code> are known as the <emphasis>basis
    211                streams</emphasis> of the parallel bit stream representation. The following table
    212             shows an example XML character stream together with its representation as a set of 8
    213             basis streams. <table>
    214                <caption>
    215                   <para>XML Character Stream Transposition.</para>
    216                </caption>
    217                <colgroup>
    218                   <col align="left" valign="top"/>
    219                   <col align="left" valign="top"/>
    220                   <col align="left" valign="top"/>
    221                   <col align="left" valign="top"/>
    222                   <col align="left" valign="top"/>
    223                   <col align="left" valign="top"/>
    224                </colgroup>
    225                <tbody>
    226                   <tr valign="top">
    227                      <td>Input Data</td>
    228                      <td>
    229                         <code>&lt;</code>
    230                      </td>
    231                      <td>
    232                         <code>t</code>
    233                      </td>
    234                      <td>
    235                         <code>a</code>
    236                      </td>
    237                      <td>
    238                         <code>g</code>
    239                      </td>
    240                      <td>
    241                         <code>/</code>
    242                      </td>
    243                      <td>
    244                         <code>&gt;</code>
    245                      </td>
    246                   </tr>
    247                   <tr valign="top">
    248                      <td>ASCII</td>
    249                      <td>
    250                         <code>00111100</code>
    251                      </td>
    252                      <td>
    253                         <code>01110100</code>
    254                      </td>
    255                      <td>
    256                         <code>01100001</code>
    257                      </td>
    258                      <td>
    259                         <code>01100111</code>
    260                      </td>
    261                      <td>
    262                         <code>00101111</code>
    263                      </td>
    264                      <td>
    265                         <code>00111110</code>
    266                      </td>
    267                   </tr>
    268                   <tr valign="top">
    269                      <td>Bit0</td>
    270                      <td>
    271                         <code>0</code>
    272                      </td>
    273                      <td>
    274                         <code>0</code>
    275                      </td>
    276                      <td>
    277                         <code>0</code>
    278                      </td>
    279                      <td>
    280                         <code>0</code>
    281                      </td>
    282                      <td>
    283                         <code>0</code>
    284                      </td>
    285                      <td>
    286                         <code>0</code>
    287                      </td>
    288                   </tr>
    289                   <tr valign="top">
    290                      <td>Bit1</td>
    291                      <td>
    292                         <code>0</code>
    293                      </td>
    294                      <td>
    295                         <code>1</code>
    296                      </td>
    297                      <td>
    298                         <code>1</code>
    299                      </td>
    300                      <td>
    301                         <code>1</code>
    302                      </td>
    303                      <td>
    304                         <code>0</code>
    305                      </td>
    306                      <td>
    307                         <code>0</code>
    308                      </td>
    309                   </tr>
    310                   <tr valign="top">
    311                      <td>Bit2</td>
    312                      <td>
    313                         <code>1</code>
    314                      </td>
    315                      <td>
    316                         <code>1</code>
    317                      </td>
    318                      <td>
    319                         <code>1</code>
    320                      </td>
    321                      <td>
    322                         <code>1</code>
    323                      </td>
    324                      <td>
    325                         <code>1</code>
    326                      </td>
    327                      <td>
    328                         <code>1</code>
    329                      </td>
    330                   </tr>
    331                   <tr valign="top">
    332                      <td>Bit3</td>
    333                      <td>
    334                         <code>1</code>
    335                      </td>
    336                      <td>
    337                         <code>1</code>
    338                      </td>
    339                      <td>
    340                         <code>0</code>
    341                      </td>
    342                      <td>
    343                         <code>0</code>
    344                      </td>
    345                      <td>
    346                         <code>0</code>
    347                      </td>
    348                      <td>
    349                         <code>1</code>
    350                      </td>
    351                   </tr>
    352                   <tr valign="top">
    353                      <td>Bit4</td>
    354                      <td>
    355                         <code>1</code>
    356                      </td>
    357                      <td>
    358                         <code>0</code>
    359                      </td>
    360                      <td>
    361                         <code>0</code>
    362                      </td>
    363                      <td>
    364                         <code>0</code>
    365                      </td>
    366                      <td>
    367                         <code>1</code>
    368                      </td>
    369                      <td>
    370                         <code>1</code>
    371                      </td>
    372                   </tr>
    373                   <tr valign="top">
    374                      <td>Bit5</td>
    375                      <td>
    376                         <code>1</code>
    377                      </td>
    378                      <td>
    379                         <code>1</code>
    380                      </td>
    381                      <td>
    382                         <code>0</code>
    383                      </td>
    384                      <td>
    385                         <code>1</code>
    386                      </td>
    387                      <td>
    388                         <code>1</code>
    389                      </td>
    390                      <td>
    391                         <code>1</code>
    392                      </td>
    393                   </tr>
    394                   <tr valign="top">
    395                      <td>Bit6</td>
    396                      <td>
    397                         <code>0</code>
    398                      </td>
    399                      <td>
    400                         <code>0</code>
    401                      </td>
    402                      <td>
    403                         <code>0</code>
    404                      </td>
    405                      <td>
    406                         <code>1</code>
    407                      </td>
    408                      <td>
    409                         <code>1</code>
    410                      </td>
    411                      <td>
    412                         <code>1</code>
    413                      </td>
    414                   </tr>
    415                   <tr valign="top">
    416                      <td>Bit7</td>
    417                      <td>
    418                         <code>0</code>
    419                      </td>
    420                      <td>
    421                         <code>0</code>
    422                      </td>
    423                      <td>
    424                         <code>1</code>
    425                      </td>
    426                      <td>
    427                         <code>1</code>
    428                      </td>
    429                      <td>
    430                         <code>1</code>
    431                      </td>
    432                      <td>
    433                         <code>0</code>
    434                      </td>
    435                   </tr>
    436                </tbody>
    437             </table>
    438          </para>
    439          <para> Depending on the features of a particular processor architecture, there are a number
    440             of algorithms for transposition to parallel bit stream form. Several of these algorithms
    441             employ a three-stage structure. In the first stage, the input byte stream is divided
    442             into a pair of half-length streams consisting of four bits for each byte, for example,
    443             one stream for the high nybble of each byte and another for the low nybble of each byte.
    444             In the second stage, these streams of four bits per byte are each divided into streams
    445             consisting of two bits per original byte, for example streams for the
    446             <code>Bit0/Bit1</code>, <code>Bit2/Bit3</code>, <code>Bit4/Bit5</code>, and
    447                <code>Bit6/Bit7</code> pairs. In the final stage, the streams are further subdivided
    448             in the individual bit streams. </para>
    449          <para> Using SIMD capabilities, this process is quite efficient, with an amortized cost of
    450             1.1 CPU cycles per input byte on Intel Core 2 with SSE, or 0.6 CPU cycles per input byte
    451             on Power PC G4 with Altivec. With future advances in processor technology, this
    452             transposition overhead is expected to reduce, possibly taking advantage of upcoming
    453             parallel extract (pex) instructions on Intel technology. In the ideal, only 24
    454             instructions are needed to transform a block of 128 input bytes using 128-bit SSE
    455             registers using the inductive doubling instruction set architecture, representing an
    456             overhead of less than 0.2 instructions per input byte. </para>
    457       </section>
    458 
    459       <section>
    460          <title>General Streams</title>
    461          <para>This section describes bit streams which support basic processing operations.</para>
    462 
    463          <section>
    464             <title>Deletion Mask Streams</title>
    465             <para>DelMask (deletion mask) streams marks character code unit positions for deletion.
    466                Since the deletion operation is dependency free across many stages of XML processing,
    467                it is possible to simply mark and record deletion positions as deletion mask streams for future processing. A single
    468                invocation of a SIMD based parallel deletion algorithm can then perform the deletion of
    469                positions accumulated across a number of stages through a bitwise ORing of deletion
    470                masks. For example, deletion arises in the replacement of predefined entities with a
    471                single character, such as in the replacement of the &amp;amp; entity, with the
    472                &amp; character. Deletion also arises in XML
    473                end-of-line handling, and CDATA section delimeter processing. Several algorithms to
    474                delete bits at positions marked by DelMask are possible [<xref linkend="u8u16"/>]. </para>
    475             <para>The following table provides an example of generating a DelMask in the context of
    476                bit stream based parsing of well-formed character references and predefined entities.
    477                The result is the generation of a DelMask stream. <table>
    478                   <caption>
    479                      <para>DelMask Stream Generation</para>
    480                   </caption>
    481                   <colgroup>
    482                      <col align="left" valign="top"/>
    483                      <col align="left" valign="top"/>
    484                   </colgroup>
    485                   <tbody>
    486                      <tr valign="top">
    487                         <td>Input Data</td>
    488                         <td>
    489                            <code>&amp;gt; &amp;#13; &amp;#x0a;</code>
    490                         </td>
    491                      </tr>
    492                      <tr valign="top">
    493                         <td>GenRefs</td>
    494                         <td>
    495                            <code>_11______________</code>
    496                         </td>
    497                      </tr>
    498 
    499                      <tr valign="top">
    500                         <td>DecRefs</td>
    501                         <td>
    502                            <code>_______11________</code>
    503                         </td>
    504                      </tr>
    505                      <tr valign="top">
    506                         <td>HexRefs</td>
    507                         <td>
    508                            <code>______________11_</code>
    509                         </td>
    510                      </tr>
    511                      <tr valign="top">
    512                         <td>DelMask</td>
    513                         <td>
    514                            <code>111__1111__11111_</code>
    515                         </td>
    516                      </tr>
    517                      <tr valign="top">
    518                         <td>ErrorFlag</td>
    519                         <td>
    520                            <code>_________________</code>
    521                         </td>
    522                      </tr>
    523                   </tbody>
    524                </table>
    525             </para>
    526          </section>
    527 
    528          <section>
    529             <title>Error Flag Streams </title>
    530             <para>Error flag streams indicates the character code unit positions of syntactical
    531                errors. XML processing examples which benefit from the marking of error positions
    532                include UTF-8 character sequence validation and XML parsing [<xref linkend="u8u16"
    533                />].</para>
    534             <para>The following table provides an example of using bit streams to parse character
    535                references and predefined entities which fail to meet the XML 1.0 well-formedness
    536                constraints. The result is the generation of an error flag stream that marks the
    537                positions of mal-formed decimal and hexical character references respectively. <table>
    538                   <caption>
    539                      <para>Error Flag Stream Generation</para>
    540                   </caption>
    541                   <colgroup>
    542                      <col align="left" valign="top"/>
    543                      <col align="left" valign="top"/>
    544                   </colgroup>
    545                   <tbody>
    546                      <tr valign="top">
    547                         <td>Input Data</td>
    548                         <td>
    549                            <code>&amp;gt; &amp;#, &amp;#x; </code>
    550                         </td>
    551                      </tr>
    552                      <tr valign="top">
    553                         <td>GenRefs</td>
    554                         <td>
    555                            <code>_11___________</code>
    556                         </td>
    557                      </tr>
    558                      <tr valign="top">
    559                         <td>DecRefs</td>
    560                         <td>
    561                            <code>______________</code>
    562                         </td>
    563                      </tr>
    564                      <tr valign="top">
    565                         <td>HexRefs</td>
    566                         <td>
    567                            <code>______________</code>
    568                         </td>
    569                      </tr>
    570                      <tr valign="top">
    571                         <td>DelMask</td>
    572                         <td>
    573                            <code>111__11__111__</code>
    574                         </td>
    575                      </tr>
    576                      <tr valign="top">
    577                         <td>ErrorFlag</td>
    578                         <td>
    579                            <code>_______1____1_</code>
    580                         </td>
    581                      </tr>
    582                   </tbody>
    583                </table>
    584             </para>
    585          </section>
    586 
    587       </section>
    588 
    589       <section>
    590          <title>Lexical Item Streams</title>
    591          <para>Lexical item streams differ from traditional streams of tokens in that they are bit
    592             streams that mark the positions of tokens, whitespace or delimiters. Additional bit
    593             streams, such as the reference streams and callout streams, are subsequently constructed
    594             based on the information held within the set of lexical items streams. Differentiation
    595             between the actual tokens that may occur at a particular point (e.g., the different XML
    596             tokens that begin “&lt;”) may be performed using multicharacter recognizers on the
    597             bytestream representation [<xref linkend="CASCON08"/>].</para>
    598          <para>A key role of lexical item streams in XML parsing is to facilitate fast scanning
    599             operations. For example, a left angle bracket lexical item stream may be formed to
    600             identify those character code unit positions at which a “&lt;” character occurs.
    601             Hardware register bit scan operations may then be used by the XML parser on the left
    602             angle bracket stream to efficiently identify the position of the next “&lt;”. Based
    603             on the capabilities of current commodity processors, a single register bit scan
    604             operation may effectively scan up to 64 byte positions with a single instruction.</para>
    605          <para>Overall, the construction of the full set of lexical item stream computations
    606             requires approximately 1.0 CPU cycles per byte when implemented for 128 positions at a
    607             time using 128-bit SSE registers on Intel Core2 processors [<xref linkend="CASCON08"/>].
    608             The following table defines the core lexical item streams defined by the Parabix XML
    609             parser.</para>
    610          <para>
    611             <table>
    612                <caption>
    613                   <para>Lexical item stream descriptions.</para>
    614                </caption>
    615                <tbody>
    616                   <tr>
    617                      <td align="left"> LAngle </td>
    618                      <td align="left"> Marks the position of any left angle bracket character.</td>
    619                   </tr>
    620                   <tr>
    621                      <td align="left"> RAngle </td>
    622                      <td align="left"> Marks the position of any right angle bracket character.</td>
    623                   </tr>
    624                   <tr>
    625                      <td align="left"> LBracket </td>
    626                      <td align="left"> Marks the position of any left square bracker character.</td>
    627                   </tr>
    628                   <tr>
    629                      <td align="left"> RBracket </td>
    630                      <td align="left"> Marks the position of any right square bracket
    631                      character.</td>
    632                   </tr>
    633                   <tr>
    634                      <td align="left"> Exclam </td>
    635                      <td align="left"> Marks the position of any exclamation mark character.</td>
    636                   </tr>
    637                   <tr>
    638                      <td align="left"> QMark </td>
    639                      <td align="left"> Marks the position of any question mark character.</td>
    640                   </tr>
    641                   <tr>
    642                      <td align="left"> Hyphen </td>
    643                      <td align="left"> Marks the position of any hyphen character.</td>
    644                   </tr>
    645                   <tr>
    646                      <td align="left"> Equals </td>
    647                      <td align="left"> Marks the position of any equal sign character.</td>
    648                   </tr>
    649                   <tr>
    650                      <td align="left"> SQuote </td>
    651                      <td align="left"> Marks the position of any single quote character.</td>
    652                   </tr>
    653                   <tr>
    654                      <td align="left"> DQuote </td>
    655                      <td align="left"> Marks the position of any double quote character.</td>
    656                   </tr>
    657                   <tr>
    658                      <td align="left"> Slash </td>
    659                      <td align="left"> Marks the position of any forward slash character</td>
    660                   </tr>
    661                   <tr>
    662                      <td align="left"> NameScan </td>
    663                      <td align="left"> Marks the position of any XML name character.</td>
    664                   </tr>
    665                   <tr>
    666                      <td align="left"> WS </td>
    667                      <td align="left"> Marks the position of any XML 1.0 whitespace character.</td>
    668                   </tr>
    669                   <tr>
    670                      <td align="left"> PI_start </td>
    671                      <td align="left"> Marks the position of the start of any processing instruction
    672                         at the '?' character position.</td>
    673                   </tr>
    674                   <tr>
    675                      <td align="left"> PI_end </td>
    676                      <td align="left"> Marks the position of any end of any processing instruction
    677                         at the '>' character position.</td>
    678                   </tr>
    679                   <tr>
    680                      <td align="left"> CtCD_start </td>
    681                      <td align="left"> Marks the position of the start of any comment or CDATA
    682                         section at the '!' character position.</td>
    683                   </tr>
    684                   <tr>
    685                      <td align="left"> EndTag_start </td>
    686                      <td align="left"> Marks the position of any end tag at the '/' character
    687                         position.</td>
    688                   </tr>
    689                   <tr>
    690                      <td align="left"> CD_end </td>
    691                      <td align="left"> Marks the position of the end of any CDATA section at the '>'
    692                         character position. </td>
    693                   </tr>
    694                   <tr>
    695                      <td align="left"> DoubleHyphen </td>
    696                      <td align="left"> Marks the position of any double hyphen character.</td>
    697                   </tr>
    698                   <tr>
    699                      <td align="left"> RefStart </td>
    700                      <td align="left"> Marks the position of any ampersand character.</td>
    701                   </tr>
    702                   <tr>
    703                      <td align="left"> Hash </td>
    704                      <td align="left"> Marks the position of any hash character.</td>
    705                   </tr>
    706                   <tr>
    707                      <td align="left"> x </td>
    708                      <td align="left"> Marks the position of any 'x' character.</td>
    709                   </tr>
    710                   <tr>
    711                      <td align="left"> Digit </td>
    712                      <td align="left"> Marks the position of any digit.</td>
    713                   </tr>
    714                   <tr>
    715                      <td align="left"> Hex </td>
    716                      <td align="left"> Marks the position of any hexidecimal character.</td>
    717                   </tr>
    718                   <tr>
    719                      <td align="left"> Semicolon </td>
    720                      <td align="left"> Marks the position of any semicolon character.</td>
    721                   </tr>
    722                </tbody>
    723             </table>
    724          </para>
    725          <para> The following illustrates a number of the lexical item streams. </para>
    726          <para>
    727             <table>
    728                <caption>
    729                   <para>Lexical Item Streams</para>
    730                </caption>
    731                <colgroup>
    732                   <col align="left" valign="top"/>
    733                   <col align="left" valign="top"/>
    734                </colgroup>
    735                <tbody>
    736                   <tr valign="top">
    737                      <td>Input Data</td>
    738                      <td>
    739                         <code>&lt;tag&gt;&lt;tag&gt; text &amp;lt;
    740                            &amp;#x3e; &lt;/tag&gt;&lt;/tag&gt;</code>
    741                      </td>
    742                   </tr>
    743 
    744                   <tr valign="top">
    745                      <td>LAngle</td>
    746                      <td>
    747                         <code>1____1______________________1_____1_____</code>
    748                      </td>
    749                   </tr>
    750                   <tr valign="top">
    751                      <td>RAngle</td>
    752                      <td>
    753                         <code>____1____1_______________________1_____1</code>
    754                      </td>
    755                   </tr>
    756                   <tr valign="top">
    757                      <td>WS</td>
    758                      <td>
    759                         <code>__________1____1____1______1____________</code>
    760                      </td>
    761                   </tr>
    762                   <tr valign="top">
    763                      <td>RefStart</td>
    764                      <td>
    765                         <code>________________1____1__________________</code>
    766                      </td>
    767                   </tr>
    768                   <tr valign="top">
    769                      <td>Hex</td>
    770                      <td>
    771                         <code>__1____1____1___________11_____1_____1__</code>
    772                      </td>
    773                   </tr>
    774 
    775                   <tr valign="top">
    776                      <td>Semicolon</td>
    777                      <td>
    778                         <code>___________________1______1_____________</code>
    779                      </td>
    780                   </tr>
    781                   <tr valign="top">
    782                      <td>Slash</td>
    783                      <td>
    784                         <code>_____________________________1_____1____</code>
    785                      </td>
    786                   </tr>
    787                </tbody>
    788             </table>
    789          </para>
    790 
    791       </section>
    792 
    793       <section>
    794          <title>UTF-8 Byte Classification, Scope and Validation Streams</title>
    795          <para> An XML parser must accept the UTF-8 encoding of Unicode [<xref linkend="XML10"/>].
    796             It is a fatal error if an XML document determined to be in UTF-8 contains byte sequences
    797             that are not legal in that encoding. UTF-8 byte classification, scope, XML character
    798             validation and error flag bit streams are defined to validate UTF-8 byte sequences and
    799             support transcoding to UTF-16.</para>
    800 
    801          <section>
    802             <title>UTF-8 Byte Classification Streams</title>
    803             <para>UTF-8 byte classification bit streams classify UTF-8 bytes based on their role in
    804                forming single and multibyte sequences. The u8Prefix and u8Suffix bit streams
    805                identify bytes that represent, respectively, prefix or suffix bytes of multibyte
    806                UTF-8 sequences. The u8UniByte bit stream identifies those bytes that may be
    807                considered single-byte sequences. The u8Prefix2, u8Prefix3, and u8Prefix4 refine the
    808                u8Prefix respectively indicating prefixes of two, three or four byte
    809             sequences respectively.</para>
    810          </section>
    811 
    812          <section>
    813             <title>UTF-8 Scope Streams</title>
    814             <para> Scope streams represent expectations established by UTF-8 prefix bytes. For
    815                example, the u8Scope22 bit stream represents the positions at which the second byte of a
    816                two-byte sequence is expected based on the occurrence of a two-byte prefix in the
    817                immediately preceding position. The u8scope32, u8Scope33, u8Scope42, u8scope43, and
    818                u8Scope44 complete the set of UTF-8 scope streams.</para>
    819             <para> The following example demonstrates the UTF-8 character encoding validation
    820                process using parallel bit stream techniques. The result of this validation process
    821                is an error flag stream identifying the positions at which errors occur.</para>
    822             <para>
    823                <table>
    824                   <caption>
    825                      <para>UTF-8 Scope Streams</para>
    826                   </caption>
    827                   <colgroup>
    828                      <col align="left" valign="top"/>
    829                      <col align="left" valign="top"/>
    830                   </colgroup>
    831                   <tbody> <tr valign="top"><td>Input Data</td><td><code>A Text in Farsi:&#160;ى&#160;ك&#160;&#160;Ù
    832 &#160;ت&#160;ن&#160;&#160;ف&#160;ا&#160;ر&#160;س&#160;ى</code></td></tr>
    833                      
    834                      <tr valign="top">
    835                         <td>High Nybbles</td>
    836                         <td>
    837                            <code>42567726624677632D8DBDBDAD82D8DAD82D8D8</code>
    838                         </td>
    839                      </tr>
    840                      <tr valign="top">
    841                         <td>Low Nybbles</td>
    842                        
    843                         <td>
    844                            <code>10458409E061239A099838187910968A9509399</code>
    845                         </td>
    846                      </tr>
    847                      <tr valign="top">
    848                         <td>u8Unibyte</td>
    849                         <td>
    850                            <code>11111111111111111__________1______1____</code>
    851                            
    852                         </td>
    853                      </tr>
    854                      <tr valign="top">
    855                         <td>u8Prefix</td>
    856                         <td>
    857                            <code>_________________1_1_1_1_1__1_1_1__1_1_</code>
    858                         </td>
    859                      </tr>
    860                      
    861                      <tr valign="top">
    862                         <td>u8Suffix</td>
    863                         <td>
    864                            <code>__________________1_1_1_1_1__1_1_1__1_1</code>
    865                         </td>
    866                      </tr>
    867                      <tr valign="top">
    868                         <td>u8Prefix2</td>
    869                        
    870                         <td>
    871                            <code>_________________1_1_1_1_1__1_1_1__1_1_</code>
    872                         </td>
    873                      </tr>
    874                      <tr valign="top">
    875                         <td>u8Scope22</td>
    876                         <td>
    877                            <code>__________________1_1_1_1_1__1_1_1__1_1</code>
    878                            
    879                         </td>
    880                      </tr>
    881                      <tr valign="top">
    882                         <td>ErrorFlag</td>
    883                         <td>
    884                            <code>_______________________________________</code>
    885                         </td>
    886                      </tr>
    887                      
    888                   </tbody>
    889                </table>
    890                
    891 
    892             </para>
    893          </section>
    894 
    895          <section>
    896             <title>UTF-8 Validation Streams</title>
    897             <para> Proper formation of UTF-8 byte sequences requires that the correct number of
    898                suffix bytes always follow a UTF-8 prefix byte, and that certain illegal byte
    899                combinations are ruled out. For example, sequences beginning with the prefix bytes
    900                0xF5 through 0xFF are illegal as they would represent code point values above 10FFFF.
    901                In addition, there are constraints on the first suffix byte following certain special
    902                prefixes, namely that a suffix following the prefix 0xE0 must fall in the range
    903                0xA0–0xBF, a suffix following the prefix 0xED must fall in the range 0x80–0x9F, a
    904                suffix following the prefix 0xF0 must fall in the range 0x90–0xBF and a suffix
    905                following the prefix 0xF4 must fall in the range 0x80–0x8F. The task of ensuring that
    906                each of these constraints hold is known as UTF-8 validation. The bit streams xE0,
    907                xED, xF0, xF4, xA0_xBF, x80_x9F, x90_xBF, and x80_x8F are constructed to flag the
    908                aforementioned UTF-8 validation errors. The result of UTF-8 validation is a UTF-8
    909                error flag bit stream contructed as the ORing of a series of UTF-8 validation tests.
    910             </para>
    911          </section>
    912 
    913          <section>
    914             <title>XML Character Validation Streams</title>
    915             <para>The UTF-8 character sequences <emphasis>0xEF 0xBF 0xBF</emphasis> and
    916                   <emphasis>0xEF 0xBF 0xBE</emphasis> correspond to the Unicode code points 0xFFFE
    917                and 0xFFFF respectively. In XML 1.0, 0xFFFE and 0xFFFF represent characters outside
    918                the legal XML character ranges. As such, bit streams which mark 0xEF, 0xBF, and 0xBE
    919                character are constructed to flag illegal UTF-8 character sequences. </para>
    920          </section>
    921 
    922          <section>
    923             <title>UTF-8 to UTF-16 Transcoding</title>
    924             <para>UTF-8 is often preferred for storage and data exchange, it is suitable for
    925                processing, but it is significantly more complex to process than UTF-16 [<xref
    926                   linkend="Unicode"/>]. As such, XML documents are typically encoded in UTF-8 for
    927                serialization and transport, and subsequently transcoded to UTF-16 for processing
    928                with programming languages such as Java and C#. Following the parallel bit stream
    929                methods developed for the u8u16 transcoder, a high-performance standalone UTF-8 to
    930                UTF-16 transcoder [<xref linkend="u8u16"/>], transcoding to UTF-16 may be achieved by
    931                computing a series of 16 bit streams. One stream for each of the individual bits of a
    932                UTF-16 code unit. </para>
    933             <para>The bit streams for UTF-16 are conveniently divided into groups: the eight streams
    934                u16Hi0, u16Hi1, ..., u16Hi7 for the high byte of each UTF-16 code unit and the eight
    935                streams u16Lo1, ..., u16Lo7 for the low byte. Upon conversion of the parallel bit
    936                stream data back to byte streams, eight sequential byte streams U16h0, U16h1, ...,
    937                U16Hi7 are used for the high byte of each UTF-16 code unit, while U16Lo0, U16Lo1,...,
    938                U16Lo7 are used for the corresponding low byte. Interleaving these streams then
    939                produces the full UTF-16 doublebyte stream.</para>
    940          </section>
    941 
    942          <section>
    943             <title>UTF-8 Indexed UTF-16 Streams</title>
    944             <para>UTF-16 bit streams are initially defined in UTF-8 indexed form. That is, with sets
    945                of bits in one-to-one correspondence with UTF-8 bytes. However, only one set of
    946                UTF-16 bits is required for encoding two or three-byte UTF-8 sequences and only two
    947                sets are required for surrogate pairs corresponding to four-byte UTF-8 sequences. The
    948                u8LastByte (u8UniByte , u8Scope22 , u8Scope33 , and u8Scope44 ) and u8Scope42 streams
    949                mark the positions at which the correct UTF-16 bits are computed. The bit sets at
    950                other positions must be deleted to compress the streams to the UTF-16 indexed form.
    951             </para>
    952          </section>
    953       </section>
    954 
    955       <section>
    956          <title>Control Character Streams</title>
    957          <para>The control character bit streams marks ASCII control characters in the range
    958             0x00-0x1F. Additional control character bit streams mark the tab, carriage return, line
    959             feed, and space character. In addition, a bit stream to mark carriage return line
    960             combinations is also constructed. Presently, control character bit streams support the
    961             operations of XML 1.0 character validation and XML end-of-line handling.</para>
    962 
    963          <section>
    964             <title>XML Character Validation</title>
    965             <para>Legal characters in XML are the tab, carriage return, and line feed characters,
    966                together with all Unicode characters and excluding the surrogate blocks, as well as hexadecimal OxFFFE and
    967                OxFFFF [<xref linkend="XML10"/>]. The x00_x1F bit stream is constructed and used in
    968                combination with the additional control character bit streams to flags the positions
    969                of illegal control characters.</para>
    970          </section>
    971 
    972          <section>
    973             <title>XML 1.0 End-of-line Handling</title>
    974             <para>In XML 1.0 the two-character sequence CR LF (carriage return, line feed) as well as
    975                any CR character not followed by a LF character must be converted to a single LF
    976                character [<xref linkend="XML10"/>].</para>
    977             <para>By defining carriage return, line feed, and carriage return line feed bit streams,
    978                dentoted CR, LF and CRLF respectively, end-of-line normalization processing can be
    979                performed in parallel using only a small number of logical and shift operations.</para>
    980             <para/>
    981             <para>The following example demonstrates the generation of the CRLF deletion mask. In
    982                this example, the position of all CR characters followed by LF characters are marked
    983                for deletion. Isolated carriage returns are then replaced with LF characters.
    984                Completion of this process satisfies the XML 1.0 end-of-line handling requirements.
    985                For clarity, this example encodes input data carriage returns as
    986                <emphasis>C</emphasis> characters, whereas line feed characters are shown as
    987                   <emphasis>L</emphasis> characters.</para>
    988             <para>
    989                <table>
    990                   <caption>
    991                      <para>XML 1.0 End-of-line Handling</para>
    992                   </caption>
    993                   <colgroup>
    994                      <col align="left" valign="top"/>
    995                   </colgroup>
    996                   <tbody>
    997                      <tr valign="top">
    998                         <td>Input Data</td>
    999                         <td>
    1000                            <code>first line C second line CL third line L one more C nothing
    1001                            left</code>
    1002                         </td>
    1003                      </tr>
    1004                      <tr valign="top">
    1005                         <td>CR</td>
    1006                         <td>
    1007                            <code>-----------1-------------1------------------------1-------------</code>
    1008                         </td>
    1009                      </tr>
    1010                      <tr valign="top">
    1011                         <td>LF</td>
    1012                         <td>
    1013                            <code>--------------------------1------------1------------------------</code>
    1014                         </td>
    1015                      </tr>
    1016                      <tr valign="top">
    1017                         <td>DelMask</td>
    1018                         <td>
    1019                            <code>--------------------------1-------------------------------------</code>
    1020                         </td>
    1021                      </tr>
    1022                   </tbody>
    1023                </table>
    1024 
    1025             </para>
    1026          </section>
    1027 
    1028       </section>
    1029 
    1030       <section>
    1031          <title>Call Out Streams</title>
    1032          <para> Call out bit streams mark the extents of XML markup structures such as comments,
    1033             processing instruction and CDATA sections as well as physical structures such as character and
    1034             entity references and general references.  Call out streams are also formed for logical markup structures such
    1035             start tags, end tags and empty element tags. </para>
    1036          <section>
    1037             <title>Comment, Processing Instruction and CDATA Section Call Out Streams</title>
    1038             <para>Comments, processing instructions and CDATA sections call out streams, Ct_Span,
    1039                PI_Span and CD_Span respectively, define sections of an XML document which
    1040                contain markup that is not interpreted by an XML processor. As such, the union of
    1041                Ct_Span, PI_Span and CD_Span streams defines the regions of non-interpreteable markup.
    1042                The stream formed by this union is termed the CtCDPI_Mask.</para>
    1043             <para>The following tables provides an example of constructing the CtCDPI_Mask. </para>
    1044             <table>
    1045                <caption>
    1046                   <para>CtCDPI Mask Generation</para>
    1047                </caption>
    1048                <colgroup><col align="left" valign="top" /><col align="left" valign="top" /></colgroup>
    1049                <tbody> <tr valign="top"><td>Input Data</td><td><code>&lt;?php?&gt; &lt;!-- example --&gt; &lt;![CDATA[ shift: a&lt;&lt;1 ]]&gt;</code></td></tr>
    1050                  
    1051                   <tr valign="top"><td>CD_Span</td><td><code>___________________________1111111111111111111111_</code></td></tr>
    1052                   <tr valign="top"><td>Ct_Span</td><td><code>___________111111111111___________________________</code></td></tr>
    1053                   <tr valign="top"><td>PI_Span</td><td><code>_11111____________________________________________</code></td></tr>
    1054                   <tr valign="top"><td>CtCDPI_Mask</td><td><code>_111111__111111111111111__111111111111111111111111</code></td></tr>
    1055                   <tr valign="top"><td>ErrorFlag</td><td><code>__________________________________________________</code></td></tr>
    1056                  
    1057                </tbody>
    1058             </table>
    1059            
    1060            
    1061            
    1062            
    1063             <para> With the removal of all non-interpreteable markup, several phases of parallel bit
    1064                stream based SIMD operations may follow operating on up to 128 byte positions on
    1065                current commondity processors and assured of XML markup relevancy. For
    1066                example, with the extents identification of comments, processing instructions and
    1067                CDATA secions, XML names may be identified and length sorted for efficient symbol
    1068                table construction. </para>
    1069             <para> As an aside, comments and CDATA sections must first be validated to ensure
    1070                that comments do not contain "--" sequences and that CDATA sections do not contain illegal
    1071                "]]&gt;" sequences prior to ignorable markup stream generation.</para>
    1072          </section>
    1073 
    1074          <section>
    1075             <title>Reference Call Out Streams</title>
    1076             <para>The reference call out streams are the GenRefs, DecRefs, and HexRefs streams. This
    1077                subset of the call out streams marks the extents of all but the closing semicolon of
    1078                general and character references.</para>
    1079             <para>Predefined character
    1080                (<![CDATA[&lt;,&gt;,&amp;,&apos;,&quot;]]>) and numeric character
    1081                references (&amp;#nnnn;, &amp;#xhhhh;) must be replaced by a single character
    1082                   [<xref linkend="XML10"/>]. As previously shown, this subset of call out streams enables the construction of a DelMask for
    1083                references.</para>
    1084          </section>
    1085 
    1086          <section>
    1087             <title>Tag Call Out Streams</title>
    1088             <para>Whereas sequential bit scans over lexical item streams form the basis of XML
    1089                parsing, in the current Parabix parser a new method of parallel parsing has been
    1090                developed and prototyped using the concept of bitstream addition. Fundamental to this
    1091                method is the concept of a <emphasis>cursor</emphasis> stream, a bit stream marking
    1092                the positions of multiple parallel parses currently in process. </para>
    1093             <para>The results of parallel parsing using the bit stream addition technique produces a
    1094                set of tag call out bit streams. These streams mark the extents of each start tag,
    1095                end tag and empty element tag. Within tags, additional streams mark start
    1096                and end positions for tag names, as well as attribute names and values. An error flag
    1097                stream marks the positions of any syntactic errors encountered during parsing.</para>
    1098             <para> The set of tag call out streams consists of the ElemNames, AttNames, AttVals, Tags,
    1099                EmptyTagEnds and EndTags bit streams. The following example demonstrates the bit
    1100                stream output produced which from parallel parsing using bit stream addition. </para>
    1101             <table>
    1102                <caption>
    1103                   <para>Tag Call Out Streams</para>
    1104                </caption>
    1105                <colgroup>
    1106                   <col align="left" valign="top"/>
    1107                   <col align="left" valign="top"/>
    1108                </colgroup>
    1109                <tbody>
    1110                   <tr valign="top">
    1111                      <td>Input Data</td>
    1112                      <td>
    1113                         <code>&lt;root&gt;&lt;t1&gt;text&lt;/t1&gt;&lt;t2
    1114                            a1=&apos;foo&apos; a2 =
    1115                            &apos;fie&apos;&gt;more&lt;/t2&gt;&lt;tag3
    1116                            att3=&apos;b&apos;/&gt;&lt;/root&gt;</code>
    1117                      </td>
    1118                   </tr>
    1119 
    1120                   <tr valign="top">
    1121                      <td>ElemNames</td>
    1122                      <td>
    1123                         <code>_1111__11___________11_______________________________1111__________________</code>
    1124                      </td>
    1125                   </tr>
    1126                   <tr valign="top">
    1127                      <td>AttNames</td>
    1128                      <td>
    1129                         <code>_______________________11_______11________________________1111_____________</code>
    1130                      </td>
    1131                   </tr>
    1132                   <tr valign="top">
    1133                      <td>AttrVals</td>
    1134                      <td>
    1135                         <code>__________________________11111______11111_____________________111_________</code>
    1136                      </td>
    1137                   </tr>
    1138                   <tr valign="top">
    1139                      <td>EmptyTagEnds</td>
    1140                      <td>
    1141                         <code>___________________________________________________________________1_______</code>
    1142                      </td>
    1143                   </tr>
    1144                   <tr valign="top">
    1145                      <td>EndTags</td>
    1146                      <td>
    1147                         <code>_______________111______________________________111__________________11111_</code>
    1148                      </td>
    1149                   </tr>
    1150 
    1151                   <tr valign="top">
    1152                      <td>Start/EmptyTags</td>
    1153                      <td>
    1154                         <code>_1111__11___________1111111111111111111111___________11111111111111________</code>
    1155                      </td>
    1156                   </tr>
    1157                   <tr valign="top">
    1158                      <td>ErrorFlag</td>
    1159                      <td>
    1160                         <code>___________________________________________________________________________</code>
    1161                      </td>
    1162                   </tr>
    1163                </tbody>
    1164             </table>
    1165 
    1166          </section>
    1167       </section>
    1168    </section>
    1169    <section>
    1170       <title>SIMD Beyond Bitstreams: Names and Numbers</title>
    1171 
    1172       <para>Whereas the fundamental innovation of our work is the use of SIMD technology in
    1173          implementing parallel bit streams for XML, there are also important ways in which more
    1174          traditional byte-oriented SIMD operations can be useful in accelerating other aspects of
    1175          XML processing.</para>
    1176 
    1177       <section>
    1178          <title>Name Lookup</title>
    1179          <para>Efficient symbol table mechanisms for looking up element and attribute names is
    1180             important for almost all XML processing applications. It is also an important technique
    1181             merely for assessing well-formedness of an XML document; rather than validating the
    1182             character-by-character composition of each occurrence of an XML name as it is
    1183             encountered, it is more efficient to validate all but the first occurrence by first
    1184             determining whether the name already exists in a table of prevalidated names.</para>
    1185 
    1186          <para>The first symbol table mechanism deployed in the Parabix parser simply used the
    1187             hashmaps of the C++ standard template library, without deploying any SIMD technology.
    1188             However, with the overhead of character validation, transcoding and parsing dramatically
    1189             reduced by parallel bit stream technology, we found that symbol lookups then accounted
    1190             for about half of the remaining execution time in a statistics gathering application
    1191                [<xref linkend="CASCON08"/>]. Thus, symbol table processing was identified as a major
    1192             target for further performance improvement. </para>
    1193          <para> Our first effort to improve symbol table performance was to employ the splash tables
    1194             with cuckoo hashing as described by Ross [<xref linkend="Ross06"/>], using SIMD
    1195             technology for parallel bucket processing. Although this technique did turn out to have
    1196             the advantage of virtually constant-time performance even for very large vocabularies,
    1197             it was not particularly helpful for the relatively small vocabularies typically found in
    1198             XML document processing. </para>
    1199          <para> However, a second approach has been found to be quite useful, taking advantage of
    1200             parallel bit streams for cheap determination of symbol length. In essence, the length of
    1201             a name can be determined very cheaply using a single bit scan operation. This then makes
    1202             it possible to use length-sorted symbol table processing, as follows. First, the
    1203             occurrences of all names are stored in arrays indexed by length. Then the length-sorted
    1204             arrays may each be inserted into the symbol table in turn. The advantage of this is that
    1205             a separate loop may be written for each length. Length sorting makes for very efficient
    1206             name processing. For example hash value computations and name comparisons can be made by
    1207             loading multibyte values and performing appropriate shifting and masking operations,
    1208             without the need for a byte-at-a-time loop. In initial experiments, this length-sorting
    1209             approach was found to reduce symbol lookup cost by a factor of two. </para>
    1210          <para> Current research includes the application of SIMD technology to further enhance the
    1211             performance of length-sorted lookup. We have identified a promising technique for
    1212             parallel processing of multiple name occurrences using a parallel trie lookup technique.
    1213             Given an array of occurrences of names of a particular length, the first one, two or
    1214             four bytes of each name are gathered and stored in a linear array. SIMD techniques are
    1215             then used to compare these prefixes with the possible prefixes for the current position
    1216             within the trie. In general, a very small number of possibilities exist for each trie
    1217             node, allowing for fast linear search through all possibilities. Typically, the
    1218             parallelism is expected to exceed the number of possibilities to search through at each
    1219             node. With length-sorting to separate the top-level trie into many small subtries, we
    1220             expect only a single step of symbol lookup to be needed in most practical instances. </para>
    1221 
    1222          <para>The gather step of this algorithm is actually a common technique in SIMD processing.
    1223             Instruction set support for gather operations is a likely future direction for SIMD
    1224             technology.</para>
    1225       </section>
    1226 
    1227       <section>
    1228          <title>Numeric Processing</title>
    1229          <para> Many XML applications involve numeric data fields as attribute values or element
    1230             content. Although most current XML APIs uniformly return information to applications in
    1231             the form of character strings, it is reasonable to consider direct API support for
    1232             numeric conversions within a high-performance XML engine. With string to numeric
    1233             conversion such a common need, why leave it to application programmers? </para>
    1234          <para> High-performance string to numeric conversion using SIMD operations also can
    1235             considerably outperform the byte-at-a-time loops that most application programmers or
    1236             libraries might employ. A first step is reduction of ASCII bytes to corresponding
    1237             decimal nybbles using a SIMD packing operation. Then an inductive doubling algorithm
    1238             using SIMD operations may be employed. First, 16 sets of adjacent nybble values in the
    1239             range 0-9 can be combined in just a few SIMD operations to 16 byte values in the range
    1240             0-99. Then 8 sets of byte values may similarly be combined with further SIMD processing
    1241             to produce doublebyte values in the range 0-9999. Further combination of doublebyte
    1242             values into 32-bit integers and so on can also be performed using SIMD operations. </para>
    1243          <para> Using appropriate gather operations to bring numeric strings into appropriate array
    1244             structures, an XML engine could offer high-performance numeric conversion services to
    1245             XML application programmers. We expect this to be an important direction for our future
    1246             work, particularly in support of APIs that focus on direct conversion of XML data into
    1247             business objects. </para>
    1248 
    1249       </section>
    1250    </section>
    1251 
    1252    <section>
    1253       <title>APIs and Parallel Bit Streams</title>
    1254 
    1255       <section>
    1256          <title>The ILAX Streaming API</title>
    1257          <para>The In-Line API for XML (ILAX) is the base API provided with the Parabix parser. It
    1258             is intended for low-level extensions compiled right into the engine, with minimum
    1259             possible overhead. It is similar to streaming event-based APIs such as SAX, but
    1260             implemented by inline substitution rather than using callbacks. In essence, an extension
    1261             programmer provides method bodies for event-processing methods declared internal to the
    1262             Parabix parsing engine, compiling the event processing code directly with the core code
    1263             of the engine. </para>
    1264          <para> Although ILAX can be used directly for application programming, its primary use is
    1265             for implementing engine extensions that support higher-level APIs. For example, the
    1266             implementation of C or C++ based streaming APIs based on the Expat [<xref
    1267                linkend="Expat"/>] or general SAX models can be quite directly implemented. C/C++ DOM
    1268             or other tree-based APIs can also be fairly directly implemented. However, delivering
    1269             Parabix performance to Java-based XML applications is challenging due to the
    1270             considerable overhead of crossing the Java Native Interface (JNI) boundary. This issue
    1271             is addressed with the Array Set Model (ASM) concept discussed in the following section. </para>
    1272          <para> With the recent development of parallel parsing using bitstream addition, it is
    1273             likely that the underlying ILAX interface of Parabix will change. In essence, ILAX
    1274             suffers the drawback of all event-based interfaces: they are fundamentally sequential in
    1275             number. As research continues, we expect efficient parallel methods building on parallel
    1276             bit stream foundations to move up the stack of XML processing requirements. Artificially
    1277             imposing sequential processing is thus expected to constrain further advances in XML
    1278             performance. </para>
    1279       </section>
    1280 
    1281       <section>
    1282          <title>Efficient XML in Java Using Array Set Models</title>
    1283          <para> In our GML-to-SVG case study, we identified the lack of high-performance XML
    1284             processing solutions for Java to be of particular interest. Java byte code does not
    1285             provide access to the SIMD capabilities of the underlying machine architecture. Java
    1286             just-in-time compilers might be capable of using some SIMD facilities, but there is no
    1287             real prospect of conventional compiler technology translating byte-at-a-time algorithms
    1288             into parallel bit stream code. So the primary vehicle for delivering high-performance
    1289             XML processing is to call native parallel bit stream code written in C through JNI
    1290             capabilities. </para>
    1291          <para>However, each JNI call is expensive, so it is desirable to minimize the number of
    1292             calls and get as much work done during each call as possible. This mitigates against
    1293             direct implementation of streaming APIs in Java through one-to-one mappings to an
    1294             underlying streaming API in C. Instead, we have concentrated on gathering information on
    1295             the C side into data structures that can then be passed to the Java side. However, using
    1296             either C pointer-based structures or C++ objects is problematic because these are
    1297             difficult to interpret on the Java side and are not amenable to Java's automatic storage
    1298             management system. Similarly, Java objects cannot be conveniently created on the C side.
    1299             However, it is possible to transfer arrays of simple data values (bytes or integers)
    1300             between C and Java, so that makes a reasonable focus for bulk data communication between
    1301             C and Java. </para>
    1302          <para><emphasis>Array Set Models</emphasis> are array-based representations of information
    1303             representing an XML document in accord with XML InfoSet [<xref linkend="InfoSet"/>] or
    1304             other XML data models relevant to particular APIs. As well as providing a mechanism for
    1305             efficient bulk data communication across the JNI boundary, ASMs potentially have a
    1306             number of other benefits in high-performance XML processing. <itemizedlist>
    1307                <listitem>
    1308                   <para>Prefetching. Commodity processors commonly support hardware and/or software
    1309                      prefetching to ensure that data is available in a processor cache when it is
    1310                      needed. In general, prefetching is most effective in conjunction with the
    1311                      continuous sequential memory access patterns associated with array
    1312                   processing.</para>
    1313                </listitem>
    1314                <listitem>
    1315                   <para>DMA. Some processing environments provide Direct Memory Access (DMA)
    1316                      controllers for block data movement in parallel with computation. For example,
    1317                      the Cell Broadband Engine uses DMA controllers to move the data to and from the
    1318                      local stores of the synergistic processing units. Arrays of contiguous data
    1319                      elements are well suited to bulk data movement using DMA.</para>
    1320                </listitem>
    1321                <listitem>
    1322                   <para>SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern
    1323                      processor instruction sets allow simultaneous application of particular
    1324                      instructions to sets of elements from parallel arrays. For effective use of
    1325                      SIMD capabilities, an SoA (Structure of Arrays) model is preferrable to an AoS
    1326                      (Array of Structures) model. </para>
    1327                </listitem>
    1328                <listitem>
    1329                   <para>Multicore processors. Array-oriented processing can enable the effective
    1330                      distribution of work to the individual cores of a multicore system in two
    1331                      distinct ways. First, provided that sequential dependencies can be minimized or
    1332                      eliminated, large arrays can be divided into separate segments to be processed
    1333                      in parallel on each core. Second, pipeline parallelism can be used to implement
    1334                      efficient multipass processing with each pass consisting of a processing kernel
    1335                      with array-based input and array-based output. </para>
    1336                </listitem>
    1337                <listitem>
    1338                   <para>Streaming buffers for large XML documents. In the event that an XML document
    1339                      is larger than can be reasonably represented entirely within processor memory,
    1340                      a buffer-based streaming model can be applied to work through a document using
    1341                      sliding windows over arrays of elements stored in document order. </para>
    1342                </listitem>
    1343 
    1344             </itemizedlist>
    1345          </para>
    1346 
    1347          <section>
    1348             <title>Saxon-B TinyTree Example</title>
    1349             <para>As a first example of the ASM concept, current work includes a proof-of-concept to
    1350                deliver a high-performance replacement for building the TinyTree data structure used
    1351                in Saxon-B 6.5.5, an open-source XSLT 2.0 processor written in Java [<xref
    1352                   linkend="Saxon"/>]. Although XSLT stylesheets may be cached for performance, the
    1353                caching of source XML documents is typically not possible. A new TinyTree object to
    1354                represent the XML source document is thus commonly constructed with each new query so
    1355                that the overall performance of simple queries on large source XML documents is
    1356                highly dependent on TinyTree build time. Indeed, in a study of Saxon-SA, the
    1357                commercial version of Saxon, query time was shown to be dominated by TinyTree build
    1358                time [<xref linkend="Kay08"/>]. Similar performance results are demonstrable for the
    1359                Saxon-B XSLT processor as well. </para>
    1360             <para> The Saxon-B processor studied is a pure Java solution, converting a SAX (Simple
    1361                API for XML) event stream into the TinyTree Java object using the efficient Aelfred
    1362                XML parser [<xref linkend="AElfred"/>]. The TinyTree structure is itself an
    1363                array-based structure mapping well suited to the ASM concept. It consists of six
    1364                parallel arrays of integers indexed on node number and containing one entry for each
    1365                node in the source document, with the exception of attribute and namespace nodes
    1366                   [<xref linkend="Saxon"/>]. Four of the arrays respectively provide node kind, name
    1367                code, depth, and next sibling information for each node, while the two others are
    1368                overloaded for different purposes based on node kind value. For example, in the
    1369                context of a text node , one of the overloaded arrays holds the text buffer offset
    1370                value whereas the other holds the text buffer length value. Attributes and namespaces
    1371                are represented using similiar parallel array of values. The stored TinyTree values
    1372                are primarily primitive Java types, however, object types such as Java Strings and
    1373                Java StringBuffers are also used to hold attribute values and comment values
    1374                respectively. </para>
    1375             <para> In addition to the TinyTree object, Saxon-B maintains a NamePool object which
    1376                represents a collection of XML name triplets. Each triplet is composed of a Namespace
    1377                URI, a Namespace prefix and a local name and encoded as an integer value known as a
    1378                namecode. Namecodes permit efficient name search and look-up using integer
    1379                comparison. Namecodes may also be subsequently decoded to recover namespace and local
    1380                name information. </para>
    1381             <para> Using the Parabix ILAX interface, a high-performance reimplementation of TinyTree
    1382                and NamePool data structures was built to compare with the Saxon-B implementation. In
    1383                fact, two functionally equivalent versions of the ASM java class were constructed. An
    1384                initial version was constructed based on a set of primitive Java arrays constructed
    1385                and allocated in the Java heap space via JNI New&lt;PrimitiveType&gt;Array
    1386                method call. In this version, the JVM garbage collector is aware of all memory
    1387                allocated in the native code. However, in this approach, large array copy operations
    1388                limited overall performance to approximately a 2X gain over the Saxon-B build time. </para>
    1389             <para>To further address the performance penalty imposed by copying large array values,
    1390                a second version of the ASM Java object was constructed based on natively backed
    1391                Direct Memory Byte Buffers [<xref linkend="JNI"/>]. In this version the JVM garbage
    1392                collector is unaware any native memory resources backing the Direct Memory Byte
    1393                Buffers. Large JNI-based copy operations are avoided; however, system memory must be
    1394                explicitly deallocated via a Java native method call. Using this approach, our
    1395                preliminary results show an approximate total 2.5X gain over Saxon-B build time.
    1396             </para>
    1397          </section>
    1398       </section>
    1399    </section>
    1400 
    1401 
    1402    <section>
    1403       <title>Compiler Technology</title>
    1404 
    1405       <para> An important focus of our recent work is on the development of compiler technology to
    1406          automatically generate the low-level SIMD code necessary to implement bit stream processing
    1407          given suitable high-level specifications. This has several potential benefits. First, it
    1408          can eliminate the tedious and error-prone programming of bit stream operations in terms of
    1409          register-at-a-time SIMD operations. Second, compilation technology can automatically employ
    1410          a variety of performance improvement techniques that are difficult to apply manually. These
    1411          include algorithms for instruction scheduling and register allocation as well as
    1412          optimization techniques for common subexpression expression elimination and register
    1413          rematerialization among others. Third, compiler technology makes it easier to make changes
    1414          to the low-level code for reasons of perfective or adaptive maintenance.</para>
    1415 
    1416       <para>Beyond these reasons, compiler technology also offers the opportunity for retargetting
    1417          the generation of code to accommodate different processor architectures and API
    1418          requirements. Strategies for efficient parallel bit stream code can vary considerably
    1419          depending on processor resources such as the number of registers available, the particular
    1420          instruction set architecture supported, the size of L1 and L2 data caches, the number of
    1421          available cores and so on. Separate implementation of custom code for each processor
    1422          architecture would thus be likely to be prohibitively expensive, prone to errors and
    1423          inconsistencies and difficult to maintain. Using compilation technology, however, the idea
    1424          would be to implement a variety of processor-specific back-ends all using a common front
    1425          end based on parallel bit streams. </para>
    1426 
    1427       <section>
    1428          <title>Character Class Compiler</title>
    1429 
    1430          <para>The first compiler component that we have implemented is a character class compiler,
    1431             capable of generation all the bit stream logic necessary to produce a set of lexical
    1432             item streams each corresponding to some particular set of characters to be recognized.
    1433             By taking advantage of common patterns between characters within classes, and special
    1434             optimization logic for recognizing character-class ranges, our existing compiler is able
    1435             to generate well-optimized code for complex sets of character classes involving numbers
    1436             of special characters as well as characters within specific sets of ranges. </para>
    1437 
     142         <title>Xerces C++ Structure</title>
     143<para>
     144The Xerces C++ parser
     145<!-- is a widely-used standards-conformant -->
     146<!-- XML parser produced as open-source software -->
     147<!-- by the Apache Software Foundation. -->
     148<!-- It -->
     149features comprehensive support for a variety of character encodings
     150both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
     151multiple XML vocabularies through the XML namespace
     152mechanism, as well as complete implementations
     153of structure and data validation through multiple grammars
     154declared using either legacy DTDs (document type
     155definitions) or modern XML Schema facilities.
     156Xerces also supports several APIs for accessing
     157parser services, including event-based parsing
     158using either pull parsing or SAX/SAX2 push-style
     159parsing as well as a DOM tree-based parsing interface.
     160</para>
     161<para>
     162<!--What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?-->
     163<!--Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars-->
     164<!--Should we show a val-grind summary of a few files in a linechart form?-->
     165Xerces, like all traditional parsers, processes XML documents sequentially a byte-at-a-time from the
     166first to the last byte of input data. Each byte passes through several processing layers and is
     167classified and eventually validated within the context of the document state.
     168This introduces implicit dependencies between the various tasks within the application that make it
     169difficult to optimize for performance.
     170As a complex software system, no one feature dominates the overall parsing performance.
     171Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
     172Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
     173parallelization in isolation would only produce a minute improvement in performance.
     174Unfortunately, early investigation into these functions found
     175that incorporating speculation-free thread-level parallelization was impossible
     176and they were already performing well in their given tasks;
     177thus only trivial enhancements were attainable.
     178In order to obtain a systematic acceleration of Xerces,
     179it should be expected that a comprehensive restructuring
     180is required, involving all aspects of the parser.
     181</para>
     182<para>
     183<!-- In order to obtain systematic acceleration of the Xerces parser,-->
     184<!-- it should be expected that a comprehensive restructuring-->
     185<!-- is required, involving all aspects of the parser.-->
     186<!-- FIGURE
     187\begin{figure}[h]
     188\begin{tabular}{r|l}
     189Time (\%) & Function Name \\
     190\hline
     19113.29   &       XMLUTF8Transcoder::transcodeFrom \\
     1927.45    &       IGXMLScanner::scanCharData \\
     1936.83    &       memcpy \\
     1945.83    &       XMLReader::getNCName \\
     1954.67    &       IGXMLScanner::buildAttList \\
     1964.54    &       RefHashTableOf\verb|<>|::findBucketElem \\
     1974.20    &       IGXMLScanner::scanStartTagNS \\
     1983.75    &       ElemStack::mapPrefixToURI \\
     1993.58    &       ReaderMgr::getNextChar \\
     2003.20    &       IGXMLScanner::basicAttrValueScan \\
     201\end{tabular}
     202\caption{Execution Time of Top 10 Xerces Functions}
     203\label {fig:xerces-profile}
     204\end{figure}
     205-->
     206</para>
    1438207      </section>
    1439208      <section>
    1440          <title>Regular Expression Compilation</title>
     209         <title>The Parabix Framework</title>
     210<para>
     211The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
     212(and other forms of text processing.) The key idea is to exploit the availability of wide
     213SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
     214of input data by using one register bit per single input byte.
     215To facilitate this, the input data is first transposed into a set of basis bit streams.
     216In <!--FIGURE REF Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
     217is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
     218-->
     219<!-- The bits used to construct $\tt b_7$ have been highlighted in this example. -->
     220Boolean-logic operations\footnote{&#8743;, \&#8744; and &#172; denote the boolean AND, OR and NOT operators.}
     221are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
     222characters (or groups of characters) with a $1$.
     223For example, one of the fundamental characters in XML is a left-angle bracket.
     224A character is a<code>&lt; if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
     225b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1</code>.
     226Similarly, a character is numeric
     227<code>[0-9] if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))</code>.
     228An important observation here is that ranges of characters may
     229require fewer operations than individual characters and
     230<!-- the classification cost could be amortized over many character classes.-->
     231multiple classes can share the classification cost.
     232</para>
     233<para>
     234<!-- FIGURE
     235\begin{figure}[h]
     236\begin{center}
     237\begin{tabular}{r c c c c }
     238String & \ttfamily{b} & \ttfamily{7} & \ttfamily{\verb`<`} & \ttfamily{A} \\
     239ASCII & \ttfamily{\footnotesize 0110001{\bfseries 0}} & \ttfamily{\footnotesize 0011011{\bfseries 1}} & \ttfamily{\footnotesize 0011110{\bfseries 0}} & \ttfamily{\footnotesize 0100000{\bfseries 1}} \\
     240\hline
     241\end{tabular}
     242\end{center}
     243\begin{center}
     244\begin{tabular}{r |c |c |c |c |c |c |c |c |}
     245 & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{0}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{1}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{2}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{3}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{4}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{5}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{6}$}$ & $\mbox{\fontsize{11}{11}\selectfont $\tt b_{7}$}$ \\
     246 & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \bfseries\ttfamily{0} \\
     247 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \bfseries\ttfamily{1} \\
     248 & \ttfamily{0} & \ttfamily{0} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{1} & \ttfamily{0} & \bfseries\ttfamily{0} \\
     249 & \ttfamily{0} & \ttfamily{1} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \ttfamily{0} & \bfseries\ttfamily{1} \\
     250\end{tabular}
     251\end{center}
     252\caption{8-bit ASCII Basis Bit Streams}
     253\label{fig:BitStreamsExample}
     254\end{figure}
     255-->
     256</para>
     257<!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
     258<!-- bit streams can be transformed into lexical bit streams, where the presense of -->
     259<!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
     260<!-- process, intra-element well-formedness validation is performed on each block -->
     261<!-- of text. -->
     262<para>
     263Consider, for example, the XML source data stream shown in the first line of <!-- FIGURE REF Figure \ref{fig:parabix1} -->.
     264The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
     265parsing, with each bit of each stream in one-to-one correspondence to the source character code units
     266of the input stream.
     267For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
     268The first bit stream shown is that for the opening
     269angle brackets that represent tag openers in XML.
     270The second and third streams show a partition of the
     271tag openers into start tag marks and end tag marks
     272depending on the character immediately following the
     273opener (i.e., ``\verb:/:'') or not.  The remaining three
     274lines show streams that can be computed in subsequent
     275parsing (using the technique
     276of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
     277attribute names and attribute values of tags. 
     278</para>
     279<para>
     280Two intuitions may help explain how the Parabix approach can lead
     281to improved XML parsing performance. The first is that
     282the use of the full register width offers a considerable
     283information advantage over sequential byte-at-a-time
     284parsing.  That is, sequential processing of bytes
     285uses just 8 bits of each register, greatly limiting the
     286processor resources that are effectively being used at any one time.
     287The second is that byte-at-a-time loop scanning loops are actually
     288often just computing a single bit of information per iteration:
     289is the scan complete yet?
     290Rather than computing these individual decision-bits, an approach that computes
     291many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
     292should provide substantial benefit.
     293</para>
     294<para>
     295Previous studies have shown that the Parabix approach improves many aspects of XML processing,
     296including transcoding \cite{Cameron2008}, character classification and validation,
     297tag parsing and well-formedness checking. 
     298The first Parabix parser used processor bit scan instructions to considerably accelerate
     299sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
     300Recent work has incorporated a method of parallel
     301scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as
     302well as combining SIMD methods with 4-stage pipeline parallelism to further improve
     303throughput \cite{HPCA2012}.
     304Although these research prototypes handled the full syntax of schema-less XML documents,
     305they lacked the functionality required by full XML parsers.
     306</para>
     307<para>
     308Commercial XML processors support transcoding of multiple character sets and can parse and
     309validate against multiple document vocabularies.
     310Additionally, they provide API facilities beyond those found in research prototypes,
     311including the widely used SAX, SAX2 and DOM interfaces.
     312</para>
     313      </section>
     314      <section>
     315         <title>Sequential vs. Parallel Paradigm</title>
     316<para>
     317Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents sequentially.
     318Each character is examined to distinguish between the
     319XML-specific markup, such as a left angle bracket <code>&lt;</code>, and the
     320content held within the document. 
     321As the parser progresses through the document, it alternates between markup scanning,
     322validation and content processing modes.
     323</para>
     324<para>
     325In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
     326  Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
     327  current state and the events associated with transitions of states.}.
     328Each state transition indicates the processing context of subsequent characters.
     329Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
     330</para>
     331<para>
     332Parabix-style XML parsers utilize a concept of layered processing.
     333A block of source text is transformed into a set of lexical \bitstream{}s,
     334which undergo a series of operations that can be grouped into logical layers,
     335e.g., transposition, character classification, and lexical analysis.
     336Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
     337To meet the API requirements of the document-ordered Xerces output,
     338the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
     339</para>
     340      </section>                       
     341     </section>                 
     342                <section>
     343                        <title>Architecture</title>             
     344                        <section>
     345                       <title>Overview</title>
     346<!--\def \CSG{Content Stream Generator}-->
     347<para>
     348\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
     349rearchitected with pipeline parallelism in mind.
     350In this section, we highlight the core differences between the two systems.
     351As shown in Figure \ref{fig:xerces-arch}, Xerces
     352is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
     353The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
     354the majority of the character set encoding validation is performed as a byproduct of this process.
     355The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
     356It tracks the current line/column position,
     357<!--(which is reported in the unlikely event that the input contains an error), -->
     358performs line-break normalization and validates context-specific character set issues,
     359such as tokenization of qualified-names.
     360The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
     361of the document; it deals with all issues related to entity expansion, validates
     362the XML well-formedness constraints and any character set encoding issues that cannot
     363be completely handled by the reader or transcoder (e.g., surrogate characters, validation
     364and normalization of character references, etc.)
     365The {\it Namespace Binder} is a core piece of the element stack.
     366It handles namespace scoping issues between different XML vocabularies.
     367This allows the scanner to properly select the correct schema grammar structures.
     368The {\it Validator} takes the IR produced by the Scanner (and
     369potentially annotated by the Namespace Binder) and assesses whether the final output matches
     370the user-defined DTD and schema grammar(s) before passing it to the end-user.
     371</para>
     372<para>
     373<!-- FIGURE
     374\begin{figure}[h]
     375\begin{center}
     376\includegraphics[height=0.45\textheight,keepaspectratio]{plots/xerces.pdf}
     377\caption{Xerces Architecture}
     378\label{fig:xerces-arch}
     379\end{center}
     380\end{figure}
     381-->
     382</para>
     383<para>
     384In \icXML{} functions are grouped into logical components.
     385As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
     386All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
     387The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
     388mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
     389set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
     390These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
     391after additional processing is performed.
     392The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
     393It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
     394significant positions within the input data. One \bitstream{} for each of the critical piece of information is created, such as
     395the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
     396Intra-element well-formedness validation is performed as an artifact of this process.
     397Like Xerces, \icXML{} must provide the Line and Column position of each error.
     398The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
     399optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
     400From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
     401</para>
     402<para>
     403A typical XML file contains few unique element and attribute names&#8212;but each of them will occur frequently.
     404\icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID).
     405Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
     406the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
     407</para>
     408<para>
     409The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
     410The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
     411information provided by the Parallel Markup Parser, and the latter transforms the
     412filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
     413</para>
     414<para>
     415Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document.
     416The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
     417The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
     418to perform in bit space, such as ensuring every start tag has a matching end tag.
     419Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
     420it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are
     421associated with each symbol occurrence.
     422This is discussed in Section \ref{section:arch:namespacehandling}.
     423Finally, the {\it Validation} layer implements the Xerces's validator.
     424However, preprocessing associated with each symbol greatly reduces the work of this stage.
     425</para>
     426<para>
     427<!-- FIGURE
     428\begin{figure}[h]
     429\begin{center}
     430\includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf}
     431\end{center}
     432\caption{\icXML{} Architecture}
     433\label{fig:icxml-arch}
     434\end{figure}
     435-->
     436</para>
     437            </section>
     438                        <section>
     439                       <title>Character Set Adapters</title>
     440<para>
     441In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and
     442provide the end-consumer with a single encoding format.
     443In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
     444because of the need to decode and classify each byte of input, mapping variable-length UTF-8
     445byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   
     446In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
     447transcoding imposes at least a cost of buffer copying.
     448</para>
     449<para>
     450In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
     451Given a specified input encoding, a CSA is responsible for checking that
     452input code units represent valid characters, mapping the characters of the encoding into
     453the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item
     454streams), as well as supporting ultimate transcoding requirements.   All of this work
     455is performed using the parallel \bitstream{} representation of the source input.
     456</para>
     457<para>
     458An important observation is that many character sets are an
     459extension to the legacy 7-bit ASCII character set.  This includes the
     460various ISO Latin character sets, UTF-8, UTF-16 and many others.
     461Furthermore, all significant characters for parsing XML are confined to the
     462ASCII repertoire.   Thus, a single common set of lexical item calculations
     463serves to compute lexical item streams for all such ASCII-based character sets.
     464</para>
     465<para>
     466A second observation is that&#8212;regardless of which character set is used&#8212;quite
     467often all of the characters in a particular block of input will be within the ASCII range.
     468This is a very simple test to perform using the \bitstream{} representation, simply confirming that the
     469bit 0 stream is zero for the entire block.   For blocks satisfying this test,
     470all logic dealing with non-ASCII characters can simply be skipped.
     471Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the
     472UTF-16 form are each set to zero in this case.
     473</para>
     474<para>
     475A third observation is that repeated transcoding of the names of XML
     476elements, attributes and so on can be avoided by using a look-up mechanism.
     477That is, the first occurrence of each symbol is stored in a look-up
     478table mapping the input encoding to a numeric symbol ID.   Transcoding
     479of the symbol is applied at this time.  Subsequent look-up operations
     480can avoid transcoding by simply retrieving the stored representation.
     481As symbol look up is required to apply various XML validation rules,
     482there is achieves the effect of transcoding each occurrence without
     483additional cost.
     484</para>
     485<para>
     486The cost of individual character transcoding is avoided whenever a block of input is
     487confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
     488Furthermore, when transcoding is required, the parallel \bitstream{} representation
     489supports efficient transcoding operations.   
     490In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s
     491can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
     492and all but the final bytes of multi-byte sequences can be marked for deletion as
     493discussed in the following subsection.
     494In other cases, transcoding within a block only need be applied for non-ASCII
     495bytes, which are conveniently identified by iterating through the bit 0 stream
     496using bit scan operations.
     497</para>
     498            </section>
     499                        <section>
     500                       <title>Combined Parallel Filtering</title>
     501<para>
     502As just mentioned, UTF-8 to UTF-16 transcoding involves marking
     503all but the last bytes of multi-byte UTF-8 sequences as
     504positions for deletion.   For example, the two
     505Chinese characters \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}
     506are represented as two three-byte UTF-8 sequences \verb'E4 BD A0'
     507and \verb'E5 A5 BD' while the UTF-16 representation must be
     508compressed down to the two code units \verb'4F60' and \verb'597D'.
     509In the bit parallel representation, this corresponds to a reduction
     510from six bit positions representing UTF-8 code units (bytes)
     511down to just two bit positions representing UTF-16 code units
     512(double bytes).   This compression may be achieved by
     513arranging to calculate the correct UTF-16 bits at the
     514final position of each sequence and creating a deletion
     515mask to mark the first two bytes of each 3-byte sequence
     516for deletion.   In this case, the portion of the mask
     517corresponding to these input bytes is the bit sequence
     518\verb'110110'.  Using this approach, transcoding may then be
     519completed by applying parallel deletion and inverse transposition of the
     520UTF-16 \bitstream{}s\cite{Cameron2008}.
     521</para>
     522<para>
     523<!-- FIGURE
     524\begin{figure*}[tbh]
     525\begin{center}
     526\begin{tabular}{rr}\\
     527Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
     528-->
     529<!-- Tag Openers & \verb`1____________1____________________________1____________1__________`\\-->
     530<!-- Start Tag Marks & \verb`_1____________1___________________________________________________`\\-->
     531<!-- End Tag Marks & \verb`___________________________________________1____________1_________`\\-->
     532<!-- Empty Tag Marks & \verb`__________________________________________________________________`\\-->
     533<!-- Element Names & \verb`_11111111_____1111111_____________________________________________`\\-->
     534<!-- Attribute Names & \verb`______________________11_______11_________________________________`\\-->
     535<!-- Attribute Values & \verb`__________________________111________111__________________________`\\-->
     536<!-- FIGURE
     537String Ends & \verb`1____________1_______________1__________1_1____________1__________`\\
     538Markup Identifiers & \verb`_________1______________1_________1______1_1____________1_________`\\
     539Deletion Mask & \verb`_11111111_____1111111111_1____1111_11_______11111111_____111111111`\\
     540Undeleted Data & \verb``{\tt\it 0}\verb`________>fee`{\tt\it 0}\verb`__________=_fie`{\tt\it 0}\verb`____=__foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/________fum`{\tt\it 0}\verb`/_________`
     541\end{tabular}
     542\end{center}
     543\caption{XML Source Data and Derived Parallel Bit Streams}
     544\label{fig:parabix2}
     545\end{figure*}
     546-->
     547</para>
     548<para>
     549Rather than immediately paying the
     550costs of deletion and transposition just for transcoding,
     551however, \icXML{} defers these steps so that the deletion
     552masks for several stages of processing may be combined.
     553In particular, this includes core XML requirements
     554to normalize line breaks and to replace character
     555reference and entity references by their corresponding
     556text.   In the case of line break normalization,
     557all forms of line breaks, including bare carriage
     558returns (CR), line feeds (LF) and CR-LF combinations
     559must be normalized to a single LF character in
     560each case.   In \icXML{}, this is achieved by
     561first marking CR positions, performing two
     562bit parallel operations to transform the marked
     563CRs into LFs, and then marking for deletion any
     564LF that is found immediately after the marked CR
     565as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
     566<!-- FIGURE
     567\begin{figure}
     568\begin{verbatim}
     569# XML 1.0 line-break normalization rules.
     570if lex.CR:
     571# Modify CR (#x0D) to LF (#x0A)
     572  u16lo.bit_5 ^= lex.CR
     573  u16lo.bit_6 ^= lex.CR
     574  u16lo.bit_7 ^= lex.CR
     575  CRLF = pablo.Advance(lex.CR) & lex.LF
     576  callouts.delmask |= CRLF
     577# Adjust LF streams for line/column tracker
     578  lex.LF |= lex.CR
     579  lex.LF ^= CRLF
     580\end{verbatim}
     581\caption{Line Break Normalization Logic}\label{fig:LBnormalization}
     582\end{figure}
     583-->
     584</para>
     585<para>
     586In essence, the deletion masks for transcoding and
     587for line break normalization each represent a bitwise
     588filter; these filters can be combined using bitwise-or
     589so that the parallel deletion algorithm need only be
     590applied once.
     591</para>
     592<para>
     593A further application of combined filtering
     594is the processing of XML character and entity
     595references.   Consider, for example, the references <code>&amp;</code> or <code>&#x3C;</code>.
     596which must be replaced in XML processing with 
     597the single <code>&amp;</code> and <code>&lt;</code> characters, respectively.
     598The approach in \icXML{} is to mark all but the first character
     599positions of each reference for deletion, leaving a
     600single character position unmodified.  Thus, for the
     601references <code>&amp;</code> or <code>&#x3C;</code> the
     602masks <code>01111</code> and <code>011111</code> are formed and
     603combined into the overall deletion mask.   After the
     604deletion and inverse transposition operations are finally
     605applied, a post-processing step inserts the proper character
     606at these positions.   One note about this process is
     607that it is speculative; references are assumed to generally
     608be replaced by a single UTF-16 code unit.   In the case,
     609that this is not true, it is addressed in post-processing.
     610</para>
     611<para>
     612The final step of combined filtering occurs during
     613the process of reducing markup data to tag bytes
     614preceding each significant XML transition as described
     615in section~\ref{section:arch:contentstream}.  Overall, \icXML{}
     616avoids separate buffer copying operations for each of the
     617these filtering steps, paying the cost of parallel
     618deletion and inverse transposition only once. 
     619Currently, \icXML{} employs the parallel-prefix compress algorithm
     620of Steele~\cite{HackersDelight}  Performance
     621is independent of the number of positions deleted.
     622Future versions of \icXML{} are expected to
     623take advantage of the parallel extract operation~\cite{HilewitzLee2006}
     624that Intel is now providing in its Haswell architecture.
     625</para>
     626            </section>
     627                        <section>
     628                       <title>Content Stream</title>
     629<para>
     630A relatively-unique concept for \icXML{} is the use of a filtered content stream.
     631Rather that parsing an XML document in its original format, the input is transformed
     632into one that is easier for the parser to iterate through and produce the sequential
     633output.
     634In <!-- FIGURE REF Figure~\ref{fig:parabix2} -->, the source data
     635<!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
     636is transformed into
     637<!-- CODE -->
     638<!--``{\tt\it 0}\verb`>fee`{\tt\it 0}\verb`=fie`{\tt\it 0}\verb`=foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/fum`{\tt\it 0}\verb`/`''-->
     639through the parallel filtering algorithm, described in section \ref{sec:parfilter}.
     640</para>
     641<para>
     642Combined with the symbol stream, the parser traverses the content stream to effectively
     643reconstructs the input document in its output form.
     644The initial {\tt\it 0} indicates an empty content string. The following \verb|>|
     645indicates that a start tag without any attributes is the first element in this text and
     646the first unused symbol, <code>document</code>, is the element name.
     647Succeeding that is the content string <code>fee</code>, which is null-terminated in accordance
     648with the Xerces API specification. Unlike Xerces, no memory-copy operations
     649are required to produce these strings, which as Figure~\ref{fig:xerces-profile} shows
     650accounts for 6.83% of Xerces's execution time.
     651Additionally, it is cheap to locate the terminal character of each string:
     652using the String End \bitstream{}, the \PS{} can effectively calculate the offset of each
     653null character in the content stream in parallel, which in turn means the parser can
     654directly jump to the end of every string without scanning for it.
     655</para>
     656<para>
     657Following ``\verb`fee`'' is a \verb`=`, which marks the existence of an attribute.
     658Because all of the intra-element was performed in the \PS{}, this must be a legal attribute.
     659Since attributes can only occur within start tags and must be accompanied by a textual value,
     660the next symbol in the symbol stream must be the element name of a start tag,
     661and the following one must be the name of the attribute and the string that follows the \verb`=` must be its value.
     662However, the subsequent \verb`=` is not treated as an independent attribute because the parser has yet to
     663read a \verb`>`, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and
     664it (along with the string value) is added to the element.
     665Eventually the parser reaches a \verb`/`, which marks the existence of an end tag. Every end tag requires an
     666element name, which means they require a symbol. Inter-element validation whenever an empty tag is detected to
     667ensure that the appropriate scope-nesting rules have been applied.
     668</para>
     669            </section>
     670                        <section>
     671                       <title>Namespace Handling</title>
     672<!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
     673<para>
     674In XML, namespaces prevents naming conflicts when multiple vocabularies are used together.
     675It is especially important when a vocabulary application-dependant meaning, such as when
     676XML or SVG documents are embedded within XHTML files.
     677Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify
     678specific names or resources.
     679On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML
     680processor to bind the prefix <code>p</code> to the URI &apos;<code>pub.net</code>&apos; and the default (empty)
     681prefix to <code>book.org</code>. Thus to the XML processor, the \verb|title| on line 2 and
     682\verb|price| on line 4 both read as \verb|"book.org":title| and \verb|"book.org":price|
     683respectively, whereas on line 3 and 5, \verb|p:name| and \verb|price| are seen as
     684\verb|"pub.net":name| and \verb|"pub.net":price|. Even though the actual element name
     685\verb|price|, due to namespace scoping rules they are viewed as two uniquely-named items
     686because the current vocabulary is determined by the namespace(s) that are in-scope.
     687</para>
     688<para>
     689<!-- FIGURE
     690\begin{figure}[h]
     691\begin{tabular}{l|l}
     6921. & \verb|<book xmlns:p="pub.net" xmlns="book.org">| \\
     6932. & \verb|  <title>BOOK NAME</title>| \\
     6943. & \verb|  <p:name>PUBLISHER NAME</p:name>| \\
     6954. & \verb|  <price>X</price>| \\
     6965. & \verb|  <price xmlns="publisher.net">Y</price>| \\
     6976. & \verb|</book>| \\
     698\end{tabular}
     699\caption{XML Namespace Example}
     700\label {fig:namespace1}
     701\end{figure}
     702-->
     703</para>
     704<para>
     705In both Xerces and \icXML{}, every URI has a one-to-one mapping to a URI ID.
     706These persist for the lifetime of the application through the use of a global URI pool.
     707Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
     708in the document. Because a namespace declaration affects the entire element, it must be processed prior to
     709grammar validation. This is a costly process considering that a typical namespaced XML document only comes
     710in one of two forms:
     711(1) those that declare a set of namespaces upfront and never change them, and
     712(2) those that repeatedly modify the namespaces in predictable patterns.
     713</para>
     714<para>
     715For that reason, \icXML{} contains an independent namespace stack and utilizes bit vectors to cheaply perform
     716<!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
     717<!-- performance advantage figure?? average cycles/byte cost? -->
     718When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps
     719the prefix (which are assigned Prefix IDs in the symbol resolution process) to the URI.
     720Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
     721NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
     722prefix binding set of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively.
     723To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
     724maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
     725NSID can be found using a bit-scan intrinsic.
     726A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
     727</para>
     728<para>
     729<!-- FIGURE
     730\begin{table}[h]
     731\begin{center}
     732\begin{tabular}{|c||c|c|c|c|}\hline
     733NSID & Prefix & URI & Prefix ID & URI ID \\ \hline\hline
     7340 & {\tt p} & {\tt pub.net} & 0 & 0 \\ \hline
     7351 & {\tt xmlns} & {\tt books.org} & 1 & 1 \\ \hline
     7362 & {\tt xmlns} & {\tt pub.net} & 1 & 0 \\ \hline
     737\end{tabular}
     738\caption{Namespace Binding Table Example}
     739\end{center}
     740\label{tbl:namespace1}
     741\end{table}
     742-->
     743</para>
     744<para>
     745<!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
     746<!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
     747<!-- NSid = bitscan(VisiblePrefixBinding); -->
     748<!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
     749</para>
     750<para>
     751To ensure that scoping rules are adhered to,
     752whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored
     753within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
     754currently visible namespaces is XORed with the vector at the top of the stack.
     755This allows any number of changes to be performed at each scope-level with a constant time.
     756<!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
     757</para>
     758            </section>
     759                        <section>
     760                       <title>Error Handling</title>
     761<para>
     762<!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
     763Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
     764As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
     765if not, the current state determines the type and severity of the error.
     766\icXML{} emits errors in the similar manner&#8212;but how it discovers them is substantially different.
     767Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and \MP{},
     768each with its own system for detecting and producing error messages.
     769</para>
     770<para>
     771Within the \PS{}, all computations are performed in parallel, a block at a time.
     772Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte-position of an error within a block,
     773and the type of error is determined by the equation that discovered it.
     774The difficulty of error processing in this section is that in Xerces the line and column number must be given
     775with every error production. Two major issues exist because of this:
     776(1) line position adheres to XML white-normalization rules; as such, some sequences of characters, e.g., a carriage return
     777followed by a line feed, are counted as a single new line character.
     778(2) column position is counted in characters, not bytes or code units;
     779thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
     780Note that typical XML documents are error-free but the calculation of the
     781line/column position is a constant overhead in Xerces. <!-- that must be maintained in the case that one occurs. -->
     782To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and
     783performs the minimal amount of book-keeping necessary to facilitate it.
     784\icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
     785within the Line Column Tracker (LCT).
     786One of the CSA's major responsibilities is transcoding an input text. <!-- from some encoding format to near-output-ready UTF-16. -->
     787During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
     788and validated.
     789A {\it line-feed \bitstream{}}, which marks the positions of the normalized new lines characters, is a natural derivative of
     790this process.
     791Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
     792<!-- The optimization delays the counting process .... -->
     793Column position is more difficult to calculate.
     794It is possible to scan backwards through the \bitstream{} of new line characters to determine the distance (in code-units)
     795between the position between which an error was detected and the last line feed. However, this distance may exceed
     796than the actual character position for the reasons discussed in (2).
     797To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s,
     798such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
     799normalization process.
     800When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
     801column number.
     802</para>
     803<para>
     804The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces.
     805However, reporting the correct line/column is a much more difficult problem.
     806The \MP{} parses the content stream, which is a series of tagged UTF-16 strings.
     807Each string is normalized in accordance with the XML specification.
     808All symbol data and unnecessary whitespace is eliminated from the stream;
     809thus its impossible to derive the current location using only the content stream.
     810To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
     811the line-feed, skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every
     812datum that was suppressed from the source during the production of the content stream.
     813Armed with these, it is possible to calculate the actual line/column using
     814the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
     815</para>
     816            </section>
     817                </section>
    1441818
    1442          <para>Based on the character class compiler, we are currently investigating the
    1443             construction of a regular expression compiler that can implement bit-stream based
    1444             parallel regular-expression matching similar to that describe previously for parallel
    1445             parsing by bistream addition. This compiler works with the assumption that bitstream
    1446             regular-expression definitions are deterministic; no backtracking is permitted with the
    1447             parallel bit stream representation. In XML applications, this compiler is primarily
    1448             intended to enforce regular-expression constraints on string datatype specifications
    1449             found in XML schema. </para>
     819                <section>
     820                        <title>Multithreading with Pipeline Parallelism</title>         
     821<para>
     822As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application.
     823These are ``embarrassingly sequential.''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
     824However, \icXML{} is designed to organize processing into logical layers.   
     825In particular, layers within the \PS{} are designed to operate
     826over significant segments of input data before passing their outputs on for
     827subsequent processing.  This fits well into the general model of pipeline
     828parallelism, in which each thread is in charge of a single module or group
     829of modules.
     830</para>
     831<para>
     832The most straightforward division of work in \icXML{} is to separate
     833the \PS{} and the \MP{} into distinct logical layers into two separate stages.
     834The resultant application, {\it\icXMLp{}}, is a course-grained software-pipeline application.
     835In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the
     836content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$.
     837The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation,
     838and the provides parsed XML data to the application through the Xerces API. 
     839The shared data structure is implemented using a ring buffer,
     840where every entry contains an independent set of data streams.
     841In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
     842A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
     843In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than $T_2$;
     844thus $T_2$ always waits for $T_1$ to write to the shared memory.
     845Figure \ref{threads_timeline2} illustrates the scenario in which $T_1$ is faster
     846and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
     847</para>
     848<para>
     849<!-- FIGURE
     850\begin{figure}
     851\subfigure[]{
     852\includegraphics[width=0.48\textwidth]{plots/threads_timeline1.pdf}
     853\label{threads_timeline1}
     854}
     855\hfill
     856\subfigure[]{
     857\includegraphics[width=0.48\textwidth]{plots/threads_timeline2.pdf}
     858\label{threads_timeline2}
     859}
     860\caption{Thread Balance in Two-Stage Pipelines}
    1450861
    1451       </section>
     862\end{figure}
     863-->
     864</para>
     865<para>
     866Overall, our design is intended to benefit a range of applications.
     867Conceptually, we consider two design points.
     868The first, the parsing performed by the \PS{} dominates at 67% of the overall cost,
     869with the cost of application processing (including the driver logic within the \MP{}) at 33%.   
     870The second is almost the opposite scenario, the cost of application processing dominates at 60%,
     871while the cost of XML parsing represents an overhead of 40%.
     872</para>
     873<para>
     874Our design is predicated on a goal of using the Parabix
     875framework to achieve a 50% to 100% improvement in the parsing engine itself.   
     876In a best case scenario,
     877a 100% improvement of the \PS{} for the design point in which
     878XML parsing dominates at 67% of the total application cost.
     879In this case, the single-threaded \icXML{} should achieve a 1.5x speedup over Xerces
     880so that the total application cost reduces to 67% of the original. 
     881However, in \icXMLp{}, our ideal scenario gives us two well-balanced threads
     882each performing about 33% of the original work.   
     883In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.
     884</para>
     885<para>
     886At the other extreme of our design range, we consider an application
     887in which core parsing cost is 40%.   Assuming the 2x speedup of
     888the \PS{} over the corresponding Xerces core, single-threaded
     889\icXML{} delivers a 25% speedup.   However, the most significant
     890aspect of our two-stage multi-threaded design then becomes the
     891ability to hide the entire latency of parsing within the serial time
     892required by the application.   In this case, we achieve
     893an overall speedup in processing time by 1.67x.
     894</para>
     895<para>
     896Although the structure of the \PS{} allows division of the work into
     897several pipeline stages and has been demonstrated to be effective
     898for four pipeline stages in a research prototype \cite{HPCA2012},
     899our analysis here suggests that the further pipelining of work within
     900the \PS{} is not worthwhile if the cost of application logic is little as
     90133% of the end-to-end cost using Xerces.  To achieve benefits of
     902further parallelization with multi-core technology, there would
     903need to be reductions in the cost of application logic that
     904could match reductions in core parsing cost.
     905</para>
     906                </section>
    1452907
    1453       <section>
    1454          <title>Unbounded Bit Stream Compilation</title>
     908                <section>
     909                        <title>Performance</title>             
     910<para>
     911We evaluate \xerces{}, \icXML{}, \icXMLp{} against two benchmarking applications:
     912the Xerces C++ SAXCount sample application,
     913and a real world GML to SVG transformation application.
     914We investigated XML parser performance using an Intel Core i7 quad-core
     915(Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core),
     91632+32 kB (per core) L1 cache,
     917256 kB (per core) L2 cache,
     9188 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
     919</para>
     920<para>
     921We analyzed the execution profiles of each XML parser
     922using the performance counters found in the processor.
     923We chose several key hardware events that provide insight into the profile of each
     924application and indicate if the processor is doing useful work. 
     925The set of events included in our study are:
     926processor cycles, branch instructions, branch mispredictions,
     927and cache misses. The Performance Application Programming Interface
     928(PAPI) Version 5.5.0 \cite{papi} toolkit
     929was installed on the test system to facilitate the
     930collection of hardware performance monitoring
     931statistics. In addition, we used the Linux perf \cite{perf} utility
     932to collect per core hardware events.
     933</para>
     934                        <section>
     935                       <title>Xerces C++ SAXCount</title>
     936<para>
     937Xerces comes with sample applications that demonstrate salient features of the parser.
     938SAXCount is the simplest such application:
     939it counts the elements, attributes and characters of a given XML file using the (event based) SAX API
     940and prints out the totals.
     941</para>
     942<para>
     943<!-- TABLE
     944\begin{table}
     945\begin{center}
     946{
     947\footnotesize
     948\begin{tabular}{|l||l|l|l|l|l|}
     949\hline
     950File Name               & jaw.xml               & road.gml      & po.xml        & soap.xml \\ \hline   
     951File Type               & document              & data          & data          & data   \\ \hline     
     952File Size (kB)          & 7343                  & 11584         & 76450         & 2717 \\ \hline
     953Markup Item Count       & 74882                 & 280724        & 4634110       & 18004 \\ \hline
     954Markup Density          & 0.13                  & 0.57          & 0.76          & 0.87  \\ \hline
     955\end{tabular}
     956}
     957\end{center}
     958\caption{XML Document Characteristics}
     959\label{XMLDocChars}
     960\end{table}
     961-->
     962</para>
     963<para>
     964Table \ref{XMLDocChars} shows the document characteristics of the XML input
     965files selected for the Xerces C++ SAXCount benchmark. The jaw.xml
     966represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequence
     967required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented
     968XML documents and consist entirely of single byte encoded ASCII characters.
     969</para>
     970<para>
     971A key predictor of the overall parsing performance of an XML file is markup density\footnote{
     972  Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}.
     973This metric has substantial influence on the performance of traditional recursive descent XML parsers
     974because it directly corresponds to the number of state transitions that occur when parsing a document.
     975We use a mixture of document-oriented and
     976data-oriented XML files to analyze performance over a spectrum
     977of markup densities.
     978</para>
     979<para>
     980Figure \ref{perf_SAX} compares the performance of Xerces, \icXML{} and pipelined \icXML{} in terms of
     981CPU cycles per byte for the SAXCount application.
     982The speedup for \icXML{} over Xerces is 1.3x to 1.8x.
     983With two threads on the multicore machine, \icXMLp{} can achieve speedup up to 2.7x.
     984Xerces is substantially slowed by dense markup
     985but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques.
     986\icXMLp{} performs better as markup-density increases because the work performed by each stage is
     987well balanced in this application.
     988</para>
     989<para>
     990<!-- FIGURE
     991\begin{figure}
     992\includegraphics[width=0.5\textwidth]{plots/perf_SAX.pdf}
     993\caption{SAXCount Performance Comparison}
     994\label{perf_SAX}
     995\end{figure}
     996-->
     997</para>
     998            </section>
     999                        <section>
     1000                       <title>GML2SVG</title>
     1001                       <para></para>
     1002            </section>
     1003                </section>
    14551004
    1456          <para>The Catalog of XML Bit Streams presented earlier consist of a set of abstract,
    1457             unbounded bit streams, each in one-to-one correspondence with input bytes of a text
    1458             file. Determining how these bit streams are implemented using fixed-width SIMD
    1459             registers, and possibly processed in fixed-length buffers that represent some multiple
    1460             of the register width is a source of considerable programming complexity. The general
    1461             goal of our compilation strategy in this case is to allow operations to be programmed in
    1462             terms of unbounded bit streams and then automatically reduced to efficient low-level
    1463             code with the application of a systematic code generation strategy for handling block
    1464             and buffer boundary crossing. This is work currently in progress. </para>
     1005                <section>
     1006                        <title>Conclusion and Future Work</title>               
     1007<para>
     1008This paper is the first case study documenting the significant
     1009performance benefits that may be realized through the integration
     1010of parallel \bitstream{} technology into existing widely-used software libraries.
     1011In the case of the Xerces-C++ XML parser, the
     1012combined integration of SIMD and multicore parallelism was
     1013shown capable of dramatic producing dramatic increases in
     1014throughput and reductions in branch mispredictions and cache misses.
     1015The modified parser, going under the name \icXML{} is designed
     1016to provide the full functionality of the original Xerces library
     1017with complete compatibility of APIs.  Although substantial
     1018re-engineering was required to realize the
     1019performance potential of parallel technologies, this
     1020is an important case study demonstrating the general
     1021feasibility of these techniques.
     1022</para>
     1023<para>
     1024The further development of \icXML{} to move beyond 2-stage
     1025pipeline parallelism is ongoing, with realistic prospects for
     1026four reasonably balanced stages within the library.  For
     1027applications such as GML2SVG which are dominated by time
     1028spent on XML parsing, such a multistage pipelined parsing
     1029library should offer substantial benefits. 
     1030</para>
     1031<para>
     1032The example of XML parsing may be considered prototypical
     1033of finite-state machines applications which have sometimes
     1034been considered ``embarassingly sequential'' and so
     1035difficult to parallelize that ``nothing works.''  So the
     1036case study presented here should be considered an important
     1037data point in making the case that parallelization can
     1038indeed be helpful across a broad array of application types.
     1039</para>
     1040<para>
     1041To overcome the software engineering challenges in applying
     1042parallel \bitstream{} technology to existing software systems,
     1043it is clear that better library and tool support is needed.
     1044The techniques used in the implementation of \icXML{} and
     1045documented in this paper could well be generalized for
     1046applications in other contexts and automated through
     1047the creation of compiler technology specifically supporting
     1048parallel \bitstream{} programming.
     1049</para>
     1050                </section>
    14651051
    1466       </section>
    1467    </section>
    1468 
    1469    <section>
    1470       <title>Conclusion</title>
    1471       <para>Parallel bit stream technology offers the opportunity to dramatically speed up the core
    1472          XML processing components used to implement virtually any XML API. Character validation and
    1473          transcoding, whitespace processing, and parsing up to including the full validation of tag
    1474          syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the
    1475          positions of all element names, attribute names and attribute values can also be produced,
    1476          followed by fast bit scan operations to generate position and length values. Beyond bit
    1477          streams, byte-oriented SIMD processing of names and numerals can also accelerate
    1478          performance beyond sequential byte-at-a-time methods. </para>
    1479       <para>Advances in processor architecture are likely to further amplify the performance of
    1480          parallel bit stream technology over traditional byte-at-a-time processing over the next
    1481          decade. Improvements to SIMD register width, register complement and operation format can
    1482          all result in further gains. New SIMD instruction set features such as inductive doubling
    1483          support, parallel extract and deposit instructions, bit interleaving and scatter/gather
    1484          capabilities should also result in significant speed-ups. Leveraging the intraregister
    1485          parallelism of parallel bit stream technology within SIMD registers to take of intrachip
    1486          parallelism on multicore processors should accelerate processing further. </para>
    1487       <para>Technology transfer using a patent-based open-source business model is a further goal of
    1488          our work with a view to widespread deployment of parallel bit stream technology in XML
    1489          processing stacks implementing a variety of APIs. The feasibility of substantial
    1490          performance improvement in replacement of technology implementing existing APIs has been
    1491          demonstrated even in complex software architectures involving delivery of performance
    1492          benefits across the JNI boundary. We are seeking to accelerate these deployment efforts
    1493          both through the development of compiler technology to reliably apply these methods to a
    1494          variety of architectures as well as to identify interested collaborators using open-source
    1495          or commercial models. </para>
    1496    </section>
    1497 
     1052<!--     
    14981053   <section>
    14991054      <title>Acknowledgments</title>
    1500       <para>This work is supported in part by research grants and scholarships from the Natural
    1501          Sciences and Engineering Research Council of Canada, the Mathematics of Information
    1502          Technology and Complex Systems Network and the British Columbia Innovation Council. </para>
    1503       <para>We thank our colleague Dan Lin (Linda) for her work in high-performance symbol table
    1504          processing. </para>
     1055      <para></para>
    15051056   </section>
    1506 
     1057-->
    15071058   <bibliography>
    15081059      <title>Bibliography</title>
Note: See TracChangeset for help on using the changeset viewer.