source: docs/Balisage13/Bal2013came0601/Bal2013came0601.html @ 3038

Last change on this file since 3038 was 3038, checked in by ksherdy, 6 years ago

Moved over Bal 09 document to use as a template. Document validates agains 1-3.dtd.

File size: 85.8 KB
1<html lang="en">
3<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
4<title>Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</title>
5<link rel="stylesheet" href="balisage-plain.css" type="text/css">
6<meta name="keywords" content=", , ">
9<div id="balisage-header"><h1 style="text-align: right; font-family: serif; margin:0.25em">
10<i>Balisage:</i> <small>The Markup Conference</small>
12<div lang="en" class="article">
13<div class="titlepage">
14<h2 class="article-title" id="idp346280">Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</h2>
15<div class="author">
16<h3 class="author">Rob Cameron</h3>
17<div class="affiliation">
18<p class="jobtitle">Professor of Computing Science</p>
19<p class="orgname">Simon Fraser University</p>
21<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
23<div class="author">
24<h3 class="author">Ken Herdy</h3>
25<div class="affiliation">
26<p class="jobtitle">Graduate Student, School of Computing Science</p>
27<p class="orgname">Simon Fraser University </p>
29<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
31<div class="author">
32<h3 class="author">Ehsan Amiri</h3>
33<div class="affiliation">
34<p class="jobtitle">Graduate Student, School of Computing Science</p>
35<p class="orgname">Simon Fraser University</p>
37<h5 class="author-email"><code class="email">&lt;<a class="email" href=""></a>&gt;</code></h5>
39<div class="legalnotice" id="idp498304"><p id="idp498432">Copyright © 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
40            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
41            Works 2.5 Canada License.</p></div>
42<div class="abstract">
43<p class="title"><b>Abstract</b></p>
44<p id="idp514976">By first transforming the octets (bytes) of XML texts into eight parallel bit
45            streams, the SIMD features of commodity processors can be exploited for parallel
46            processing of blocks of 128 input bytes at a time. Established transcoding and parsing
47            techniques are reviewed followed by new techniques including parsing with bitstream
48            addition. Further opportunities are discussed in light of expected advances in CPU
49            architecture and compiler technology. Implications for various APIs and information
50            models are presented as well opportunities for collaborative open-source
51         development.</p>
55<div class="toc">
56<p><b>Table of Contents</b></p>
58<dt><span class="section"><a href="#idp499912" class="toc">Introduction</a></span></dt>
59<dt><span class="section"><a href="#idp644272" class="toc">A Catalog of Parallel Bit Streams for XML</a></span></dt>
61<dt><span class="section"><a href="#idp644640" class="toc">Introduction</a></span></dt>
62<dt><span class="section"><a href="#idp646032" class="toc">Basis Bit Streams</a></span></dt>
63<dt><span class="section"><a href="#idp696728" class="toc">General Streams</a></span></dt>
65<dt><span class="section"><a href="#idp697328" class="toc">Deletion Mask Streams</a></span></dt>
66<dt><span class="section"><a href="#idp707448" class="toc">Error Flag Streams </a></span></dt>
68<dt><span class="section"><a href="#idp716704" class="toc">Lexical Item Streams</a></span></dt>
69<dt><span class="section"><a href="#idp755744" class="toc">UTF-8 Byte Classification, Scope and Validation Streams</a></span></dt>
71<dt><span class="section"><a href="#idp757104" class="toc">UTF-8 Byte Classification Streams</a></span></dt>
72<dt><span class="section"><a href="#idp758312" class="toc">UTF-8 Scope Streams</a></span></dt>
73<dt><span class="section"><a href="#idp778672" class="toc">UTF-8 Validation Streams</a></span></dt>
74<dt><span class="section"><a href="#idp781056" class="toc">XML Character Validation Streams</a></span></dt>
75<dt><span class="section"><a href="#idp782512" class="toc">UTF-8 to UTF-16 Transcoding</a></span></dt>
76<dt><span class="section"><a href="#idp785384" class="toc">UTF-8 Indexed UTF-16 Streams</a></span></dt>
78<dt><span class="section"><a href="#idp786760" class="toc">Control Character Streams</a></span></dt>
80<dt><span class="section"><a href="#idp787768" class="toc">XML Character Validation</a></span></dt>
81<dt><span class="section"><a href="#idp789136" class="toc">XML 1.0 End-of-line Handling</a></span></dt>
83<dt><span class="section"><a href="#idp797864" class="toc">Call Out Streams</a></span></dt>
85<dt><span class="section"><a href="#idp798776" class="toc">Comment, Processing Instruction and CDATA Section Call Out Streams</a></span></dt>
86<dt><span class="section"><a href="#idp807344" class="toc">Reference Call Out Streams</a></span></dt>
87<dt><span class="section"><a href="#idp809320" class="toc">Tag Call Out Streams</a></span></dt>
90<dt><span class="section"><a href="#idp822632" class="toc">SIMD Beyond Bitstreams: Names and Numbers</a></span></dt>
92<dt><span class="section"><a href="#idp823448" class="toc">Name Lookup</a></span></dt>
93<dt><span class="section"><a href="#idp830280" class="toc">Numeric Processing</a></span></dt>
95<dt><span class="section"><a href="#idp833080" class="toc">APIs and Parallel Bit Streams</a></span></dt>
97<dt><span class="section"><a href="#idp833440" class="toc">The ILAX Streaming API</a></span></dt>
98<dt><span class="section"><a href="#idp836736" class="toc">Efficient XML in Java Using Array Set Models</a></span></dt>
99<dd><dl><dt><span class="section"><a href="#idp845360" class="toc">Saxon-B TinyTree Example</a></span></dt></dl></dd>
101<dt><span class="section"><a href="#idp853000" class="toc">Compiler Technology</a></span></dt>
103<dt><span class="section"><a href="#idp855592" class="toc">Character Class Compiler</a></span></dt>
104<dt><span class="section"><a href="#idp857544" class="toc">Regular Expression Compilation</a></span></dt>
105<dt><span class="section"><a href="#idp858856" class="toc">Unbounded Bit Stream Compilation</a></span></dt>
107<dt><span class="section"><a href="#idp860384" class="toc">Conclusion</a></span></dt>
108<dt><span class="section"><a href="#idp864448" class="toc">Acknowledgments</a></span></dt>
111<div class="section" id="idp499912">
112<h2 class="title" style="clear: both">Introduction</h2>
113<p id="idp500256"> While particular XML applications may benefit from special-purpose hardware such as XML
114         chips [<a class="xref" href="#XMLChip09" id="idp500496">Leventhal and Lemoine 2009</a>] or appliances [<a class="xref" href="#Datapower09" id="idp506976">Salz, Achilles and Maze 2009</a>], the bulk
115         of the world's XML processing workload will continue to be handled by XML software stacks
116         on commodity processors. Exploiting the SIMD capabilities of such processors such as the
117         SSE instructions of x86 chips, parallel bit stream technology offers the potential of
118         dramatic improvement over byte-at-a-time processing for a variety of XML processing tasks.
119         Character set issues such as Unicode validation and transcoding [<a class="xref" href="#PPoPP08" id="idp507840">Cameron 2007</a>], normalization of line breaks and white space and XML character validation can be
120         handled fully in parallel using this representation. Lexical item streams, such as the bit
121         stream marking the positions of opening angle brackets, can also be formed in parallel.
122         Bit-scan instructions of commodity processors may then be used on lexical item streams to
123         implement rapid single-instruction scanning across variable-length multi-byte text blocks
124         as in the Parabix XML parser [<a class="xref" href="#CASCON08" id="idp508744">Cameron, Herdy and Lin 2008</a>]. Overall, these techniques may be
125         combined to yield end-to-end performance that may be 1.5X to 15X faster than alternatives
126            [<a class="xref" href="#SVGOpen08" id="idp509256">Herdy, Burggraf and Cameron 2008</a>].</p>
127<p id="idp509720">Continued research in parallel bit stream techniques as well as more conventional
128         application of SIMD techniques in XML processing offers further prospects for improvement
129         of core XML components as well as for tackling performance-critical tasks further up the
130         stack. A newly prototyped technique for parallel tag parsing using bitstream addition is
131         expected to improve parsing performance even beyond that achieved using sequential bit
132         scans. Several techniques for improved symbol table performance are being investigated,
133         including parallel hash value calculation and length-based sorting using the cheap length
134         determination afforded by bit scans. To deliver the benefits of parallel bit stream
135         technology to the Java world, we are developing Array Set Model (ASM) representations of
136         XML Infoset and other XML information models for efficient transmission across the JNI
137         boundary.</p>
138<p id="idp510920">Amplifying these software advances, continuing hardware advances in commodity processors
139         increase the relative advantage of parallel bit stream techniques over traditional
140         byte-at-a-time processors. For example, the Intel Core architecture improved SSE processing
141         to give superscalar execution of bitwise logic operations (3 instructions per cycle vs. 1
142         in Pentium 4). Upcoming 256-bit AVX technology extends the register set and replaces
143         destructive two-operand instructions with a nondestructive three-operand form. General
144         purpose programming on graphic processing units (GPGPU) such as the upcoming 512-bit
145         Larrabee processor may also be useful for XML applications using parallel bit streams. New
146         instruction set architectures may also offer dramatic improvements in core algorithms.
147         Using the relatively simple extensions to support the principle of inductive doubling, a 3X
148         improvement in several core parallel bit stream algorithms may be achieved [<a class="xref" href="#ASPLOS09" id="idp638256">Cameron and Lin 2009</a>]. Other possibilities include direct implementation of parallel
149         extract and parallel deposit (pex/pdep) instructions [<a class="xref" href="#Pex06" id="idp638640">Hilewitz and Lee 2006</a>], and
150         bit-level interleave operations as in Larrabee, each of which would have important
151         application to parallel bit stream processing.</p>
152<p id="idp639128">Further prospects for XML performance improvement arise from leveraging the
153         intraregister parallelism of parallel bit stream technology to exploit the interchip
154         parallelism of multicore computing. Parallel bit stream techniques can support multicore
155         parallelism in both data partitioning and task partitioning models. For example, the
156         datasection partitioning approach of Wu, Zhang, Yu and Li may be used to partition blocks
157         for speculative parallel parsing on separate cores followed by a postprocessing step to
158         join partial S-trees [<a class="xref" href="#Wu08" id="idp639856">Wu et al. 2008</a>].</p>
159<p id="idp640296">In our view, the established and expected performance advantages of parallel bit stream
160         technology over traditional byte-at-a-time processing are so compelling that parallel bit
161         stream technology should ultimately form the foundation of every high-performance XML
162         software stack. We envision a common high-performance XML kernel that may be customized to
163         a variety of processor architectures and that supports a wide range of existing and new XML
164         APIs. Widespread deployment of this technology should greatly benefit the XML community in
165         addressing both the deserved and undeserved criticism of XML on performance grounds. A
166         further benefit of improved performance is a substantial greening of XML technologies.</p>
167<p id="idp641272">To complement our research program investigating fundamental algorithms and issues in
168         high-performance XML processing, our work also involves development of open source software
169         implementing these algorithms, with a goal of full conformance to relevant specifications.
170         From the research perspective, this approach is valuable in ensuring that the full
171         complexity of required XML processing is addressed in reporting and assessing processing
172         results. However, our goal is also to use this open source software as a basis of
173         technology transfer. A Simon Fraser University spin-off company, called International
174         Characters, Inc., has been created to commercialize the results of this work using a
175         patent-based open source model.</p>
176<p id="idp642272">To date, we have not yet been successful in establishing a broader community of
177         participation with our open source code base. Within open-source communities, there is
178         often a general antipathy towards software patents; this may limit engagement with our
179         technology, even though it has been dedicated for free use in open source. </p>
180<p id="idp642832">A further complication is the inherent difficulty of SIMD programming in general, and
181         parallel bit stream programming in particular. Considerable work is required with each new
182         algorithmic technique being investigated as well as in retargetting our techniques for each
183         new development in SIMD and multicore processor technologies. To address these concerns, we
184         have increasingly shifted the emphasis of our research program towards compiler technology
185         capable of generating parallel bit stream code from higher-level specifications.</p>
187<div class="section" id="idp644272">
188<h2 class="title" style="clear: both">A Catalog of Parallel Bit Streams for XML</h2>
189<div class="section" id="idp644640">
190<h3 class="title" style="clear: both">Introduction</h3>
191<p id="idp644984">In this section, we introduce the fundamental concepts of parallel bit stream
192            technology and present a comprehensive catalog of parallel bit streams for use in XML
193            processing. In presenting this catalog, the focus is on the specification of the bit
194            streams as data streams in one-to-one correspondence with the character code units of an
195            input XML stream. The goal is to define these bit streams in the abstract without
196            initially considering memory layouts, register widths or other issues related to
197            particular target architectures. In cataloging these techniques, we also hope to convey
198            a sense of the breadth of applications of parallel bit stream technology to XML
199            processing tasks. </p>
201<div class="section" id="idp646032">
202<h3 class="title" style="clear: both">Basis Bit Streams</h3>
203<p id="idp646376">Given a byte-oriented text stream represented in UTF-8, for example, we define a
204            transform representation of this text consisting of a set of eight parallel bit streams
205            for the individual bits of each byte. Thus, the <code class="code">Bit0</code> stream is the stream
206            of bits consisting of bit 0 of each byte in the input byte stream, <code class="code">Bit1</code> is
207            the bit stream consisting of bit 1 of each byte in the input stream and so on. The set
208            of streams <code class="code">Bit0</code> through <code class="code">Bit7</code> are known as the <span class="ital">basis
209               streams</span> of the parallel bit stream representation. The following table
210            shows an example XML character stream together with its representation as a set of 8
211            basis streams. <div class="table-wrapper" id="idp648288">
212<p class="title">Table I</p>
213<div class="caption"><p id="idp451472">XML Character Stream Transposition.</p></div>
214<table class="table">
215<colgroup span="1">
216<col align="left" valign="top" span="1">
217<col align="left" valign="top" span="1">
218<col align="left" valign="top" span="1">
219<col align="left" valign="top" span="1">
220<col align="left" valign="top" span="1">
221<col align="left" valign="top" span="1">
224<tr valign="top">
225<td>Input Data</td>
227                        <code class="code">&lt;</code>
228                     </td>
230                        <code class="code">t</code>
231                     </td>
233                        <code class="code">a</code>
234                     </td>
236                        <code class="code">g</code>
237                     </td>
239                        <code class="code">/</code>
240                     </td>
242                        <code class="code">&gt;</code>
243                     </td>
245<tr valign="top">
248                        <code class="code">00111100</code>
249                     </td>
251                        <code class="code">01110100</code>
252                     </td>
254                        <code class="code">01100001</code>
255                     </td>
257                        <code class="code">01100111</code>
258                     </td>
260                        <code class="code">00101111</code>
261                     </td>
263                        <code class="code">00111110</code>
264                     </td>
266<tr valign="top">
269                        <code class="code">0</code>
270                     </td>
272                        <code class="code">0</code>
273                     </td>
275                        <code class="code">0</code>
276                     </td>
278                        <code class="code">0</code>
279                     </td>
281                        <code class="code">0</code>
282                     </td>
284                        <code class="code">0</code>
285                     </td>
287<tr valign="top">
290                        <code class="code">0</code>
291                     </td>
293                        <code class="code">1</code>
294                     </td>
296                        <code class="code">1</code>
297                     </td>
299                        <code class="code">1</code>
300                     </td>
302                        <code class="code">0</code>
303                     </td>
305                        <code class="code">0</code>
306                     </td>
308<tr valign="top">
311                        <code class="code">1</code>
312                     </td>
314                        <code class="code">1</code>
315                     </td>
317                        <code class="code">1</code>
318                     </td>
320                        <code class="code">1</code>
321                     </td>
323                        <code class="code">1</code>
324                     </td>
326                        <code class="code">1</code>
327                     </td>
329<tr valign="top">
332                        <code class="code">1</code>
333                     </td>
335                        <code class="code">1</code>
336                     </td>
338                        <code class="code">0</code>
339                     </td>
341                        <code class="code">0</code>
342                     </td>
344                        <code class="code">0</code>
345                     </td>
347                        <code class="code">1</code>
348                     </td>
350<tr valign="top">
353                        <code class="code">1</code>
354                     </td>
356                        <code class="code">0</code>
357                     </td>
359                        <code class="code">0</code>
360                     </td>
362                        <code class="code">0</code>
363                     </td>
365                        <code class="code">1</code>
366                     </td>
368                        <code class="code">1</code>
369                     </td>
371<tr valign="top">
374                        <code class="code">1</code>
375                     </td>
377                        <code class="code">1</code>
378                     </td>
380                        <code class="code">0</code>
381                     </td>
383                        <code class="code">1</code>
384                     </td>
386                        <code class="code">1</code>
387                     </td>
389                        <code class="code">1</code>
390                     </td>
392<tr valign="top">
395                        <code class="code">0</code>
396                     </td>
398                        <code class="code">0</code>
399                     </td>
401                        <code class="code">0</code>
402                     </td>
404                        <code class="code">1</code>
405                     </td>
407                        <code class="code">1</code>
408                     </td>
410                        <code class="code">1</code>
411                     </td>
413<tr valign="top">
416                        <code class="code">0</code>
417                     </td>
419                        <code class="code">0</code>
420                     </td>
422                        <code class="code">1</code>
423                     </td>
425                        <code class="code">1</code>
426                     </td>
428                        <code class="code">1</code>
429                     </td>
431                        <code class="code">0</code>
432                     </td>
437         </p>
438<p id="idp694048"> Depending on the features of a particular processor architecture, there are a number
439            of algorithms for transposition to parallel bit stream form. Several of these algorithms
440            employ a three-stage structure. In the first stage, the input byte stream is divided
441            into a pair of half-length streams consisting of four bits for each byte, for example,
442            one stream for the high nybble of each byte and another for the low nybble of each byte.
443            In the second stage, these streams of four bits per byte are each divided into streams
444            consisting of two bits per original byte, for example streams for the
445            <code class="code">Bit0/Bit1</code>, <code class="code">Bit2/Bit3</code>, <code class="code">Bit4/Bit5</code>, and
446               <code class="code">Bit6/Bit7</code> pairs. In the final stage, the streams are further subdivided
447            in the individual bit streams. </p>
448<p id="idp695744"> Using SIMD capabilities, this process is quite efficient, with an amortized cost of
449            1.1 CPU cycles per input byte on Intel Core 2 with SSE, or 0.6 CPU cycles per input byte
450            on Power PC G4 with Altivec. With future advances in processor technology, this
451            transposition overhead is expected to reduce, possibly taking advantage of upcoming
452            parallel extract (pex) instructions on Intel technology. In the ideal, only 24
453            instructions are needed to transform a block of 128 input bytes using 128-bit SSE
454            registers using the inductive doubling instruction set architecture, representing an
455            overhead of less than 0.2 instructions per input byte. </p>
457<div class="section" id="idp696728">
458<h3 class="title" style="clear: both">General Streams</h3>
459<p id="idp697048">This section describes bit streams which support basic processing operations.</p>
460<div class="section" id="idp697328">
461<h4 class="title" style="clear: both">Deletion Mask Streams</h4>
462<p id="idp697648">DelMask (deletion mask) streams marks character code unit positions for deletion.
463               Since the deletion operation is dependency free across many stages of XML processing,
464               it is possible to simply mark and record deletion positions as deletion mask streams for future processing. A single
465               invocation of a SIMD based parallel deletion algorithm can then perform the deletion of
466               positions accumulated across a number of stages through a bitwise ORing of deletion
467               masks. For example, deletion arises in the replacement of predefined entities with a
468               single character, such as in the replacement of the &amp;amp; entity, with the
469               &amp; character. Deletion also arises in XML
470               end-of-line handling, and CDATA section delimeter processing. Several algorithms to
471               delete bits at positions marked by DelMask are possible [<a class="xref" href="#u8u16" id="idp699152">Cameron 2008</a>]. </p>
472<p id="idp699464">The following table provides an example of generating a DelMask in the context of
473               bit stream based parsing of well-formed character references and predefined entities.
474               The result is the generation of a DelMask stream. <div class="table-wrapper" id="idp699848">
475<p class="title">Table II</p>
476<div class="caption"><p id="idp700104">DelMask Stream Generation</p></div>
477<table class="table">
478<colgroup span="1">
479<col align="left" valign="top" span="1">
480<col align="left" valign="top" span="1">
483<tr valign="top">
484<td>Input Data</td>
486                           <code class="code">&amp;gt; &amp;#13; &amp;#x0a;</code>
487                        </td>
489<tr valign="top">
492                           <code class="code">_11______________</code>
493                        </td>
495<tr valign="top">
498                           <code class="code">_______11________</code>
499                        </td>
501<tr valign="top">
504                           <code class="code">______________11_</code>
505                        </td>
507<tr valign="top">
510                           <code class="code">111__1111__11111_</code>
511                        </td>
513<tr valign="top">
516                           <code class="code">_________________</code>
517                        </td>
522            </p>
524<div class="section" id="idp707448">
525<h4 class="title" style="clear: both">Error Flag Streams </h4>
526<p id="idp707768">Error flag streams indicates the character code unit positions of syntactical
527               errors. XML processing examples which benefit from the marking of error positions
528               include UTF-8 character sequence validation and XML parsing [<a class="xref" href="#u8u16" id="idp708152">Cameron 2008</a>].</p>
529<p id="idp708464">The following table provides an example of using bit streams to parse character
530               references and predefined entities which fail to meet the XML 1.0 well-formedness
531               constraints. The result is the generation of an error flag stream that marks the
532               positions of mal-formed decimal and hexical character references respectively. <div class="table-wrapper" id="idp708968">
533<p class="title">Table III</p>
534<div class="caption"><p id="idp709224">Error Flag Stream Generation</p></div>
535<table class="table">
536<colgroup span="1">
537<col align="left" valign="top" span="1">
538<col align="left" valign="top" span="1">
541<tr valign="top">
542<td>Input Data</td>
544                           <code class="code">&amp;gt; &amp;#, &amp;#x; </code>
545                        </td>
547<tr valign="top">
550                           <code class="code">_11___________</code>
551                        </td>
553<tr valign="top">
556                           <code class="code">______________</code>
557                        </td>
559<tr valign="top">
562                           <code class="code">______________</code>
563                        </td>
565<tr valign="top">
568                           <code class="code">111__11__111__</code>
569                        </td>
571<tr valign="top">
574                           <code class="code">_______1____1_</code>
575                        </td>
580            </p>
583<div class="section" id="idp716704">
584<h3 class="title" style="clear: both">Lexical Item Streams</h3>
585<p id="idp717056">Lexical item streams differ from traditional streams of tokens in that they are bit
586            streams that mark the positions of tokens, whitespace or delimiters. Additional bit
587            streams, such as the reference streams and callout streams, are subsequently constructed
588            based on the information held within the set of lexical items streams. Differentiation
589            between the actual tokens that may occur at a particular point (e.g., the different XML
590            tokens that begin “&lt;”) may be performed using multicharacter recognizers on the
591            bytestream representation [<a class="xref" href="#CASCON08" id="idp718216">Cameron, Herdy and Lin 2008</a>].</p>
592<p id="idp718568">A key role of lexical item streams in XML parsing is to facilitate fast scanning
593            operations. For example, a left angle bracket lexical item stream may be formed to
594            identify those character code unit positions at which a “&lt;” character occurs.
595            Hardware register bit scan operations may then be used by the XML parser on the left
596            angle bracket stream to efficiently identify the position of the next “&lt;”. Based
597            on the capabilities of current commodity processors, a single register bit scan
598            operation may effectively scan up to 64 byte positions with a single instruction.</p>
599<p id="idp720152">Overall, the construction of the full set of lexical item stream computations
600            requires approximately 1.0 CPU cycles per byte when implemented for 128 positions at a
601            time using 128-bit SSE registers on Intel Core2 processors [<a class="xref" href="#CASCON08" id="idp720536">Cameron, Herdy and Lin 2008</a>].
602            The following table defines the core lexical item streams defined by the Parabix XML
603            parser.</p>
604<p id="idp721048">
605            <div class="table-wrapper" id="idp721176">
606<p class="title">Table IV</p>
607<div class="caption"><p id="idp721432">Lexical item stream descriptions.</p></div>
608<table class="table"><tbody>
610<td align="left"> LAngle </td>
611<td align="left"> Marks the position of any left angle bracket character.</td>
614<td align="left"> RAngle </td>
615<td align="left"> Marks the position of any right angle bracket character.</td>
618<td align="left"> LBracket </td>
619<td align="left"> Marks the position of any left square bracker character.</td>
622<td align="left"> RBracket </td>
623<td align="left"> Marks the position of any right square bracket
624                     character.</td>
627<td align="left"> Exclam </td>
628<td align="left"> Marks the position of any exclamation mark character.</td>
631<td align="left"> QMark </td>
632<td align="left"> Marks the position of any question mark character.</td>
635<td align="left"> Hyphen </td>
636<td align="left"> Marks the position of any hyphen character.</td>
639<td align="left"> Equals </td>
640<td align="left"> Marks the position of any equal sign character.</td>
643<td align="left"> SQuote </td>
644<td align="left"> Marks the position of any single quote character.</td>
647<td align="left"> DQuote </td>
648<td align="left"> Marks the position of any double quote character.</td>
651<td align="left"> Slash </td>
652<td align="left"> Marks the position of any forward slash character</td>
655<td align="left"> NameScan </td>
656<td align="left"> Marks the position of any XML name character.</td>
659<td align="left"> WS </td>
660<td align="left"> Marks the position of any XML 1.0 whitespace character.</td>
663<td align="left"> PI_start </td>
664<td align="left"> Marks the position of the start of any processing instruction
665                        at the '?' character position.</td>
668<td align="left"> PI_end </td>
669<td align="left"> Marks the position of any end of any processing instruction
670                        at the '&gt;' character position.</td>
673<td align="left"> CtCD_start </td>
674<td align="left"> Marks the position of the start of any comment or CDATA
675                        section at the '!' character position.</td>
678<td align="left"> EndTag_start </td>
679<td align="left"> Marks the position of any end tag at the '/' character
680                        position.</td>
683<td align="left"> CD_end </td>
684<td align="left"> Marks the position of the end of any CDATA section at the '&gt;'
685                        character position. </td>
688<td align="left"> DoubleHyphen </td>
689<td align="left"> Marks the position of any double hyphen character.</td>
692<td align="left"> RefStart </td>
693<td align="left"> Marks the position of any ampersand character.</td>
696<td align="left"> Hash </td>
697<td align="left"> Marks the position of any hash character.</td>
700<td align="left"> x </td>
701<td align="left"> Marks the position of any 'x' character.</td>
704<td align="left"> Digit </td>
705<td align="left"> Marks the position of any digit.</td>
708<td align="left"> Hex </td>
709<td align="left"> Marks the position of any hexidecimal character.</td>
712<td align="left"> Semicolon </td>
713<td align="left"> Marks the position of any semicolon character.</td>
717         </p>
718<p id="idp745408"> The following illustrates a number of the lexical item streams. </p>
719<p id="idp745672">
720            <div class="table-wrapper" id="idp745800">
721<p class="title">Table V</p>
722<div class="caption"><p id="idp746056">Lexical Item Streams</p></div>
723<table class="table">
724<colgroup span="1">
725<col align="left" valign="top" span="1">
726<col align="left" valign="top" span="1">
729<tr valign="top">
730<td>Input Data</td>
732                        <code class="code">&lt;tag&gt;&lt;tag&gt; text &amp;lt;
733                           &amp;#x3e; &lt;/tag&gt;&lt;/tag&gt;</code>
734                     </td>
736<tr valign="top">
739                        <code class="code">1____1______________________1_____1_____</code>
740                     </td>
742<tr valign="top">
745                        <code class="code">____1____1_______________________1_____1</code>
746                     </td>
748<tr valign="top">
751                        <code class="code">__________1____1____1______1____________</code>
752                     </td>
754<tr valign="top">
757                        <code class="code">________________1____1__________________</code>
758                     </td>
760<tr valign="top">
763                        <code class="code">__1____1____1___________11_____1_____1__</code>
764                     </td>
766<tr valign="top">
769                        <code class="code">___________________1______1_____________</code>
770                     </td>
772<tr valign="top">
775                        <code class="code">_____________________________1_____1____</code>
776                     </td>
781         </p>
783<div class="section" id="idp755744">
784<h3 class="title" style="clear: both">UTF-8 Byte Classification, Scope and Validation Streams</h3>
785<p id="idp756128"> An XML parser must accept the UTF-8 encoding of Unicode [<a class="xref" href="#XML10" id="idp756320">XML 1.0</a>].
786            It is a fatal error if an XML document determined to be in UTF-8 contains byte sequences
787            that are not legal in that encoding. UTF-8 byte classification, scope, XML character
788            validation and error flag bit streams are defined to validate UTF-8 byte sequences and
789            support transcoding to UTF-16.</p>
790<div class="section" id="idp757104">
791<h4 class="title" style="clear: both">UTF-8 Byte Classification Streams</h4>
792<p id="idp757464">UTF-8 byte classification bit streams classify UTF-8 bytes based on their role in
793               forming single and multibyte sequences. The u8Prefix and u8Suffix bit streams
794               identify bytes that represent, respectively, prefix or suffix bytes of multibyte
795               UTF-8 sequences. The u8UniByte bit stream identifies those bytes that may be
796               considered single-byte sequences. The u8Prefix2, u8Prefix3, and u8Prefix4 refine the
797               u8Prefix respectively indicating prefixes of two, three or four byte
798            sequences respectively.</p>
800<div class="section" id="idp758312">
801<h4 class="title" style="clear: both">UTF-8 Scope Streams</h4>
802<p id="idp758656"> Scope streams represent expectations established by UTF-8 prefix bytes. For
803               example, the u8Scope22 bit stream represents the positions at which the second byte of a
804               two-byte sequence is expected based on the occurrence of a two-byte prefix in the
805               immediately preceding position. The u8scope32, u8Scope33, u8Scope42, u8scope43, and
806               u8Scope44 complete the set of UTF-8 scope streams.</p>
807<p id="idp477976"> The following example demonstrates the UTF-8 character encoding validation
808               process using parallel bit stream techniques. The result of this validation process
809               is an error flag stream identifying the positions at which errors occur.</p>
810<p id="idp478440">
811               <div class="table-wrapper" id="idp478568">
812<p class="title">Table VI</p>
813<div class="caption"><p id="idp478824">UTF-8 Scope Streams</p></div>
814<table class="table">
815<colgroup span="1">
816<col align="left" valign="top" span="1">
817<col align="left" valign="top" span="1">
820<tr valign="top">
821<td>Input Data</td>
822<td><code class="code">A Text in Farsi: ى ك  م ت ن  ف ا ر س ى</code></td>
824<tr valign="top">
825<td>High Nybbles</td>
827                           <code class="code">42567726624677632D8DBDBDAD82D8DAD82D8D8</code>
828                        </td>
830<tr valign="top">
831<td>Low Nybbles</td>
833                           <code class="code">10458409E061239A099838187910968A9509399</code>
834                        </td>
836<tr valign="top">
839                           <code class="code">11111111111111111__________1______1____</code>
841                        </td>
843<tr valign="top">
846                           <code class="code">_________________1_1_1_1_1__1_1_1__1_1_</code>
847                        </td>
849<tr valign="top">
852                           <code class="code">__________________1_1_1_1_1__1_1_1__1_1</code>
853                        </td>
855<tr valign="top">
858                           <code class="code">_________________1_1_1_1_1__1_1_1__1_1_</code>
859                        </td>
861<tr valign="top">
864                           <code class="code">__________________1_1_1_1_1__1_1_1__1_1</code>
866                        </td>
868<tr valign="top">
871                           <code class="code">_______________________________________</code>
872                        </td>
879            </p>
881<div class="section" id="idp778672">
882<h4 class="title" style="clear: both">UTF-8 Validation Streams</h4>
883<p id="idp779024"> Proper formation of UTF-8 byte sequences requires that the correct number of
884               suffix bytes always follow a UTF-8 prefix byte, and that certain illegal byte
885               combinations are ruled out. For example, sequences beginning with the prefix bytes
886               0xF5 through 0xFF are illegal as they would represent code point values above 10FFFF.
887               In addition, there are constraints on the first suffix byte following certain special
888               prefixes, namely that a suffix following the prefix 0xE0 must fall in the range
889               0xA0–0xBF, a suffix following the prefix 0xED must fall in the range 0x80–0x9F, a
890               suffix following the prefix 0xF0 must fall in the range 0x90–0xBF and a suffix
891               following the prefix 0xF4 must fall in the range 0x80–0x8F. The task of ensuring that
892               each of these constraints hold is known as UTF-8 validation. The bit streams xE0,
893               xED, xF0, xF4, xA0_xBF, x80_x9F, x90_xBF, and x80_x8F are constructed to flag the
894               aforementioned UTF-8 validation errors. The result of UTF-8 validation is a UTF-8
895               error flag bit stream contructed as the ORing of a series of UTF-8 validation tests.
896            </p>
898<div class="section" id="idp781056">
899<h4 class="title" style="clear: both">XML Character Validation Streams</h4>
900<p id="idp781416">The UTF-8 character sequences <span class="ital">0xEF 0xBF 0xBF</span> and
901                  <span class="ital">0xEF 0xBF 0xBE</span> correspond to the Unicode code points 0xFFFE
902               and 0xFFFF respectively. In XML 1.0, 0xFFFE and 0xFFFF represent characters outside
903               the legal XML character ranges. As such, bit streams which mark 0xEF, 0xBF, and 0xBE
904               character are constructed to flag illegal UTF-8 character sequences. </p>
906<div class="section" id="idp782512">
907<h4 class="title" style="clear: both">UTF-8 to UTF-16 Transcoding</h4>
908<p id="idp782864">UTF-8 is often preferred for storage and data exchange, it is suitable for
909               processing, but it is significantly more complex to process than UTF-16 [<a class="xref" href="#Unicode" id="idp783160">Unicode</a>]. As such, XML documents are typically encoded in UTF-8 for
910               serialization and transport, and subsequently transcoded to UTF-16 for processing
911               with programming languages such as Java and C#. Following the parallel bit stream
912               methods developed for the u8u16 transcoder, a high-performance standalone UTF-8 to
913               UTF-16 transcoder [<a class="xref" href="#u8u16" id="idp783920">Cameron 2008</a>], transcoding to UTF-16 may be achieved by
914               computing a series of 16 bit streams. One stream for each of the individual bits of a
915               UTF-16 code unit. </p>
916<p id="idp784488">The bit streams for UTF-16 are conveniently divided into groups: the eight streams
917               u16Hi0, u16Hi1, ..., u16Hi7 for the high byte of each UTF-16 code unit and the eight
918               streams u16Lo1, ..., u16Lo7 for the low byte. Upon conversion of the parallel bit
919               stream data back to byte streams, eight sequential byte streams U16h0, U16h1, ...,
920               U16Hi7 are used for the high byte of each UTF-16 code unit, while U16Lo0, U16Lo1,...,
921               U16Lo7 are used for the corresponding low byte. Interleaving these streams then
922               produces the full UTF-16 doublebyte stream.</p>
924<div class="section" id="idp785384">
925<h4 class="title" style="clear: both">UTF-8 Indexed UTF-16 Streams</h4>
926<p id="idp785744">UTF-16 bit streams are initially defined in UTF-8 indexed form. That is, with sets
927               of bits in one-to-one correspondence with UTF-8 bytes. However, only one set of
928               UTF-16 bits is required for encoding two or three-byte UTF-8 sequences and only two
929               sets are required for surrogate pairs corresponding to four-byte UTF-8 sequences. The
930               u8LastByte (u8UniByte , u8Scope22 , u8Scope33 , and u8Scope44 ) and u8Scope42 streams
931               mark the positions at which the correct UTF-16 bits are computed. The bit sets at
932               other positions must be deleted to compress the streams to the UTF-16 indexed form.
933            </p>
936<div class="section" id="idp786760">
937<h3 class="title" style="clear: both">Control Character Streams</h3>
938<p id="idp787112">The control character bit streams marks ASCII control characters in the range
939            0x00-0x1F. Additional control character bit streams mark the tab, carriage return, line
940            feed, and space character. In addition, a bit stream to mark carriage return line
941            combinations is also constructed. Presently, control character bit streams support the
942            operations of XML 1.0 character validation and XML end-of-line handling.</p>
943<div class="section" id="idp787768">
944<h4 class="title" style="clear: both">XML Character Validation</h4>
945<p id="idp788120">Legal characters in XML are the tab, carriage return, and line feed characters,
946               together with all Unicode characters and excluding the surrogate blocks, as well as hexadecimal OxFFFE and
947               OxFFFF [<a class="xref" href="#XML10" id="idp788480">XML 1.0</a>]. The x00_x1F bit stream is constructed and used in
948               combination with the additional control character bit streams to flags the positions
949               of illegal control characters.</p>
951<div class="section" id="idp789136">
952<h4 class="title" style="clear: both">XML 1.0 End-of-line Handling</h4>
953<p id="idp789496">In XML 1.0 the two-character sequence CR LF (carriage return, line feed) as well as
954               any CR character not followed by a LF character must be converted to a single LF
955               character [<a class="xref" href="#XML10" id="idp789840">XML 1.0</a>].</p>
956<p id="idp790224">By defining carriage return, line feed, and carriage return line feed bit streams,
957               dentoted CR, LF and CRLF respectively, end-of-line normalization processing can be
958               performed in parallel using only a small number of logical and shift operations.</p>
959<p id="idp790704"></p>
960<p id="idp790832">The following example demonstrates the generation of the CRLF deletion mask. In
961               this example, the position of all CR characters followed by LF characters are marked
962               for deletion. Isolated carriage returns are then replaced with LF characters.
963               Completion of this process satisfies the XML 1.0 end-of-line handling requirements.
964               For clarity, this example encodes input data carriage returns as
965               <span class="ital">C</span> characters, whereas line feed characters are shown as
966                  <span class="ital">L</span> characters.</p>
967<p id="idp792008">
968               <div class="table-wrapper" id="idp792136">
969<p class="title">Table VII</p>
970<div class="caption"><p id="idp792392">XML 1.0 End-of-line Handling</p></div>
971<table class="table">
972<colgroup span="1"><col align="left" valign="top" span="1"></colgroup>
974<tr valign="top">
975<td>Input Data</td>
977                           <code class="code">first line C second line CL third line L one more C nothing
978                           left</code>
979                        </td>
981<tr valign="top">
984                           <code class="code">-----------1-------------1------------------------1-------------</code>
985                        </td>
987<tr valign="top">
990                           <code class="code">--------------------------1------------1------------------------</code>
991                        </td>
993<tr valign="top">
996                           <code class="code">--------------------------1-------------------------------------</code>
997                        </td>
1003            </p>
1006<div class="section" id="idp797864">
1007<h3 class="title" style="clear: both">Call Out Streams</h3>
1008<p id="idp798208"> Call out bit streams mark the extents of XML markup structures such as comments,
1009            processing instruction and CDATA sections as well as physical structures such as character and
1010            entity references and general references.  Call out streams are also formed for logical markup structures such
1011            start tags, end tags and empty element tags. </p>
1012<div class="section" id="idp798776">
1013<h4 class="title" style="clear: both">Comment, Processing Instruction and CDATA Section Call Out Streams</h4>
1014<p id="idp799168">Comments, processing instructions and CDATA sections call out streams, Ct_Span,
1015               PI_Span and CD_Span respectively, define sections of an XML document which
1016               contain markup that is not interpreted by an XML processor. As such, the union of
1017               Ct_Span, PI_Span and CD_Span streams defines the regions of non-interpreteable markup.
1018               The stream formed by this union is termed the CtCDPI_Mask.</p>
1019<p id="idp799816">The following tables provides an example of constructing the CtCDPI_Mask. </p>
1020<div class="table-wrapper" id="idp800088">
1021<p class="title">Table VIII</p>
1022<div class="caption"><p id="idp800344">CtCDPI Mask Generation</p></div>
1023<table class="table">
1024<colgroup span="1">
1025<col align="left" valign="top" span="1">
1026<col align="left" valign="top" span="1">
1029<tr valign="top">
1030<td>Input Data</td>
1031<td><code class="code">&lt;?php?&gt; &lt;!-- example --&gt; &lt;![CDATA[ shift: a&lt;&lt;1 ]]&gt;</code></td>
1033<tr valign="top">
1035<td><code class="code">___________________________1111111111111111111111_</code></td>
1037<tr valign="top">
1039<td><code class="code">___________111111111111___________________________</code></td>
1041<tr valign="top">
1043<td><code class="code">_11111____________________________________________</code></td>
1045<tr valign="top">
1047<td><code class="code">_111111__111111111111111__111111111111111111111111</code></td>
1049<tr valign="top">
1051<td><code class="code">__________________________________________________</code></td>
1056<p id="idp805984"> With the removal of all non-interpreteable markup, several phases of parallel bit
1057               stream based SIMD operations may follow operating on up to 128 byte positions on
1058               current commondity processors and assured of XML markup relevancy. For
1059               example, with the extents identification of comments, processing instructions and
1060               CDATA secions, XML names may be identified and length sorted for efficient symbol
1061               table construction. </p>
1062<p id="idp806680"> As an aside, comments and CDATA sections must first be validated to ensure
1063               that comments do not contain "--" sequences and that CDATA sections do not contain illegal
1064               "]]&gt;" sequences prior to ignorable markup stream generation.</p>
1066<div class="section" id="idp807344">
1067<h4 class="title" style="clear: both">Reference Call Out Streams</h4>
1068<p id="idp807696">The reference call out streams are the GenRefs, DecRefs, and HexRefs streams. This
1069               subset of the call out streams marks the extents of all but the closing semicolon of
1070               general and character references.</p>
1071<p id="idp808128">Predefined character
1072               (&amp;lt;,&amp;gt;,&amp;amp;,&amp;apos;,&amp;quot;) and numeric character
1073               references (&amp;#nnnn;, &amp;#xhhhh;) must be replaced by a single character
1074                  [<a class="xref" href="#XML10" id="idp808304">XML 1.0</a>]. As previously shown, this subset of call out streams enables the construction of a DelMask for
1075               references.</p>
1077<div class="section" id="idp809320">
1078<h4 class="title" style="clear: both">Tag Call Out Streams</h4>
1079<p id="idp809672">Whereas sequential bit scans over lexical item streams form the basis of XML
1080               parsing, in the current Parabix parser a new method of parallel parsing has been
1081               developed and prototyped using the concept of bitstream addition. Fundamental to this
1082               method is the concept of a <span class="ital">cursor</span> stream, a bit stream marking
1083               the positions of multiple parallel parses currently in process. </p>
1084<p id="idp810520">The results of parallel parsing using the bit stream addition technique produces a
1085               set of tag call out bit streams. These streams mark the extents of each start tag,
1086               end tag and empty element tag. Within tags, additional streams mark start
1087               and end positions for tag names, as well as attribute names and values. An error flag
1088               stream marks the positions of any syntactic errors encountered during parsing.</p>
1089<p id="idp811648"> The set of tag call out streams consists of the ElemNames, AttNames, AttVals, Tags,
1090               EmptyTagEnds and EndTags bit streams. The following example demonstrates the bit
1091               stream output produced which from parallel parsing using bit stream addition. </p>
1092<div class="table-wrapper" id="idp812120">
1093<p class="title">Table IX</p>
1094<div class="caption"><p id="idp812376">Tag Call Out Streams</p></div>
1095<table class="table">
1096<colgroup span="1">
1097<col align="left" valign="top" span="1">
1098<col align="left" valign="top" span="1">
1101<tr valign="top">
1102<td>Input Data</td>
1104                        <code class="code">&lt;root&gt;&lt;t1&gt;text&lt;/t1&gt;&lt;t2
1105                           a1='foo' a2 =
1106                           'fie'&gt;more&lt;/t2&gt;&lt;tag3
1107                           att3='b'/&gt;&lt;/root&gt;</code>
1108                     </td>
1110<tr valign="top">
1113                        <code class="code">_1111__11___________11_______________________________1111__________________</code>
1114                     </td>
1116<tr valign="top">
1119                        <code class="code">_______________________11_______11________________________1111_____________</code>
1120                     </td>
1122<tr valign="top">
1125                        <code class="code">__________________________11111______11111_____________________111_________</code>
1126                     </td>
1128<tr valign="top">
1131                        <code class="code">___________________________________________________________________1_______</code>
1132                     </td>
1134<tr valign="top">
1137                        <code class="code">_______________111______________________________111__________________11111_</code>
1138                     </td>
1140<tr valign="top">
1143                        <code class="code">_1111__11___________1111111111111111111111___________11111111111111________</code>
1144                     </td>
1146<tr valign="top">
1149                        <code class="code">___________________________________________________________________________</code>
1150                     </td>
1158<div class="section" id="idp822632">
1159<h2 class="title" style="clear: both">SIMD Beyond Bitstreams: Names and Numbers</h2>
1160<p id="idp822952">Whereas the fundamental innovation of our work is the use of SIMD technology in
1161         implementing parallel bit streams for XML, there are also important ways in which more
1162         traditional byte-oriented SIMD operations can be useful in accelerating other aspects of
1163         XML processing.</p>
1164<div class="section" id="idp823448">
1165<h3 class="title" style="clear: both">Name Lookup</h3>
1166<p id="idp823784">Efficient symbol table mechanisms for looking up element and attribute names is
1167            important for almost all XML processing applications. It is also an important technique
1168            merely for assessing well-formedness of an XML document; rather than validating the
1169            character-by-character composition of each occurrence of an XML name as it is
1170            encountered, it is more efficient to validate all but the first occurrence by first
1171            determining whether the name already exists in a table of prevalidated names.</p>
1172<p id="idp825088">The first symbol table mechanism deployed in the Parabix parser simply used the
1173            hashmaps of the C++ standard template library, without deploying any SIMD technology.
1174            However, with the overhead of character validation, transcoding and parsing dramatically
1175            reduced by parallel bit stream technology, we found that symbol lookups then accounted
1176            for about half of the remaining execution time in a statistics gathering application
1177               [<a class="xref" href="#CASCON08" id="idp825712">Cameron, Herdy and Lin 2008</a>]. Thus, symbol table processing was identified as a major
1178            target for further performance improvement. </p>
1179<p id="idp826232"> Our first effort to improve symbol table performance was to employ the splash tables
1180            with cuckoo hashing as described by Ross [<a class="xref" href="#Ross06" id="idp826512">Ross 2006</a>], using SIMD
1181            technology for parallel bucket processing. Although this technique did turn out to have
1182            the advantage of virtually constant-time performance even for very large vocabularies,
1183            it was not particularly helpful for the relatively small vocabularies typically found in
1184            XML document processing. </p>
1185<p id="idp827288"> However, a second approach has been found to be quite useful, taking advantage of
1186            parallel bit streams for cheap determination of symbol length. In essence, the length of
1187            a name can be determined very cheaply using a single bit scan operation. This then makes
1188            it possible to use length-sorted symbol table processing, as follows. First, the
1189            occurrences of all names are stored in arrays indexed by length. Then the length-sorted
1190            arrays may each be inserted into the symbol table in turn. The advantage of this is that
1191            a separate loop may be written for each length. Length sorting makes for very efficient
1192            name processing. For example hash value computations and name comparisons can be made by
1193            loading multibyte values and performing appropriate shifting and masking operations,
1194            without the need for a byte-at-a-time loop. In initial experiments, this length-sorting
1195            approach was found to reduce symbol lookup cost by a factor of two. </p>
1196<p id="idp828544"> Current research includes the application of SIMD technology to further enhance the
1197            performance of length-sorted lookup. We have identified a promising technique for
1198            parallel processing of multiple name occurrences using a parallel trie lookup technique.
1199            Given an array of occurrences of names of a particular length, the first one, two or
1200            four bytes of each name are gathered and stored in a linear array. SIMD techniques are
1201            then used to compare these prefixes with the possible prefixes for the current position
1202            within the trie. In general, a very small number of possibilities exist for each trie
1203            node, allowing for fast linear search through all possibilities. Typically, the
1204            parallelism is expected to exceed the number of possibilities to search through at each
1205            node. With length-sorting to separate the top-level trie into many small subtries, we
1206            expect only a single step of symbol lookup to be needed in most practical instances. </p>
1207<p id="idp829808">The gather step of this algorithm is actually a common technique in SIMD processing.
1208            Instruction set support for gather operations is a likely future direction for SIMD
1209            technology.</p>
1211<div class="section" id="idp830280">
1212<h3 class="title" style="clear: both">Numeric Processing</h3>
1213<p id="idp830624"> Many XML applications involve numeric data fields as attribute values or element
1214            content. Although most current XML APIs uniformly return information to applications in
1215            the form of character strings, it is reasonable to consider direct API support for
1216            numeric conversions within a high-performance XML engine. With string to numeric
1217            conversion such a common need, why leave it to application programmers? </p>
1218<p id="idp831280"> High-performance string to numeric conversion using SIMD operations also can
1219            considerably outperform the byte-at-a-time loops that most application programmers or
1220            libraries might employ. A first step is reduction of ASCII bytes to corresponding
1221            decimal nybbles using a SIMD packing operation. Then an inductive doubling algorithm
1222            using SIMD operations may be employed. First, 16 sets of adjacent nybble values in the
1223            range 0-9 can be combined in just a few SIMD operations to 16 byte values in the range
1224            0-99. Then 8 sets of byte values may similarly be combined with further SIMD processing
1225            to produce doublebyte values in the range 0-9999. Further combination of doublebyte
1226            values into 32-bit integers and so on can also be performed using SIMD operations. </p>
1227<p id="idp832336"> Using appropriate gather operations to bring numeric strings into appropriate array
1228            structures, an XML engine could offer high-performance numeric conversion services to
1229            XML application programmers. We expect this to be an important direction for our future
1230            work, particularly in support of APIs that focus on direct conversion of XML data into
1231            business objects. </p>
1234<div class="section" id="idp833080">
1235<h2 class="title" style="clear: both">APIs and Parallel Bit Streams</h2>
1236<div class="section" id="idp833440">
1237<h3 class="title" style="clear: both">The ILAX Streaming API</h3>
1238<p id="idp833792">The In-Line API for XML (ILAX) is the base API provided with the Parabix parser. It
1239            is intended for low-level extensions compiled right into the engine, with minimum
1240            possible overhead. It is similar to streaming event-based APIs such as SAX, but
1241            implemented by inline substitution rather than using callbacks. In essence, an extension
1242            programmer provides method bodies for event-processing methods declared internal to the
1243            Parabix parsing engine, compiling the event processing code directly with the core code
1244            of the engine. </p>
1245<p id="idp834592"> Although ILAX can be used directly for application programming, its primary use is
1246            for implementing engine extensions that support higher-level APIs. For example, the
1247            implementation of C or C++ based streaming APIs based on the Expat [<a class="xref" href="#Expat" id="idp834992">Expat</a>] or general SAX models can be quite directly implemented. C/C++ DOM
1248            or other tree-based APIs can also be fairly directly implemented. However, delivering
1249            Parabix performance to Java-based XML applications is challenging due to the
1250            considerable overhead of crossing the Java Native Interface (JNI) boundary. This issue
1251            is addressed with the Array Set Model (ASM) concept discussed in the following section. </p>
1252<p id="idp835872"> With the recent development of parallel parsing using bitstream addition, it is
1253            likely that the underlying ILAX interface of Parabix will change. In essence, ILAX
1254            suffers the drawback of all event-based interfaces: they are fundamentally sequential in
1255            number. As research continues, we expect efficient parallel methods building on parallel
1256            bit stream foundations to move up the stack of XML processing requirements. Artificially
1257            imposing sequential processing is thus expected to constrain further advances in XML
1258            performance. </p>
1260<div class="section" id="idp836736">
1261<h3 class="title" style="clear: both">Efficient XML in Java Using Array Set Models</h3>
1262<p id="idp837112"> In our GML-to-SVG case study, we identified the lack of high-performance XML
1263            processing solutions for Java to be of particular interest. Java byte code does not
1264            provide access to the SIMD capabilities of the underlying machine architecture. Java
1265            just-in-time compilers might be capable of using some SIMD facilities, but there is no
1266            real prospect of conventional compiler technology translating byte-at-a-time algorithms
1267            into parallel bit stream code. So the primary vehicle for delivering high-performance
1268            XML processing is to call native parallel bit stream code written in C through JNI
1269            capabilities. </p>
1270<p id="idp838000">However, each JNI call is expensive, so it is desirable to minimize the number of
1271            calls and get as much work done during each call as possible. This mitigates against
1272            direct implementation of streaming APIs in Java through one-to-one mappings to an
1273            underlying streaming API in C. Instead, we have concentrated on gathering information on
1274            the C side into data structures that can then be passed to the Java side. However, using
1275            either C pointer-based structures or C++ objects is problematic because these are
1276            difficult to interpret on the Java side and are not amenable to Java's automatic storage
1277            management system. Similarly, Java objects cannot be conveniently created on the C side.
1278            However, it is possible to transfer arrays of simple data values (bytes or integers)
1279            between C and Java, so that makes a reasonable focus for bulk data communication between
1280            C and Java. </p>
1281<p id="idp839192"><span class="ital">Array Set Models</span> are array-based representations of information
1282            representing an XML document in accord with XML InfoSet [<a class="xref" href="#InfoSet" id="idp839600">XML Infoset</a>] or
1283            other XML data models relevant to particular APIs. As well as providing a mechanism for
1284            efficient bulk data communication across the JNI boundary, ASMs potentially have a
1285            number of other benefits in high-performance XML processing. <div class="itemizedlist" id="idp840256"><ul>
1286<li id="idp840384"><p id="idp840512">Prefetching. Commodity processors commonly support hardware and/or software
1287                     prefetching to ensure that data is available in a processor cache when it is
1288                     needed. In general, prefetching is most effective in conjunction with the
1289                     continuous sequential memory access patterns associated with array
1290                  processing.</p></li>
1291<li id="idp841160"><p id="idp841288">DMA. Some processing environments provide Direct Memory Access (DMA)
1292                     controllers for block data movement in parallel with computation. For example,
1293                     the Cell Broadband Engine uses DMA controllers to move the data to and from the
1294                     local stores of the synergistic processing units. Arrays of contiguous data
1295                     elements are well suited to bulk data movement using DMA.</p></li>
1296<li id="idp842000"><p id="idp842128">SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern
1297                     processor instruction sets allow simultaneous application of particular
1298                     instructions to sets of elements from parallel arrays. For effective use of
1299                     SIMD capabilities, an SoA (Structure of Arrays) model is preferrable to an AoS
1300                     (Array of Structures) model. </p></li>
1301<li id="idp842800"><p id="idp842928">Multicore processors. Array-oriented processing can enable the effective
1302                     distribution of work to the individual cores of a multicore system in two
1303                     distinct ways. First, provided that sequential dependencies can be minimized or
1304                     eliminated, large arrays can be divided into separate segments to be processed
1305                     in parallel on each core. Second, pipeline parallelism can be used to implement
1306                     efficient multipass processing with each pass consisting of a processing kernel
1307                     with array-based input and array-based output. </p></li>
1308<li id="idp844472"><p id="idp844600">Streaming buffers for large XML documents. In the event that an XML document
1309                     is larger than can be reasonably represented entirely within processor memory,
1310                     a buffer-based streaming model can be applied to work through a document using
1311                     sliding windows over arrays of elements stored in document order. </p></li>
1313         </p>
1314<div class="section" id="idp845360">
1315<h4 class="title" style="clear: both">Saxon-B TinyTree Example</h4>
1316<p id="idp845712">As a first example of the ASM concept, current work includes a proof-of-concept to
1317               deliver a high-performance replacement for building the TinyTree data structure used
1318               in Saxon-B 6.5.5, an open-source XSLT 2.0 processor written in Java [<a class="xref" href="#Saxon" id="idp846112">Saxon</a>]. Although XSLT stylesheets may be cached for performance, the
1319               caching of source XML documents is typically not possible. A new TinyTree object to
1320               represent the XML source document is thus commonly constructed with each new query so
1321               that the overall performance of simple queries on large source XML documents is
1322               highly dependent on TinyTree build time. Indeed, in a study of Saxon-SA, the
1323               commercial version of Saxon, query time was shown to be dominated by TinyTree build
1324               time [<a class="xref" href="#Kay08" id="idp847040">Kay 2008</a>]. Similar performance results are demonstrable for the
1325               Saxon-B XSLT processor as well. </p>
1326<p id="idp847568"> The Saxon-B processor studied is a pure Java solution, converting a SAX (Simple
1327               API for XML) event stream into the TinyTree Java object using the efficient Aelfred
1328               XML parser [<a class="xref" href="#AElfred" id="idp847912">Ælfred</a>]. The TinyTree structure is itself an
1329               array-based structure mapping well suited to the ASM concept. It consists of six
1330               parallel arrays of integers indexed on node number and containing one entry for each
1331               node in the source document, with the exception of attribute and namespace nodes
1332                  [<a class="xref" href="#Saxon" id="idp848624">Saxon</a>]. Four of the arrays respectively provide node kind, name
1333               code, depth, and next sibling information for each node, while the two others are
1334               overloaded for different purposes based on node kind value. For example, in the
1335               context of a text node , one of the overloaded arrays holds the text buffer offset
1336               value whereas the other holds the text buffer length value. Attributes and namespaces
1337               are represented using similiar parallel array of values. The stored TinyTree values
1338               are primarily primitive Java types, however, object types such as Java Strings and
1339               Java StringBuffers are also used to hold attribute values and comment values
1340               respectively. </p>
1341<p id="idp849784"> In addition to the TinyTree object, Saxon-B maintains a NamePool object which
1342               represents a collection of XML name triplets. Each triplet is composed of a Namespace
1343               URI, a Namespace prefix and a local name and encoded as an integer value known as a
1344               namecode. Namecodes permit efficient name search and look-up using integer
1345               comparison. Namecodes may also be subsequently decoded to recover namespace and local
1346               name information. </p>
1347<p id="idp850488"> Using the Parabix ILAX interface, a high-performance reimplementation of TinyTree
1348               and NamePool data structures was built to compare with the Saxon-B implementation. In
1349               fact, two functionally equivalent versions of the ASM java class were constructed. An
1350               initial version was constructed based on a set of primitive Java arrays constructed
1351               and allocated in the Java heap space via JNI New&lt;PrimitiveType&gt;Array
1352               method call. In this version, the JVM garbage collector is aware of all memory
1353               allocated in the native code. However, in this approach, large array copy operations
1354               limited overall performance to approximately a 2X gain over the Saxon-B build time. </p>
1355<p id="idp851584">To further address the performance penalty imposed by copying large array values,
1356               a second version of the ASM Java object was constructed based on natively backed
1357               Direct Memory Byte Buffers [<a class="xref" href="#JNI" id="idp851944">Hitchens 2002</a>]. In this version the JVM garbage
1358               collector is unaware any native memory resources backing the Direct Memory Byte
1359               Buffers. Large JNI-based copy operations are avoided; however, system memory must be
1360               explicitly deallocated via a Java native method call. Using this approach, our
1361               preliminary results show an approximate total 2.5X gain over Saxon-B build time.
1362            </p>
1366<div class="section" id="idp853000">
1367<h2 class="title" style="clear: both">Compiler Technology</h2>
1368<p id="idp853344"> An important focus of our recent work is on the development of compiler technology to
1369         automatically generate the low-level SIMD code necessary to implement bit stream processing
1370         given suitable high-level specifications. This has several potential benefits. First, it
1371         can eliminate the tedious and error-prone programming of bit stream operations in terms of
1372         register-at-a-time SIMD operations. Second, compilation technology can automatically employ
1373         a variety of performance improvement techniques that are difficult to apply manually. These
1374         include algorithms for instruction scheduling and register allocation as well as
1375         optimization techniques for common subexpression expression elimination and register
1376         rematerialization among others. Third, compiler technology makes it easier to make changes
1377         to the low-level code for reasons of perfective or adaptive maintenance.</p>
1378<p id="idp854496">Beyond these reasons, compiler technology also offers the opportunity for retargetting
1379         the generation of code to accommodate different processor architectures and API
1380         requirements. Strategies for efficient parallel bit stream code can vary considerably
1381         depending on processor resources such as the number of registers available, the particular
1382         instruction set architecture supported, the size of L1 and L2 data caches, the number of
1383         available cores and so on. Separate implementation of custom code for each processor
1384         architecture would thus be likely to be prohibitively expensive, prone to errors and
1385         inconsistencies and difficult to maintain. Using compilation technology, however, the idea
1386         would be to implement a variety of processor-specific back-ends all using a common front
1387         end based on parallel bit streams. </p>
1388<div class="section" id="idp855592">
1389<h3 class="title" style="clear: both">Character Class Compiler</h3>
1390<p id="idp855944">The first compiler component that we have implemented is a character class compiler,
1391            capable of generation all the bit stream logic necessary to produce a set of lexical
1392            item streams each corresponding to some particular set of characters to be recognized.
1393            By taking advantage of common patterns between characters within classes, and special
1394            optimization logic for recognizing character-class ranges, our existing compiler is able
1395            to generate well-optimized code for complex sets of character classes involving numbers
1396            of special characters as well as characters within specific sets of ranges. </p>
1398<div class="section" id="idp857544">
1399<h3 class="title" style="clear: both">Regular Expression Compilation</h3>
1400<p id="idp857904">Based on the character class compiler, we are currently investigating the
1401            construction of a regular expression compiler that can implement bit-stream based
1402            parallel regular-expression matching similar to that describe previously for parallel
1403            parsing by bistream addition. This compiler works with the assumption that bitstream
1404            regular-expression definitions are deterministic; no backtracking is permitted with the
1405            parallel bit stream representation. In XML applications, this compiler is primarily
1406            intended to enforce regular-expression constraints on string datatype specifications
1407            found in XML schema. </p>
1409<div class="section" id="idp858856">
1410<h3 class="title" style="clear: both">Unbounded Bit Stream Compilation</h3>
1411<p id="idp859216">The Catalog of XML Bit Streams presented earlier consist of a set of abstract,
1412            unbounded bit streams, each in one-to-one correspondence with input bytes of a text
1413            file. Determining how these bit streams are implemented using fixed-width SIMD
1414            registers, and possibly processed in fixed-length buffers that represent some multiple
1415            of the register width is a source of considerable programming complexity. The general
1416            goal of our compilation strategy in this case is to allow operations to be programmed in
1417            terms of unbounded bit streams and then automatically reduced to efficient low-level
1418            code with the application of a systematic code generation strategy for handling block
1419            and buffer boundary crossing. This is work currently in progress. </p>
1422<div class="section" id="idp860384">
1423<h2 class="title" style="clear: both">Conclusion</h2>
1424<p id="idp860704">Parallel bit stream technology offers the opportunity to dramatically speed up the core
1425         XML processing components used to implement virtually any XML API. Character validation and
1426         transcoding, whitespace processing, and parsing up to including the full validation of tag
1427         syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the
1428         positions of all element names, attribute names and attribute values can also be produced,
1429         followed by fast bit scan operations to generate position and length values. Beyond bit
1430         streams, byte-oriented SIMD processing of names and numerals can also accelerate
1431         performance beyond sequential byte-at-a-time methods. </p>
1432<p id="idp861640">Advances in processor architecture are likely to further amplify the performance of
1433         parallel bit stream technology over traditional byte-at-a-time processing over the next
1434         decade. Improvements to SIMD register width, register complement and operation format can
1435         all result in further gains. New SIMD instruction set features such as inductive doubling
1436         support, parallel extract and deposit instructions, bit interleaving and scatter/gather
1437         capabilities should also result in significant speed-ups. Leveraging the intraregister
1438         parallelism of parallel bit stream technology within SIMD registers to take of intrachip
1439         parallelism on multicore processors should accelerate processing further. </p>
1440<p id="idp862592">Technology transfer using a patent-based open-source business model is a further goal of
1441         our work with a view to widespread deployment of parallel bit stream technology in XML
1442         processing stacks implementing a variety of APIs. The feasibility of substantial
1443         performance improvement in replacement of technology implementing existing APIs has been
1444         demonstrated even in complex software architectures involving delivery of performance
1445         benefits across the JNI boundary. We are seeking to accelerate these deployment efforts
1446         both through the development of compiler technology to reliably apply these methods to a
1447         variety of architectures as well as to identify interested collaborators using open-source
1448         or commercial models. </p>
1450<div class="section" id="idp864448">
1451<h2 class="title" style="clear: both">Acknowledgments</h2>
1452<p id="idp864792">This work is supported in part by research grants and scholarships from the Natural
1453         Sciences and Engineering Research Council of Canada, the Mathematics of Information
1454         Technology and Complex Systems Network and the British Columbia Innovation Council. </p>
1455<p id="idp865264">We thank our colleague Dan Lin (Linda) for her work in high-performance symbol table
1456         processing. </p>
1458<div class="bibliography" id="idp865632">
1459<h2 class="title" style="clear:both">Bibliography</h2>
1460<p class="bibliomixed" id="XMLChip09"><a href="#idp500496">[Leventhal and Lemoine 2009] </a>Leventhal, Michael and
1461         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
1462         Processing XML Efficiently 2009, Montréal.</p>
1463<p class="bibliomixed" id="Datapower09"><a href="#idp506976">[Salz, Achilles and Maze 2009] </a>Salz, Richard,
1464         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
1465         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
1466         Efficiently 2009, Montréal.</p>
1467<p class="bibliomixed" id="PPoPP08"><a href="#idp507840">[Cameron 2007] </a>Cameron, Robert D. 2007. A Case Study
1468         in SIMD Text Processing with Parallel Bit Streams UTF-8 to UTF-16 Transcoding. Proceedings
1469         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
1470         Lake City, Utah. On the Web at <a href="" class="link" target="_new"></a>.</p>
1471<p class="bibliomixed" id="CASCON08"><a href="#idp508744">[Cameron, Herdy and Lin 2008] </a>Cameron, Robert D.,
1472         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
1473         Technology. Proceedings of CASCON 2008. 13th ACM SIGPLAN Symposium on Principles and
1474         Practice of Parallel Programming 2008, Toronto.</p>
1475<p class="bibliomixed" id="SVGOpen08"><a href="#idp509256">[Herdy, Burggraf and Cameron 2008] </a>Herdy, Kenneth
1476         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
1477         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
1478         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
1479         Nuremburg. On the Web at
1480            <a href="" class="link" target="_new"></a>.</p>
1481<p class="bibliomixed" id="Ross06"><a href="#idp826512">[Ross 2006] </a>Ross, Kenneth A. 2006. Efficient hash
1482         probes on modern processors. Proceedings of ICDE, 2006. ICDE 2006, Atlanta. On the Web at
1483            <a href="" class="link" target="_new"></a>.</p>
1484<p class="bibliomixed" id="ASPLOS09"><a href="#idp638256">[Cameron and Lin 2009] </a>Cameron, Robert D. and Dan
1485         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
1486         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</p>
1487<p class="bibliomixed" id="Wu08"><a href="#idp639856">[Wu et al. 2008] </a>Wu, Yu, Qi Zhang, Zhiqiang Yu and
1488         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
1489         Proceedings of Balisage 2008, Montréal. On the Web at
1490            <a href="" class="link" target="_new"></a>.</p>
1491<p class="bibliomixed" id="u8u16"><a href="#idp699152">[Cameron 2008] </a>u8u16 - A High-Speed UTF-8 to UTF-16
1492         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
1493         Science Simon Fraser University, June 21 2007.</p>
1494<p class="bibliomixed" id="XML10"><a href="#idp756320">[XML 1.0] </a>Extensible Markup Language (XML) 1.0 (Fifth
1495         Edition) W3C Recommendation 26 November 2008. On the Web at
1496            <a href="" class="link" target="_new"></a>.</p>
1497<p class="bibliomixed" id="Unicode"><a href="#idp783160">[Unicode] </a>The Unicode Consortium. 2009. On the Web at
1498            <a href="" class="link" target="_new"></a>.</p>
1499<p class="bibliomixed" id="Pex06"><a href="#idp638640">[Hilewitz and Lee 2006] </a> Hilewitz, Y. and Ruby B. Lee.
1500         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
1501         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
1502         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</p>
1503<p class="bibliomixed" id="InfoSet"><a href="#idp839600">[XML Infoset] </a>XML Information Set (Second Edition) W3C
1504         Recommendation 4 February 2004. On the Web at
1505         <a href="" class="link" target="_new"></a>.</p>
1506<p class="bibliomixed" id="Saxon"><a href="#idp846112">[Saxon] </a>SAXON The XSLT and XQuery Processor. On the Web
1507         at <a href="" class="link" target="_new"></a>.</p>
1508<p class="bibliomixed" id="Kay08"><a href="#idp847040">[Kay 2008] </a> Kay, Michael Y. 2008. Ten Reasons Why Saxon
1509         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</p>
1510<p class="bibliomixed" id="AElfred"><a href="#idp847912">[Ælfred] </a> The Ælfred XML Parser. On the Web at
1511            <a href="" class="link" target="_new"></a>.</p>
1512<p class="bibliomixed" id="JNI"><a href="#idp851944">[Hitchens 2002] </a>Hitchens, Ron. Java NIO. O'Reilly, 2002.</p>
1513<p class="bibliomixed" id="Expat"><a href="#idp834992">[Expat] </a>The Expat XML Parser.
1514            <a href="" class="link" target="_new"></a>.</p>
1517<div id="balisage-footer"><h3 style="font-family: serif; margin:0.25em">
1518<i>Balisage:</i> <small>The Markup Conference</small>
Note: See TracBrowser for help on using the repository browser.