source: docs/Balisage09/Bal2009came0601.xml @ 281

Last change on this file since 281 was 281, checked in by cameron, 10 years ago

Balisage Intl. Symp on Processing XML Efficiently - Cameron/Herdy/Amiri paper

1<?xml version="1.0" encoding="UTF-8"?>
2<!-- MODIFIED DTD LOCATION -->
3<!DOCTYPE article SYSTEM "balisage-1-1.dtd">
4<article xmlns="http://docbook.org/ns/docbook" version="5.0-subset Balisage-1.1"
5   xml:id="HR-23632987-8973">
6   <title>Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</title>
7   <info>
8      <confgroup>
9         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
10            Space, Time, or Bandwidth</conftitle>
11         <confdates>August 10 2009</confdates>
12      </confgroup>
13      <abstract>
14         <para>By first transforming the octets (bytes) of XML texts into eight parallel bit
15            streams, the SIMD features of commodity processors can be exploited for parallel
16            processing of blocks of 128 input bytes at a time. Established transcoding and parsing
17            techniques are reviewed followed by new techniques including parsing with bitstream
18            addition. Further opportunities are discussed in light of expected advances in CPU
19            architecture and compiler technology. Implications for various APIs and information
20            models are presented as well as opportunities for collaborative open-source
21         development.</para>
22      </abstract>
23      <author>
24         <personname>
25            <firstname>Rob</firstname>
26            <surname>Cameron</surname>
27         </personname>
28         <personblurb>
29                 <para>Dr. Rob Cameron is Professor and Director of Computing Science
30                         at Simon Fraser University.   With a broad spectrum of research
31                         interests related to programming languages, software engineering and
32                         sociotechnical design of public computing infrastructure, he has
33                         recently been focusing on high performance text processing using
34                         parallel bit stream technology and its applications to XML.
35                         He is also a patentleft evangelist, advocating university-based
36                         technology transfer models dedicated to free use in open source.
37                 </para>
38
39         </personblurb>
40         <affiliation>
41            <jobtitle>Professor of Computing Science</jobtitle>
42            <orgname>Simon Fraser University</orgname>
43         </affiliation>
44         <email>cameron@cs.sfu.ca</email>
45      </author>
46      <author>
47         <personname>
48            <firstname>Ken</firstname>
49            <surname>Herdy</surname>
50         </personname>
51         <personblurb>
52                 <para>
53                         Ken Herdy completed an Advanced Diploma of Technology in Geographical
54                         Information Systems at the British Columbia Institute of Technology in 2003
55                         and earned a Bachelor of Science in Computing Science with a Certificate in
56                         Spatial Information Systems at Simon Fraser University in 2005.
57                 </para>
58                 <para>
59                         Ken is currently pursuing graduate studies in Computing Science at Simon
60                         Fraser University with industrial scholarship support from the Natural
61                         Sciences and Engineering Research Council of Canada, the Mathematics of
62                         Information Technology and Complex Systems NCE, and the BC Innovation
63                         Council. His research focus is an analysis of the principal techniques that
64                         may be used to improve XML processing performance in the context of the
65                         Geography Markup Language (GML).
66                 </para>
67
68         </personblurb>
69         <affiliation>
70            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
71            <orgname>Simon Fraser University</orgname>
72         </affiliation>
73         <email>ksherdy@cs.sfu.ca</email>
74      </author>
75      <author>
76         <personname>
77            <firstname>Ehsan</firstname>
78            <surname>Amiri</surname>
79         </personname>
80         <personblurb>
81                 <para>Ehsan Amiri is a PhD student in Computer Science at Simon Fraser University. He previously studied at Sharif University of Technology in Tehran, Iran. While his graduate research has focused on theoretical problems such as fingerprinting, Ehsan has also worked on software projects such as the development of a multi-node firewall. More recently he has been developing compiler technology for automatic generation of bit stream processing code. </para>
82
83         </personblurb>
84         <affiliation>
85            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
86            <orgname>Simon Fraser University</orgname>
87         </affiliation>
88         <email>eamiri@cs.sfu.ca</email>
89      </author>
90      <legalnotice>
91         <para>Copyright &#x000A9; 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
92                 This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License.</para>
93      </legalnotice>
94      <keywordset role="author">
95         <keyword/>
96         <keyword/>
97         <keyword/>
98      </keywordset>
99   </info>
100   <section>
101      <title>Introduction</title>
102      <para> While particular XML applications may benefit from special-purpose hardware such as XML
103         chips [<xref linkend="XMLChip09"/>] or appliances [<xref linkend="Datapower09"/>], the bulk
104         of the world's XML processing workload will continue to be handled by XML software stacks
105         on commodity processors. By exploiting the SIMD capabilities of such processors, for example the
106         SSE instructions of x86 chips, parallel bit stream technology offers the potential of
107         dramatic improvement over byte-at-a-time processing for a variety of XML processing tasks.
108         Character set issues such as Unicode validation and transcoding [<xref linkend="PPoPP08"
109         />], normalization of line breaks and white space, and XML character validation can be
110         handled fully in parallel using this representation. Lexical item streams, such as the bit
111         stream marking the positions of opening angle brackets, can also be formed in parallel.
112         Bit-scan instructions of commodity processors may then be used on lexical item streams to
113         implement rapid single-instruction scanning across variable-length multi-byte text blocks
114         as in the Parabix XML parser [<xref linkend="CASCON08"/>]. Overall, these techniques may be
115         combined to yield end-to-end performance that may be 1.5X to 15X faster than alternatives
116            [<xref linkend="SVGOpen08"/>].</para>
117      <para>Continued research in parallel bit stream techniques as well as more conventional
118         application of SIMD techniques in XML processing offers further prospects for improvement
119         of core XML components as well as for tackling performance-critical tasks further up the
120         stack. A newly prototyped technique for parallel tag parsing using bitstream addition is
121         expected to improve parsing performance even beyond that achieved using sequential bit
122         scans. Several techniques for improved symbol table performance are being investigated,
123         including parallel hash value calculation and length-based sorting using the cheap length
124         determination afforded by bit scans. To deliver the benefits of parallel bit stream
125         technology to the Java world, we are developing Array Set Model (ASM) representations of
126         XML Infoset and other XML information models for efficient transmission across the JNI
127         boundary.</para>
128
129      <para>Amplifying these software advances, continuing hardware advances in commodity processors
130         increase the relative advantage of parallel bit stream techniques over traditional
131         byte-at-a-time processing. For example, the Intel Core architecture improved SSE processing
132         to give superscalar execution of bitwise logic operations (3 instructions per cycle vs. 1
133         in Pentium 4). Upcoming 256-bit AVX technology extends the register set and replaces
134         destructive two-operand instructions with a nondestructive three-operand form.
135         General-purpose programming on graphics processing units (GPGPU) such as the upcoming 512-bit
136         Larrabee processor may also be useful for XML applications using parallel bit streams. New
137         instruction set architectures may also offer dramatic improvements in core algorithms.
138         Using relatively simple extensions to support the principle of inductive doubling, a 3X
139         improvement in several core parallel bit stream algorithms may be achieved [<xref
140            linkend="ASPLOS09"/>]. Other possibilities include direct implementation of parallel
141         extract and parallel deposit (pex/pdep) instructions [<xref linkend="Pex06"/>], and
142         bit-level interleave operations as in Larrabee, each of which would have important
143         application to parallel bit stream processing.</para>
144
145      <para>Further prospects for XML performance improvement arise from leveraging the
146         intraregister parallelism of parallel bit stream technology to exploit the intrachip
147         parallelism of multicore computing. Parallel bit stream techniques can support multicore
148         parallelism in both data partitioning and task partitioning models. For example, the
149         datasection partitioning approach of Wu, Zhang, Yu and Li may be used to partition blocks
150         for speculative parallel parsing on separate cores followed by a postprocessing step to
151         join partial S-trees [<xref linkend="Wu08"/>].</para>
152
153      <para>In our view, the established and expected performance advantages of parallel bit stream
154         technology over traditional byte-at-a-time processing are so compelling that parallel bit
155         stream technology should ultimately form the foundation of every high-performance XML
156         software stack. We envision a common high-performance XML kernel that may be customized to
157         a variety of processor architectures and that supports a wide range of existing and new XML
158         APIs. Widespread deployment of this technology should greatly benefit the XML community in
159         addressing both the deserved and undeserved criticism of XML on performance grounds. A
160         further benefit of improved performance is a substantial greening of XML technologies.</para>
161
162      <para>To complement our research program investigating fundamental algorithms and issues in
163         high-performance XML processing, our work also involves development of open source software
164         implementing these algorithms, with a goal of full conformance to relevant specifications.
165         From the research perspective, this approach is valuable in ensuring that the full
166         complexity of required XML processing is addressed in reporting and assessing processing
167         results. However, our goal is also to use this open source software as a basis of
168         technology transfer. A Simon Fraser University spin-off company, called International
169         Characters, Inc., has been created to commercialize the results of this work using a
170         patent-based open source model.</para>
171
172      <para>To date, we have not been successful in establishing a broader community of
173         participation with our open source code base. Within open-source communities, there is
174         often a general antipathy towards software patents; this may limit engagement with our
175         technology, even though it has been dedicated for free use in open source. </para>
176
177      <para>A further complication is the inherent difficulty of SIMD programming in general, and
178         parallel bit stream programming in particular. Considerable work is required with each new
179         algorithmic technique being investigated as well as in retargeting our techniques for each
180         new development in SIMD and multicore processor technologies. To address these concerns, we
181         have increasingly shifted the emphasis of our research program towards compiler technology
182         capable of generating parallel bit stream code from higher-level specifications.</para>
183   </section>
184
185   <section>
186      <title>A Catalog of Parallel Bit Streams for XML</title>
187      <section>
188         <title>Introduction</title>
189         <para>In this section, we introduce the fundamental concepts of parallel bit stream
190            technology and present a comprehensive catalog of parallel bit streams for use in XML
191            processing. In presenting this catalog, the focus is on the specification of the bit
192            streams as data streams in one-to-one correspondence with the character code units of an
193            input XML stream. The goal is to define these bit streams in the abstract without
194            initially considering memory layouts, register widths or other issues related to
195            particular target architectures. In cataloging these techniques, we also hope to convey
196            a sense of the breadth of applications of parallel bit stream technology to XML
197            processing tasks. </para>
198      </section>
199
200      <section>
201         <title>Basis Bit Streams</title>
202         <para>Given a byte-oriented text stream represented in UTF-8, for example, we define a
203            transform representation of this text consisting of a set of eight parallel bit streams
204            for the individual bits of each byte. Thus, the <code>Bit0</code> stream is the stream
205            of bits consisting of bit 0 of each byte in the input byte stream, <code>Bit1</code> is
206            the bit stream consisting of bit 1 of each byte in the input stream and so on. The set
207            of streams <code>Bit0</code> through <code>Bit7</code> are known as the <emphasis>basis
208               streams</emphasis> of the parallel bit stream representation. The following table
209            shows an example XML character stream together with its representation as a set of 8
210            basis streams. <table>
211               <caption>
212                  <para>XML Character Stream Transposition.</para>
213               </caption>
214               <colgroup><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /></colgroup>
215               <tbody>  <tr valign="top"><td>XML</td><td><code>&lt;</code></td><td><code>t</code></td><td><code>a</code></td><td><code>g</code></td><td><code>/</code></td><td><code>&gt;</code></td></tr>
216                  <tr valign="top"><td>ASCII</td><td><code>00111100</code></td><td><code>01110100</code></td><td><code>01100001</code></td><td><code>01100111</code></td><td><code>00101111</code></td><td><code>00111110</code></td></tr>
217                  <tr valign="top"><td>Bit0</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td></tr>
218                  <tr valign="top"><td>Bit1</td><td><code>0</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td></tr>
219                  <tr valign="top"><td>Bit2</td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td></tr>
220                  <tr valign="top"><td>Bit3</td><td><code>1</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td></tr>
221                  <tr valign="top"><td>Bit4</td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>1</code></td></tr>
222                  <tr valign="top"><td>Bit5</td><td><code>1</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td></tr>
223                  <tr valign="top"><td>Bit6</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td></tr>
224                  <tr valign="top"><td>Bit7</td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>0</code></td></tr>
225               </tbody>
226            </table>
227           
228         </para>
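         <para>As a point of reference, the transposition shown in the table can be modelled in a
            few lines of scalar Python. This is only a sketch of the definition, not the SIMD
            algorithm. The names <code>transpose</code> and <code>show</code> are illustrative, and
            the sketch assumes that stream position <emphasis>i</emphasis> is held in bit
            <emphasis>i</emphasis> of an unbounded Python integer and that <code>Bit0</code> is the
            most significant bit of each byte, as in the table.</para>
         <programlisting language="python"><![CDATA[
```python
def transpose(data: bytes) -> list[int]:
    """Return the eight basis streams [Bit0, ..., Bit7] of data.

    Assumption: stream position i is held in bit i of a Python int, and
    Bit0 is the most significant bit of each byte.
    """
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << pos
    return basis


def show(stream: int, length: int) -> str:
    """Render a stream left to right, one character per input position."""
    return "".join("1" if (stream >> i) & 1 else "0" for i in range(length))


text = b"<tag/>"
rows = [show(stream, len(text)) for stream in transpose(text)]
for k, row in enumerate(rows):
    print(f"Bit{k} {row}")
```
]]></programlisting>
         <para>Each printed row corresponds to one basis stream row of the table, read left to
            right across the six input characters.</para>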
229         <para> Depending on the features of a particular processor architecture, there are a number
230            of algorithms for transposition to parallel bit stream form. Several of these algorithms
231            employ a three-stage structure. In the first stage, the input byte stream is divided
232            into a pair of half-length streams consisting of four bits for each byte, for example,
233            one stream for the high nybble of each byte and another for the low nybble of each byte.
234            In the second stage, these streams of four bits per byte are each divided into streams
235            consisting of two bits per original byte, for example streams for the
236            <code>Bit0/Bit1</code>, <code>Bit2/Bit3</code>, <code>Bit4/Bit5</code>, and
237               <code>Bit6/Bit7</code> pairs. In the final stage, the streams are further subdivided
238            into the individual bit streams. </para>
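         <para>The three-stage structure can be sketched as repeated halving of the unit width.
            The following Python model is an assumption-laden illustration: it keeps one list
            element per input byte and ignores how units are packed into registers, which is
            where the actual SIMD work lies. The function names are hypothetical.</para>
         <programlisting language="python"><![CDATA[
```python
def split_units(units, width):
    """Split each width-bit unit into its high half and its low half."""
    half = width // 2
    low_mask = (1 << half) - 1
    high = [u >> half for u in units]
    low = [u & low_mask for u in units]
    return high, low


def transpose_three_stage(data):
    """Scalar model of the three-stage split: 8 -> 4 -> 2 -> 1 bits per unit.

    Returns eight lists of bits ordered Bit0 .. Bit7 (Bit0 = most
    significant bit of each byte).
    """
    streams = [list(data)]          # one stream of 8-bit units
    width = 8
    while width > 1:                # runs exactly three times
        next_streams = []
        for units in streams:
            high, low = split_units(units, width)
            next_streams.extend([high, low])
        streams = next_streams
        width //= 2
    return streams


bit_rows = ["".join(str(b) for b in s) for s in transpose_three_stage(b"<tag/>")]
print(bit_rows)
```
]]></programlisting>
         <para>After the first pass there are two streams (high and low nybbles), after the second
            four streams (the bit pairs), and after the third the eight individual bit
            streams.</para>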
239         <para> Using SIMD capabilities, this process is quite efficient, with an amortized cost of
240            1.1 CPU cycles per input byte on Intel Core 2 with SSE, or 0.6 CPU cycles per input byte
241            on PowerPC G4 with AltiVec. With future advances in processor technology, this
242            transposition overhead is expected to decrease, possibly taking advantage of upcoming
243            parallel extract (pex) instructions on Intel technology. In the ideal case, only 24
244            instructions are needed to transform a block of 128 input bytes using 128-bit SSE
245            registers under the inductive doubling instruction set architecture, representing an
246            overhead of less than 0.2 instructions per input byte (24/128 is about 0.19). </para>
247      </section>
248
249      <section>
250         <title>General Streams</title>
251
252         <section>
253            <title>Error Flag Streams</title>
254            <para>The error flag stream indicates the character code unit positions of errors. XML
255               processing examples which benefit from marking error positions include UTF-8
256               character sequence validation and XML parsing [<xref linkend="u8u16"/>].</para>
257            <para>The following table provides an example of predefined entity reference parsing. <table>
258               <caption>
259                  <para>Parsing Entity References</para>
260               </caption>
261               <colgroup><col align="left" valign="top" /></colgroup>
262               <tbody>  <tr valign="top"><td>XML</td><td><code>Well Formed &amp;lt; Erroneous &amp;gt!</code></td></tr>
263                  <tr valign="top"><td>RefStart</td><td><code>------------1--------------1---</code></td></tr>
264                  <tr valign="top"><td>RefEnd</td><td><code>---------------1---------------</code></td></tr>
265                  <tr valign="top"><td>RefError</td><td><code>------------------------------1</code></td></tr>
266               </tbody>
267            </table>
268            </para>
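            <para>One way the three streams in the table might be derived is with bitstream
               addition, sketched below in Python. This is an illustration under stated
               assumptions rather than the parser's actual code: stream position
               <emphasis>i</emphasis> is held in bit <emphasis>i</emphasis> of a Python integer,
               reference names are taken to be runs of ASCII letters and digits, and an empty
               name is not flagged. All function names are hypothetical.</para>
            <programlisting language="python"><![CDATA[
```python
import string

# Simplifying assumption: a reference name is a run of ASCII letters/digits.
NAME_CHARS = set(string.ascii_letters + string.digits)


def char_stream(text, predicate):
    """Bit i of the result is set when predicate(text[i]) holds."""
    stream = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            stream |= 1 << i
    return stream


def show(stream, length):
    return "".join("1" if (stream >> i) & 1 else "-" for i in range(length))


def parse_refs(text):
    ref_start = char_stream(text, lambda c: c == "&")
    semicolon = char_stream(text, lambda c: c == ";")
    name = char_stream(text, lambda c: c in NAME_CHARS)
    cursor = ref_start << 1                 # first position after each '&'
    # Adding the cursor bits to the name stream carries each cursor through
    # its run of name characters; masking with ~name leaves one bit at the
    # first non-name position after every reference start.
    after_name = (cursor + name) & ~name
    ref_end = after_name & semicolon        # name properly closed by ';'
    ref_error = after_name & ~semicolon     # anything else is an error
    return ref_start, ref_end, ref_error


text = "Well Formed &lt; Erroneous &gt!"
ref_start, ref_end, ref_error = parse_refs(text)
for label, stream in (("RefStart", ref_start), ("RefEnd", ref_end),
                      ("RefError", ref_error)):
    print(f"{label:9}{show(stream, len(text))}")
```
]]></programlisting>
            <para>All references in the text are processed by the same few integer operations,
               regardless of how many there are or how long their names run.</para>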
269
270         </section>
271         <section>
272            <title>Deletion Mask Streams</title>
273            <para>The marking and subsequent deletion of source stream character code unit positions
274               represents a core XML processing operation. The delmask (deletion mask) stream marks
275               character code unit positions for deletion. Several cases arise commonly in XML
276               processing. Examples include UTF-8 to UTF-16 transcoding, XML end-of-line handling,
277               predefined entity replacement, and removal of CDATA section delimiters. Several algorithms to
278               delete bits at positions marked by delmask are possible [<xref linkend="u8u16"/>]. Because
279               any number of deletion masks may be combined by bitwise OR, a single invocation of a
280               SIMD-based parallel deletion may perform the deletions accumulated across a number of XML
281               processing stages. </para>
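            <para>The accumulation of deletion masks can be illustrated with a small Python
               sketch. It is a sequential stand-in for a SIMD parallel deletion, and it makes
               simplifying assumptions: stream position <emphasis>i</emphasis> is held in bit
               <emphasis>i</emphasis> of a Python integer, and CDATA delimiters are found by
               plain substring search without regard to markup context. The helper names are
               hypothetical.</para>
            <programlisting language="python"><![CDATA[
```python
CDATA_START = b"<![CDATA["
# Written in two pieces only so that this listing can sit inside an XML
# CDATA section; the value is the three-byte CDATA end delimiter.
CDATA_END = b"]]" + b">"


def substring_mask(data, pattern):
    """Mark every byte position covered by an occurrence of pattern."""
    mask = 0
    start = data.find(pattern)
    while start != -1:
        mask |= ((1 << len(pattern)) - 1) << start
        start = data.find(pattern, start + len(pattern))
    return mask


def crlf_mask(data):
    """Mark each CR that is immediately followed by LF."""
    cr = sum(1 << i for i, b in enumerate(data) if b == 0x0D)
    lf = sum(1 << i for i, b in enumerate(data) if b == 0x0A)
    return cr & (lf >> 1)


def delete_marked(data, delmask):
    """Sequential stand-in for a SIMD parallel deletion."""
    return bytes(b for i, b in enumerate(data) if not (delmask >> i) & 1)


data = b"a\r\n" + CDATA_START + b"x" + CDATA_END + b"b"
# Masks from independent stages are ORed, then applied in one deletion pass.
delmask = (crlf_mask(data)
           | substring_mask(data, CDATA_START)
           | substring_mask(data, CDATA_END))
result = delete_marked(data, delmask)
print(result)
```
]]></programlisting>
            <para>The end-of-line stage and the CDATA stage each contribute a mask, but the data is
               compacted only once.</para>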
282         </section>
283
284      </section>
285
286      <section>
287         <title>Lexical Item Streams</title>
288         <para>Lexical item streams differ from traditional streams of tokens in that they are bit
289            streams that mark the positions of tokens, whitespace or delimiters. Differentiation
290            between the actual tokens that may occur at a particular point (e.g., the different XML
291            tokens that begin with “&lt;”) may be performed using multicharacter recognizers on the
292            bytestream representation [<xref linkend="CASCON08"/>]. </para>
293         <para>A key role of lexical item streams in XML parsing is to facilitate fast scanning
294            operations. For example, a LeftAngle lexical item stream may be formed to identify those
295            character code unit positions at which a “&lt;” character occurs. Hardware register
296            bit scan operations may then be used by the XML parser on the LeftAngle stream to
297            efficiently identify the position of the next “&lt;”. Based on the capabilities of
298            current commodity processors, a single register bit scan operation may effectively scan
299            up to 64 byte positions with a single instruction.</para>
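         <para>The scanning step can be modelled in Python as follows. The sketch assumes that
            stream position <emphasis>i</emphasis> is held in bit <emphasis>i</emphasis> of an
            integer, so that finding the next marked position amounts to finding the lowest set
            bit, which is what a hardware bit scan instruction does within one register. Python
            integers are unbounded, so the 64-position limit of a real register does not appear
            here. The function names are hypothetical.</para>
         <programlisting language="python"><![CDATA[
```python
def marker_stream(data, byte_value):
    """Bit i of the result is set when data[i] == byte_value."""
    stream = 0
    for i, b in enumerate(data):
        if b == byte_value:
            stream |= 1 << i
    return stream


def scan_to_next(stream, start):
    """Return the first marked position >= start, or -1 if there is none."""
    remaining = stream >> start
    if remaining == 0:
        return -1
    lowest = remaining & -remaining        # isolate the lowest set bit
    return start + lowest.bit_length() - 1


text = b"<a><b>text</b></a>"
left_angle = marker_stream(text, ord("<"))

positions = []
pos = scan_to_next(left_angle, 0)
while pos != -1:
    positions.append(pos)
    pos = scan_to_next(left_angle, pos + 1)
print(positions)
```
]]></programlisting>
         <para>Each call skips over all intervening character data in one step instead of
            examining it byte by byte.</para>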
300         <para>Overall, computing the full set of lexical item streams
301            requires approximately 1.0 CPU cycles per byte when implemented for 128 positions at a
302            time using 128-bit SSE registers on Intel Core2 processors [<xref linkend="CASCON08"/>].
303            The following table describes the core lexical item streams defined by the Parabix XML
304            parser. </para>
305         <para>
306            <table>
307               <caption>
308                  <para>Lexical item stream descriptions.</para>
309               </caption>
310               <tbody>
311                  <tr>
312                     <td align="left">
313                        NonWS
314                     </td>
315                     <td align="left">
316                        Marks the position of any non-whitespace character.
317                     </td>
318                  </tr>
319                  <tr>
320                     <td align="left">
321                        MarkupStart
322                     </td>
323                     <td align="left">
324                        Marks the position of the start of XML markup.
325                     </td>
326                  </tr>
327                  <tr>
328                     <td align="left">
329                        CDATAEnd
330                     </td>
331                     <td align="left">
332                        Marks the position of the end of any CDATA section, that is, any
333                           position where "]]&gt;" appears in the XML text.
334
335                     </td>
336                  </tr>
337                  <tr>
338                     <td align="left">
339                        Hyphen
340                     </td>
341                     <td align="left">
342                        Marks the position of any hyphen character.
343                     </td>
344                  </tr>
345                  <tr>
346                     <td align="left">
347                        QMark
348                     </td>
349                     <td align="left">
350                        Marks the position of any question mark character.
351                     </td>
352                  </tr>
353                  <tr>
354                     <td align="left">
355                        Quote
356                     </td>
357                     <td align="left">
358                        Marks the position of any single or double quote character.
359                     </td>
360                  </tr>
361                  <tr>
362                     <td align="left">
363                        NameFollow
364                     </td>
365                     <td align="left">
366                        Marks the position of any character that can follow an XML name in a
367                           well-formed XML document.
368                     </td>
369                  </tr>
370               </tbody>
371            </table>
372         </para>
373         <para> The following table illustrates the various lexical items. <table>
374            <caption>
375               <para>Lexical Item Streams</para>
376            </caption>
377            <colgroup><col align="left" valign="top" /></colgroup>
378            <tbody>     <tr valign="top"><td>XML</td><td><code>&lt;tag attrib=&apos;value&apos;&gt; -- ]]&gt; &lt;nested  attribute=&quot;value&quot;&gt;&lt;/tag&gt;</code></td></tr>
379               <tr valign="top"><td>LAngle</td><td><code>1---------------------------1--------------------------1-----</code></td></tr>
380               <tr valign="top"><td>Hyphen</td><td><code>---------------------11--------------------------------------</code></td></tr>
381               <tr valign="top"><td>QMark</td><td><code>-------------------------------------------------------------</code></td></tr>
382               <tr valign="top"><td>NonWS</td><td><code>1111-111111111111111-11-111-1111111--111111111111111111111111</code></td></tr>
383               <tr valign="top"><td>Quote</td><td><code>------------1-----1----------------------------1-----1-------</code></td></tr>
384               <tr valign="top"><td>CDATAEnd</td><td><code>--------------------------1----------------------------------</code></td></tr>
385               <tr valign="top"><td>NameFollow</td><td><code>----1------1-------11--1--11-------11---------1-------1-1---1</code></td></tr>
386            </tbody>
387         </table>
388         </para>
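         <para>Single-character lexical item streams such as LAngle, Hyphen and QMark can be
            formed from the basis streams with bitwise logic alone. The Python sketch below
            shows the idea for the example text in the table; it is not the Parabix
            implementation, which would share common subexpressions across character classes.
            It assumes that stream position <emphasis>i</emphasis> is held in bit
            <emphasis>i</emphasis> of an integer and that <code>Bit0</code> is the most
            significant bit of each byte. The function names are hypothetical.</para>
         <programlisting language="python"><![CDATA[
```python
def transpose(data):
    """Return the basis streams [Bit0, ..., Bit7]; Bit0 is the MSB."""
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << pos
    return basis


def byte_equals(basis, value, length):
    """Stream marking positions whose byte equals value, using only AND/NOT."""
    all_ones = (1 << length) - 1
    result = all_ones
    for k in range(8):
        wanted = (value >> (7 - k)) & 1
        result &= basis[k] if wanted else (~basis[k] & all_ones)
    return result


def show(stream, length):
    return "".join("1" if (stream >> i) & 1 else "-" for i in range(length))


# The literal is split only so this listing can sit inside an XML CDATA section.
TEXT = (b"<tag attrib='value'> -- ]]" b"> <nested  attribute=\"value\"></tag>")
basis = transpose(TEXT)
lexical = {
    "LAngle": byte_equals(basis, ord("<"), len(TEXT)),
    "Hyphen": byte_equals(basis, ord("-"), len(TEXT)),
    "QMark": byte_equals(basis, ord("?"), len(TEXT)),
}
for name, stream in lexical.items():
    print(f"{name:7}{show(stream, len(TEXT))}")
```
]]></programlisting>
         <para>The same eight basis streams serve every character class, so the cost of
            transposition is shared by all lexical item streams.</para>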
389      </section>
390
391      <section>
392         <title>UTF-8 Classification and Validation Streams</title>
393         <para> An XML parser must accept the UTF-8 encoding of Unicode [<xref linkend="XML10"/>].
394            It is a fatal error if an XML document determined to be in UTF-8 contains byte sequences
395            that are not legal in that encoding. UTF-8 byte classification, scope and error flag bit
396            streams are defined to validate UTF-8 byte sequences as well as to support transcoding
397            to UTF-16, if desired.</para>
398
399         <section>
400            <title>UTF-8 Byte Classification Streams</title>
401            <para>UTF-8 byte classification bit streams classify UTF-8 bytes based on their role in
402               forming single and multibyte sequences. The u8Prefix and u8Suffix bit streams
403               identify bytes that represent, respectively, prefix or suffix bytes of multibyte
404               sequences. The u8UniByte bit stream identifies those bytes that may be considered
405               single-byte sequences. The u8Prefix2, u8Prefix3, and u8Prefix4 streams refine u8Prefix,
406               indicating prefixes of two-, three-, and four-byte sequences, respectively.</para>
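            <para>The classification streams follow directly from the leading bits of each byte,
               so they can be written as short expressions over the basis streams. The Python
               sketch below is illustrative only. It assumes that stream position
               <emphasis>i</emphasis> is held in bit <emphasis>i</emphasis> of an integer and
               that <code>Bit0</code> is the most significant bit of each byte, and it does not
               reject byte values that can never occur in UTF-8 (for example 0xF8 to 0xFF),
               which is left to validation. The function names are hypothetical.</para>
            <programlisting language="python"><![CDATA[
```python
def transpose(data):
    """Return the basis streams [Bit0, ..., Bit7]; Bit0 is the MSB."""
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << pos
    return basis


def classify_utf8(data):
    """Compute UTF-8 byte classification streams from the basis streams."""
    ones = (1 << len(data)) - 1
    b0, b1, b2, b3, b4 = transpose(data)[:5]

    def complement(stream):
        return ~stream & ones

    prefix = b0 & b1                                       # 11xxxxxx
    return {
        "u8UniByte": complement(b0),                       # 0xxxxxxx
        "u8Suffix": b0 & complement(b1),                   # 10xxxxxx
        "u8Prefix": prefix,
        "u8Prefix2": prefix & complement(b2),              # 110xxxxx
        "u8Prefix3": prefix & b2 & complement(b3),         # 1110xxxx
        "u8Prefix4": prefix & b2 & b3 & complement(b4),    # 11110xxx
    }


def show(stream, length):
    return "".join("1" if (stream >> i) & 1 else "0" for i in range(length))


# "A Text in Farsi:" followed by two-byte UTF-8 sequences, given as hex.
data = bytes.fromhex(
    "41 20 54 65 78 74 20 69 6E 20 46 61 72 73 69 3A "
    "D9 89 D8 B3 D8 B1 D8 A7 D9 81 20 D9 86 D8 AA D9 85 20 D9 83 D9 89"
)
classes = classify_utf8(data)
for name in ("u8UniByte", "u8Prefix", "u8Suffix"):
    print(f"{name:10}{show(classes[name], len(data))}")
```
]]></programlisting>
            <para>For this input every prefix is a two-byte prefix, so <code>u8Prefix2</code>
               equals <code>u8Prefix</code> and the three- and four-byte prefix streams are
               empty.</para>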
407         </section>
408
409         <section>
410            <title>UTF-8 Scope Streams</title>
411            <para> Scope streams represent expectations established by prefix bytes. For example,
412               bit stream u8Scope22 represents the positions at which the second byte of a two-byte
413               sequence is expected based on the occurrence of a two-byte prefix in the immediately
414               preceding position. The u8Scope32, u8Scope33, u8Scope42, u8Scope43, and u8Scope44
415               streams complete the set of UTF-8 scope streams.</para>
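            <para>Scope streams can be obtained by advancing the prefix streams by one, two or
               three positions. The Python sketch below also derives a mismatch stream by
               comparing the union of the scopes with the suffix stream. It is a simplified
               illustration under assumptions, not a complete validator: stream position
               <emphasis>i</emphasis> is held in bit <emphasis>i</emphasis> of an integer, so
               advancing is a left shift, and checks for overlong forms, surrogates and
               out-of-range code points are omitted. The function names are hypothetical.</para>
            <programlisting language="python"><![CDATA[
```python
def class_stream(data, low, high):
    """Bit i of the result is set when low <= data[i] <= high."""
    stream = 0
    for i, b in enumerate(data):
        if low <= b <= high:
            stream |= 1 << i
    return stream


def utf8_scopes(data):
    suffix = class_stream(data, 0x80, 0xBF)
    prefix2 = class_stream(data, 0xC0, 0xDF)
    prefix3 = class_stream(data, 0xE0, 0xEF)
    prefix4 = class_stream(data, 0xF0, 0xF7)

    scopes = {
        "u8Scope22": prefix2 << 1,
        "u8Scope32": prefix3 << 1,
        "u8Scope33": prefix3 << 2,
        "u8Scope42": prefix4 << 1,
        "u8Scope43": prefix4 << 2,
        "u8Scope44": prefix4 << 3,
    }
    any_scope = 0
    for stream in scopes.values():
        any_scope |= stream
    # A suffix byte is expected exactly where some scope is set: flag
    # positions where one occurs without the other.  A bit at or beyond
    # len(data) indicates a sequence truncated by the end of input.
    scopes["mismatch"] = any_scope ^ suffix
    return scopes


# U+00E9 (two bytes) followed by U+20AC (three bytes), given as hex.
example = utf8_scopes(bytes.fromhex("C3 A9 E2 82 AC"))
print({name: bin(stream) for name, stream in example.items()})
```
]]></programlisting>
            <para>A zero mismatch stream means that every suffix byte was expected and every
               expectation was met; any set bit marks a candidate error position.</para>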
416            <para> The following examples demonstrate the UTF-8 character encoding validation
417               process using parallel bit stream techniques. The result of this validation process
418               is an error flag stream marking those positions at which errors are detected.</para>
419            <para> 
420               <table>
421                  <caption>
422                     <para>UTF-8 Scope Streams</para>
423                  </caption>
424                  <colgroup><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /><col align="left" valign="top" /></colgroup>
425                  <tbody>       <tr valign="top"><td>XML</td><td colspan="1"><code>A</code></td><td colspan="1"><code> </code></td><td colspan="1"><code>T</code></td><td colspan="1"><code>e</code></td><td colspan="1"><code>x</code></td><td colspan="1"><code>t</code></td><td colspan="1"><code> </code></td><td colspan="1"><code>i</code></td><td colspan="1"><code>n</code></td><td colspan="1"><code> </code></td><td colspan="1"><code>F</code></td><td colspan="1"><code>a</code></td><td colspan="1"><code>r</code></td><td colspan="1"><code>s</code></td><td colspan="1"><code>i</code></td><td colspan="1"><code>:</code></td><td colspan="2"><code>ى</code></td><td colspan="2"><code>س</code></td><td colspan="2"><code>ر</code></td><td colspan="2"><code>ا</code></td><td colspan="2"><code>ف</code></td><td colspan="1"><code> </code></td><td colspan="2"><code>ن</code></td><td colspan="2"><code>ت</code></td><td colspan="2"><code>م</code></td><td colspan="1"><code> </code></td><td colspan="2"><code>ك</code></td><td colspan="2"><code>ى</code></td></tr>
426                     <tr valign="top"><td>UTF-8</td><td><code>41</code></td><td><code>20</code></td><td><code>54</code></td><td><code>65</code></td><td><code>78</code></td><td><code>74</code></td><td><code>20</code></td><td><code>69</code></td><td><code>6E</code></td><td><code>20</code></td><td><code>46</code></td><td><code>61</code></td><td><code>72</code></td><td><code>73</code></td><td><code>69</code></td><td><code>3A</code></td><td><code>D9</code></td><td><code>89</code></td><td><code>D8</code></td><td><code>B3</code></td><td><code>D8</code></td><td><code>B1</code></td><td><code>D8</code></td><td><code>A7</code></td><td><code>D9</code></td><td><code>81</code></td><td><code>20</code></td><td><code>D9</code></td><td><code>86</code></td><td><code>D8</code></td><td><code>AA</code></td><td><code>D9</code></td><td><code>85</code></td><td><code>20</code></td><td><code>D9</code></td><td><code>83</code></td><td><code>D9</code></td><td><code>89</code></td></tr>
427                     <tr valign="top"><td>u8UniByte</td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td></tr>
428                     <tr valign="top"><td>u8Prefix</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td></tr>
429                     <tr valign="top"><td>u8Suffix</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td></tr>
430                     <tr valign="top"><td>u8Prefix2</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td></tr>
431                     <tr valign="top"><td>u8Scope22</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>0</code></td><td><code>1</code></td><td><code>0</code></td><td><code>1</code></td></tr>
432                     <tr valign="top"><td>u8Error</td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td><td><code>0</code></td></tr>
433                  </tbody>
434               </table>
435               
436            </para>
437         </section>
438
439         <section>
440            <title>UTF-8 to UTF-16 Transcoding</title>
441            <para>UTF-8 is often preferred for storage and data exchange. Although it is suitable
442               for processing, it is significantly more complex to process than UTF-16 [<xref
443                  linkend="Unicode"/>]. Consequently, XML documents are often encoded in UTF-8 for
444               serialization and transport and then transcoded to UTF-16 for processing with
445               languages such as Java and C#. Following the parallel bit stream methods developed
446               for u8u16, a high-performance standalone UTF-8 to UTF-16 transcoder [<xref
447                  linkend="u8u16"/>], transcoding to UTF-16 may be achieved by computing a set of
448               sixteen bit streams, one for each bit of a UTF-16 code unit. </para>
449            <para>The bit streams for UTF-16 are conveniently divided into two groups: the eight streams
450               U16Hi0, U16Hi1, ..., U16Hi7 for the high byte of each UTF-16 code unit and the eight
451               streams U16Lo0, U16Lo1, ..., U16Lo7 for the low byte. Upon conversion of the parallel bit
452               stream data back to byte streams, eight sequential byte streams U16Hi0, U16Hi1, ...,
453               U16Hi7 are used for the high byte of each UTF-16 code unit, while U16Lo0, U16Lo1, ...,
454               U16Lo7 are used for the corresponding low byte. Interleaving these streams then
455               produces the full UTF-16 doublebyte stream.</para>
456         </section>
457
458         <section>
459            <title>UTF-8 Indexed UTF-16 Streams</title>
460            <para>UTF-16 bit streams are initially defined in UTF-8 indexed form, that is, with sets
461               of bits in one-to-one correspondence with UTF-8 bytes. However, only one set of
462               UTF-16 bits is required for encoding two- or three-byte UTF-8 sequences and only two
463               sets are required for surrogate pairs corresponding to four-byte UTF-8 sequences. The
464               u8LastByte (u8UniByte, u8Scope22, u8Scope33, and u8Scope44) and u8Scope42 streams
465               mark the positions at which the correct UTF-16 bits are computed. The bit sets at
466               other positions must be deleted to compress the streams to UTF-16 indexed form.
467            </para>
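            <para>The relationship between the UTF-8 indexed and UTF-16 indexed forms can be
               sketched in scalar Python. This is only an illustration of the indexing idea, not the
               SIMD bit stream implementation: the function names are hypothetical, the input is
               assumed to be valid UTF-8, and the high surrogate is assumed to be computed at the
               second byte of a four-byte sequence (taken here to be the u8Scope42 position).</para>
            <programlisting language="python"><![CDATA[
```python
def utf8_indexed_utf16(data: bytes):
    """Return (units, keep): one 16-bit value per UTF-8 byte position, plus a
    mask of the positions whose value survives deletion.

    Assumes `data` is valid UTF-8; no error checking is attempted.
    """
    units = [0] * len(data)
    keep = [False] * len(data)
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # single byte: value computed at i
            units[i], keep[i] = b, True
            i += 1
        elif b < 0xE0:                    # two bytes: value computed at i + 1
            units[i + 1] = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            keep[i + 1] = True
            i += 2
        elif b < 0xF0:                    # three bytes: value computed at i + 2
            units[i + 2] = (((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
                            | (data[i + 2] & 0x3F))
            keep[i + 2] = True
            i += 3
        else:                             # four bytes: surrogate pair
            cp = (((b & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12)
                  | ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F)) - 0x10000
            units[i + 1], keep[i + 1] = 0xD800 | (cp >> 10), True    # second byte
            units[i + 3], keep[i + 3] = 0xDC00 | (cp & 0x3FF), True  # last byte
            i += 4
    return units, keep


def compress(units, keep):
    """Delete the positions not marked in `keep`, giving UTF-16 indexed form."""
    return [u for u, k in zip(units, keep) if k]
```
]]></programlisting>
            <para>Every position left unmarked in <code>keep</code> plays the role of a deleted bit
               set; only the marked positions contribute code units to the output.</para>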
468         </section>
469      </section>
470
471      <section>
472         <title>XML Character Error Streams</title>
473         <para>Legal characters in XML are the tab, carriage return, and line feed characters
474            together with all Unicode characters from hexadecimal 20 upward, excluding the surrogate
475            blocks, hexadecimal FFFE and hexadecimal FFFF [<xref linkend="XML10"/>]. The XML character
476            error stream marks the positions of all characters outside this set and defines error
477            positions in the source XML byte stream.</para>
478      </section>
479
480      <section>
481         <title>XML 1.0 End-of-line Handling Streams</title>
482         <para>In XML 1.0 the two-character sequence CR LF (carriage return, line feed) together
483            with any CR character not followed by a LF character must be converted to a single LF
484            character [<xref linkend="XML10"/>].</para>
485         <para>By defining carriage return, line feed, and carriage return line feed bit streams,
486            denoted CR, LF and CRLF respectively, end-of-line normalization processing can be
487            performed in parallel, using only a small number of logical and shift operations.</para>
489         <para>The following example demonstrates the generation of the CRLF deletion mask. In this
490            example, where C and L stand for CR and LF, the positions of all CR characters followed by
491            LF characters are marked for deletion. Isolated carriage returns are then replaced with LF
492            characters. Completion of this process satisfies the XML 1.0 end-of-line handling requirements.</para>
493         <para>
494            <table>
495               <caption>
496                  <para>XML 1.0 End-of-line Handling</para>
497               </caption>
498               <colgroup><col align="left" valign="top" /></colgroup>
499               <tbody>  <tr valign="top"><td>XML</td><td><code>first line C second line CL third line L one more C nothing left</code></td></tr>
500                  <tr valign="top"><td>CR</td><td><code>-----------1-------------1------------------------1-------------</code></td></tr>
501                  <tr valign="top"><td>LF</td><td><code>--------------------------1------------1------------------------</code></td></tr>
502                  <tr valign="top"><td>Delmask</td><td><code>-------------------------1--------------------------------------</code></td></tr>
503               </tbody>
504            </table>
505           
506         </para>
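         <para>The mask in the table above can be reproduced with a short sketch that models each
            bit stream as a Python integer, with bit <code>i</code> standing for byte position
            <code>i</code>. The helper names are hypothetical, and the final loop applies the mask
            one position at a time rather than with the parallel deletion a SIMD implementation
            would use.</para>
         <programlisting language="python"><![CDATA[
```python
def char_stream(text, ch):
    """Bit stream for `ch`: bit i is set when text[i] == ch (bit 0 is the first position)."""
    return sum(1 << i for i, c in enumerate(text) if c == ch)


def show(stream, length):
    """Render a bit stream in the notation of the table above."""
    return "".join("1" if (stream >> i) & 1 else "-" for i in range(length))


def normalize_eol(text):
    cr = char_stream(text, "\r")
    lf = char_stream(text, "\n")
    # A CR is deleted when the next position holds an LF. Bit i stands for
    # position i, so shifting LF right by one aligns each LF with the
    # position immediately before it.
    delmask = cr & (lf >> 1)
    lone_cr = cr & ~delmask               # isolated CRs become LFs
    out = []
    for i, c in enumerate(text):
        if (delmask >> i) & 1:
            continue                      # CR of a CR LF pair: dropped
        out.append("\n" if (lone_cr >> i) & 1 else c)
    return "".join(out), delmask


# The sample from the table, with C and L replaced by real CR and LF characters.
sample = "first line C second line CL third line L one more C nothing left"
sample = sample.replace("C", "\r").replace("L", "\n")
normalized, delmask = normalize_eol(sample)
print(show(delmask, len(sample)))
```
]]></programlisting>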
507      </section>
508
509      <section>
510         <title>Comment, Processing Instruction, CDATA Section Streams</title>
511         <para>Comments, processing instructions and CDATA sections represent sections of an XML
512            document which may contain markup that is not interpreted by the XML processor. As such,
513            the union of comment, processing instruction and CDATA section extents defines regions of
514            non-interpretable markup in an XML document. The stream formed by this union is termed
515            the ignorable markup stream. The purpose of the ignorable markup stream is to mark
516            the positions of all non-interpreted XML markup for deletion.</para>
517         <para>The following table provides an example of marking comment, processing instruction and CDATA section extents. <table>
518            <caption>
519               <para>Comment, Processing Instruction and CDATA Streams</para>
520            </caption>
521            <colgroup><col align="left" valign="top" /></colgroup>
522            <tbody>     <tr valign="top"><td>XML</td><td><code>&lt;!-- do a&amp;b --&gt; &lt;?php f(a&amp;b) ?&gt; &lt;!-- show x&lt;&lt;1 --&gt;&lt;![CDATA[abcdedf x&lt;&lt;1 ]]&gt;</code></td></tr>
523               <tr valign="top"><td>Comment</td><td><code>111111111111111-----------------111111111111111111-------------------------</code></td></tr>
524               <tr valign="top"><td>CDATA</td><td><code>--------------------------------------------------1111111111111111111111111</code></td></tr>
525               <tr valign="top"><td>PI</td><td><code>----------------111111111111111--------------------------------------------</code></td></tr>
526            </tbody>
527         </table>
528           
529         </para>
530         <para> With the removal of all non-interpretable markup, several phases of parallel bit
531            stream based SIMD operations may follow, operating in parallel on up to 128 byte positions
532            on current commodity processors, with the assurance that all remaining markup is relevant. For
533            example, with the removal of comments, processing instructions and CDATA sections, XML
534            names may be identified and length sorted for efficient symbol table construction. </para>
535         <para> As an aside, comments and CDATA sections must first be validated to ensure that
536            comments do not contain "--" sequences and that CDATA sections do not contain
537            "]]&gt;" sequences prior to ignorable markup stream generation.</para>
538      </section>
539
540
541      <section>
542         <title>Predefined Entity Deletion Streams</title>
543         <para>Predefined entity references (<![CDATA[&lt;,&gt;,&amp;,&apos;,&quot;]]>)
544            and numeric character references (&amp;#nnnn;, &amp;#xhhhh;) must each be replaced by
545            a single character [<xref linkend="XML10"/>]. Using a strategy analogous to that used
546            for comments, processing instructions and CDATA sections, marking the union of all
547            reference byte position extents in bit space, with the exception of the final bit
548            position of each reference, defines the deletion mask stream for predefined
549         entities.</para>
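         <para>A minimal sketch of such a deletion mask follows, again modelling a bit stream as a
            Python integer with bit <code>i</code> standing for position <code>i</code>. The regular
            expression and function names are illustrative assumptions; a bit stream implementation
            would locate references with parallel operations rather than a sequential scan, and no
            validation of the referenced code points is attempted here.</para>
         <programlisting language="python"><![CDATA[
```python
import re

# Predefined entity references and decimal/hexadecimal character references.
REFERENCE = re.compile(r"&(?:lt|gt|amp|apos|quot|#[0-9]+|#x[0-9A-Fa-f]+);")
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


def reference_delmask(text):
    """Bit i is set when position i lies in a reference but is not its final position."""
    mask = 0
    for m in REFERENCE.finditer(text):
        width = m.end() - m.start() - 1       # every position except the last
        mask |= ((1 << width) - 1) << m.start()
    return mask


def expand(text):
    """Drop the masked positions and write each replacement at the final position."""
    mask = reference_delmask(text)
    finals = {m.end() - 1: m.group()[1:-1] for m in REFERENCE.finditer(text)}
    out = []
    for i, c in enumerate(text):
        if (mask >> i) & 1:
            continue
        if i in finals:
            body = finals[i]
            if body.startswith("#x"):
                c = chr(int(body[2:], 16))
            elif body.startswith("#"):
                c = chr(int(body[1:]))
            else:
                c = PREDEFINED[body]
        out.append(c)
    return "".join(out)
```
]]></programlisting>
         <para>Leaving the final position of each reference unmarked gives the replacement character
            a position to occupy once the masked positions are deleted.</para>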
550      </section>
551
552      <section>
553         <title>Parallel Parsing with Bit Stream Addition Streams</title>
554         <para>Whereas sequential bit scans over lexical item streams form the basis of XML parsing
555            in the current Parabix parser, a new method of parallel parsing has been developed and
556            prototyped using the concept of bitstream addition. Fundamental to this method is the
557            concept of a <emphasis>cursor</emphasis> stream, a bit stream marking the positions of
558            multiple parallel parses currently in progress. </para>
559         <para>The results of parsing using the bit stream addition technique are produced using a
560            series of <emphasis>call-out</emphasis> bit streams. These streams mark the beginning
561            and end of each start tag, end tag and empty tag. Within tags, additional streams exist
562            to mark start and end positions for tag names, attribute names and attribute values. An
563            error flag stream marks the positions of any syntactic errors encountered during
564            parsing.</para>
565         <para>
566            <table>
567               <caption>
568                  <para>Call Out Streams for Parallel Parsing</para>
569               </caption>
570               <colgroup><col align="left" valign="top" /></colgroup>
571               <tbody>  <tr valign="top"><td>XML</td><td><code>&lt;first att1=&quot;val1&quot;&gt;&lt;second/&gt;&lt;third wrong=value&gt;some text&lt;/third&gt;&lt;/first/&gt;</code></td></tr>
572                  <tr valign="top"><td>ElemNamePositions</td><td><code>-1------------------1--------1-------------------------------------------</code></td></tr>
573                  <tr valign="top"><td>ElemNameFollows</td><td><code>------1-------------------1-------1--------------------------------------</code></td></tr>
574                  <tr valign="top"><td>STagEnds</td><td><code>------------------1------------------------------------------------------</code></td></tr>
575                  <tr valign="top"><td>EmptyTagEnds</td><td><code>---------------------------1---------------------------------------------</code></td></tr>
576                  <tr valign="top"><td>ParseError</td><td><code>-----------------------------------------1-----------------------------1-</code></td></tr>
577                  <tr valign="top"><td>AttNameStarts</td><td><code>-------1---------------------------1-------------------------------------</code></td></tr>
578                  <tr valign="top"><td>AttNameFollows</td><td><code>-----------1----------------------------1--------------------------------</code></td></tr>
579                  <tr valign="top"><td>AttValStarts</td><td><code>------------1----------------------------1-------------------------------</code></td></tr>
580                  <tr valign="top"><td>AttValEnds</td><td><code>-----------------1-------------------------------------------------------</code></td></tr>
581                  <tr valign="top"><td>EndTagSeconds</td><td><code>---------------------------------------------------------1-------1-------</code></td></tr>
582                  <tr valign="top"><td>EndTagEnds</td><td><code>---------------------------------------------------------------1-------1-</code></td></tr>
583               </tbody>
584            </table>
585         </para>
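         <para>The role of addition can be illustrated with a short sketch that derives the
            ElemNamePositions and ElemNameFollows rows of the table above. It models bit streams as
            Python integers with bit <code>i</code> standing for position <code>i</code>, so that
            carries propagate forward through the text. The helper names, the ASCII-only name
            character class and the particular <code>scan_thru</code> formulation are assumptions
            made for illustration, and comments, processing instructions and CDATA sections are
            assumed to have been removed already.</para>
         <programlisting language="python"><![CDATA[
```python
import string

NAME_CHARS = set(string.ascii_letters + string.digits + "_-.:")  # ASCII subset only


def class_stream(text, chars):
    """Bit i is set when text[i] is one of `chars` (bit 0 is the first position)."""
    return sum(1 << i for i, c in enumerate(text) if c in chars)


def show(stream, length):
    """Render a bit stream in the notation of the table above."""
    return "".join("1" if (stream >> i) & 1 else "-" for i in range(length))


def advance(cursors):
    """Move every cursor forward by one position."""
    return cursors << 1


def scan_thru(cursors, char_class):
    """Move every cursor past the run of char_class positions it sits on.

    Adding a cursor bit to a run of 1 bits ripples a carry to the first
    position after the run; masking with ~char_class then removes the runs
    that no cursor touched.
    """
    return (cursors + char_class) & ~char_class


text = '<first att1="val1"><second/><third wrong=value>some text</third></first/>'
langle = class_stream(text, "<")
slash = class_stream(text, "/")
name_char = class_stream(text, NAME_CHARS)

# One cursor per start tag or empty tag, all moved in the same operations.
elem_name_positions = advance(langle) & ~slash
elem_name_follows = scan_thru(elem_name_positions, name_char)

print(show(elem_name_positions, len(text)))
print(show(elem_name_follows, len(text)))
```
]]></programlisting>
         <para>A single addition moves all three cursors to the ends of their element names at once,
            regardless of how long each name is.</para>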
586
587      </section>
588
589   </section>
590   <section>
591      <title>SIMD Beyond Bitstreams: Names and Numbers</title>
592
593      <para>Whereas the fundamental innovation of our work is the use of SIMD technology in
594         implementing parallel bit streams for XML, there are also important ways in which more
595         traditional byte-oriented SIMD operations can be useful in accelerating other aspects of
596         XML processing.</para>
597
598      <section>
599         <title>Name Lookup</title>
600         <para>Efficient symbol table mechanisms for looking up element and attribute names are
601            important for almost all XML processing applications. Name lookup is also an important technique
602            merely for assessing well-formedness of an XML document; rather than validating the
603            character-by-character composition of each occurrence of an XML name as it is
604            encountered, it is more efficient to validate all but the first occurrence by first
605            determining whether the name already exists in a table of prevalidated names.</para>
606
607         <para>The first symbol table mechanism deployed in the Parabix parser simply used the
608            hashmaps of the C++ standard template library, without deploying any SIMD technology.
609            However, with the overhead of character validation, transcoding and parsing dramatically
610            reduced by parallel bit stream technology, we found that symbol lookups then accounted
611            for about half of the remaining execution time in a statistics gathering application
612               [<xref linkend="CASCON08"/>]. Thus, symbol table processing was identified as a major
613            target for further performance improvement. </para>
614         <para> Our first effort to improve symbol table performance was to employ the splash tables
615            with cuckoo hashing as described by Ross [<xref linkend="Ross06"/>], using SIMD
616            technology for parallel bucket processing. Although this technique did turn out to have
617            the advantage of virtually constant-time performance even for very large vocabularies,
618            it was not particularly helpful for the relatively small vocabularies typically found in
619            XML document processing. </para>
620         <para> However, a second approach has been found to be quite useful, taking advantage of
621            parallel bit streams for cheap determination of symbol length. In essence, the length of
622            a name can be determined very cheaply using a single bit scan operation. This then makes
623            it possible to use length-sorted symbol table processing, as follows. First, the
624            occurrences of all names are stored in arrays indexed by length. Then the length-sorted
625            arrays may each be inserted into the symbol table in turn. The advantage of this is that
626            a separate loop may be written for each length. Length sorting makes for very efficient
627            name processing. For example, hash value computations and name comparisons can be made by
628            loading multibyte values and performing appropriate shifting and masking operations,
629            without the need for a byte-at-a-time loop. In initial experiments, this length-sorting
630            approach was found to reduce symbol lookup cost by a factor of two. </para>
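         <para>The length-sorted organization can be sketched as follows. The function name and the
            use of Python dictionaries in place of a hash table are illustrative assumptions; the
            point is only that every name in the inner loop has the same known length, so it can be
            loaded and compared as one multibyte value.</para>
         <programlisting language="python"><![CDATA[
```python
from collections import defaultdict


def length_sorted_insert(occurrences):
    """Assign symbol ids to name occurrences, one length class at a time.

    `occurrences` is a list of names as bytes. Returns a list of symbol ids
    parallel to `occurrences`, plus the per-length tables.
    """
    by_length = defaultdict(list)             # length -> indexes of occurrences
    for index, name in enumerate(occurrences):
        by_length[len(name)].append(index)

    ids = [None] * len(occurrences)
    tables = {}
    next_id = 0
    for length in sorted(by_length):
        table = tables.setdefault(length, {})
        for index in by_length[length]:
            # All names handled by this loop have the same length, so each is
            # loaded as a single integer instead of being walked byte by byte.
            key = int.from_bytes(occurrences[index], "little")
            if key not in table:
                table[key] = next_id
                next_id += 1
            ids[index] = table[key]
    return ids, tables
```
]]></programlisting>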
631         <para> Current research includes the application of SIMD technology to further enhance the
632            performance of length-sorted lookup. We have identified a promising technique for
633            parallel processing of multiple name occurrences using a parallel trie lookup technique.
634            Given an array of occurrences of names of a particular length, the first one, two or
635            four bytes of each name are gathered and stored in a linear array. SIMD techniques are
636            then used to compare these prefixes with the possible prefixes for the current position
637            within the trie. In general, a very small number of possibilities exist for each trie
638            node, allowing for fast linear search through all possibilities. Typically, the
639            parallelism is expected to exceed the number of possibilities to search through at each
640            node. With length-sorting to separate the top-level trie into many small subtries, we
641            expect only a single step of symbol lookup to be needed in most practical instances. </para>
642
643         <para>The gather step of this algorithm is actually a common technique in SIMD processing.
644            Instruction set support for gather operations is a likely future direction for SIMD
645            technology.</para>
646      </section>
647
648      <section>
649         <title>Numeric Processing</title>
650         <para> Many XML applications involve numeric data fields as attribute values or element
651            content. Although most current XML APIs uniformly return information to applications in
652            the form of character strings, it is reasonable to consider direct API support for
653            numeric conversions within a high-performance XML engine. With string to numeric
654            conversion such a common need, why leave it to application programmers? </para>
655         <para> High-performance string to numeric conversion using SIMD operations can also
656            considerably outperform the byte-at-a-time loops that most application programmers or
657            libraries might employ. A first step is reduction of ASCII bytes to corresponding
658            decimal nybbles using a SIMD packing operation. Then an inductive doubling algorithm
659            using SIMD operations may be employed. First, 16 sets of adjacent nybble values in the
660            range 0-9 can be combined in just a few SIMD operations to 16 byte values in the range
661            0-99. Then 8 sets of byte values may similarly be combined with further SIMD processing
662            to produce doublebyte values in the range 0-9999. Further combination of doublebyte
663            values into 32-bit integers and so on can also be performed using SIMD operations. </para>
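         <para>The inductive doubling idea can be sketched on a single 64-bit word holding eight
            ASCII digits. This is a scalar stand-in for the SIMD version, under stated assumptions:
            the function name is hypothetical, the nybble packing step is skipped (each digit keeps
            its own byte), and Python integers model the machine word.</para>
         <programlisting language="python"><![CDATA[
```python
def parse_8_digits(field: bytes) -> int:
    """Convert exactly eight ASCII decimal digits to an integer by pairwise combination."""
    if len(field) != 8 or not field.isdigit():
        raise ValueError("expected exactly eight ASCII digits")
    # Little-endian load: the first (most significant) digit lands in the lowest byte.
    v = int.from_bytes(field, "little") & 0x0F0F0F0F0F0F0F0F   # ASCII -> 0..9 per byte
    # Step 1: adjacent digits -> four 16-bit lanes, each holding 0..99.
    v = (v & 0x00FF00FF00FF00FF) * 10 + ((v >> 8) & 0x00FF00FF00FF00FF)
    # Step 2: adjacent pairs -> two 32-bit lanes, each holding 0..9999.
    v = (v & 0x0000FFFF0000FFFF) * 100 + ((v >> 16) & 0x0000FFFF0000FFFF)
    # Step 3: the two halves -> one value in 0..99999999.
    return (v & 0xFFFFFFFF) * 10000 + (v >> 32)
```
]]></programlisting>
         <para>Each step doubles the lane width and halves the number of lanes, so eight digits are
            combined in three steps rather than eight loop iterations.</para>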
664         <para> Using appropriate gather operations to bring numeric strings into appropriate array
665            structures, an XML engine could offer high-performance numeric conversion services to
666            XML application programmers. We expect this to be an important direction for our future
667            work, particularly in support of APIs that focus on direct conversion of XML data into
668            business objects. </para>
669
670      </section>
671   </section>
672
673   <section>
674      <title>APIs and Parallel Bit Streams</title>
675
676      <section>
677         <title>The ILAX Streaming API</title>
678         <para>The In-Line API for XML (ILAX) is the base API provided with the Parabix parser. It
679            is intended for low-level extensions compiled right into the engine, with minimum
680            possible overhead. It is similar to streaming event-based APIs such as SAX, but
681            implemented by inline substitution rather than using callbacks. In essence, an extension
682            programmer provides method bodies for event-processing methods declared internal to the
683            Parabix parsing engine, compiling the event processing code directly with the core code
684            of the engine. </para>
685         <para> Although ILAX can be used directly for application programming, its primary use is
686            for implementing engine extensions that support higher-level APIs. For example,
687            C or C++ streaming APIs based on the Expat [<xref
688               linkend="Expat"/>] or general SAX models can be implemented quite directly. C/C++ DOM
689            or other tree-based APIs can also be fairly directly implemented. However, delivering
690            Parabix performance to Java-based XML applications is challenging due to the
691            considerable overhead of crossing the Java Native Interface (JNI) boundary. This issue
692            is addressed with the Array Set Model (ASM) concept discussed in the following section. </para>
693         <para> With the recent development of parallel parsing using bitstream addition, it is
694            likely that the underlying ILAX interface of Parabix will change. In essence, ILAX
695            suffers the drawback of all event-based interfaces: they are fundamentally sequential in
696            nature. As research continues, we expect efficient parallel methods building on parallel
697            bit stream foundations to move up the stack of XML processing requirements. Artificially
698            imposing sequential processing is thus expected to constrain further advances in XML
699            performance. </para>
700      </section>
701
702      <section>
703         <title>Efficient XML in Java Using Array Set Models</title>
704         <para> In our GML-to-SVG case study, we identified the lack of high-performance XML
705            processing solutions for Java to be of particular interest. Java byte code does not
706            provide access to the SIMD capabilities of the underlying machine architecture. Java
707            just-in-time compilers might be capable of using some SIMD facilities, but there is no
708            real prospect of conventional compiler technology translating byte-at-a-time algorithms
709            into parallel bit stream code. So the primary vehicle for delivering high-performance
710            XML processing is to call native parallel bit stream code written in C through JNI
711            capabilities. </para>
712         <para>However, each JNI call is expensive, so it is desirable to minimize the number of
713            calls and get as much work done during each call as possible. This militates against
714            direct implementation of streaming APIs in Java through one-to-one mappings to an
715            underlying streaming API in C. Instead, we have concentrated on gathering information on
716            the C side into data structures that can then be passed to the Java side. However, using
717            either C pointer-based structures or C++ objects is problematic because these are
718            difficult to interpret on the Java side and are not amenable to Java's automatic storage
719            management system. Similarly, Java objects cannot be conveniently created on the C side.
720            However, it is possible to transfer arrays of simple data values (bytes or integers)
721            between C and Java, so that makes a reasonable focus for bulk data communication between
722            C and Java. </para>
723         <para><emphasis>Array Set Models</emphasis> are array-based representations of information
724            representing an XML document in accord with XML InfoSet [<xref linkend="InfoSet"/>] or
725            other XML data models relevant to particular APIs. As well as providing a mechanism for
726            efficient bulk data communication across the JNI boundary, ASMs potentially have a
727            number of other benefits in high-performance XML processing. <itemizedlist>
728               <listitem>
729                  <para>Prefetching. Commodity processors commonly support hardware and/or software
730                     prefetching to ensure that data is available in a processor cache when it is
731                     needed. In general, prefetching is most effective in conjunction with the
732                     continuous sequential memory access patterns associated with array
733                  processing.</para>
734               </listitem>
735               <listitem>
736                  <para>DMA. Some processing environments provide Direct Memory Access (DMA)
737                     controllers for block data movement in parallel with computation. For example,
738                     the Cell Broadband Engine uses DMA controllers to move the data to and from the
739                     local stores of the synergistic processing units. Arrays of contiguous data
740                     elements are well suited to bulk data movement using DMA.</para>
741               </listitem>
742               <listitem>
743                  <para>SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern
744                     processor instruction sets allow simultaneous application of particular
745                     instructions to sets of elements from parallel arrays. For effective use of
746                     SIMD capabilities, an SoA (Structure of Arrays) model is preferable to an AoS
747                     (Array of Structures) model. </para>
748               </listitem>
749               <listitem>
750                  <para>Multicore processors. Array-oriented processing can enable the effective
751                     distribution of work to the individual cores of a multicore system in two
752                     distinct ways. First, provided that sequential dependencies can be minimized or
753                     eliminated, large arrays can be divided into separate segments to be processed
754                     in parallel on each core. Second, pipeline parallelism can be used to implement
755                     efficient multipass processing with each pass consisting of a processing kernel
756                     with array-based input and array-based output. </para>
757               </listitem>
758               <listitem>
759                  <para>Streaming buffers for large XML documents. In the event that an XML document
760                     is larger than can be reasonably represented entirely within processor memory,
761                     a buffer-based streaming model can be applied to work through a document using
762                     sliding windows over arrays of elements stored in document order. </para>
763               </listitem>
764
765            </itemizedlist>
766         </para>
767
768         <section>
769            <title>Saxon-B TinyTree Example</title>
770            <para>As a first example of the ASM concept, current work includes a proof-of-concept to
771               deliver a high-performance replacement for building the TinyTree data structure used
772               in Saxon-B 6.5.5, an open-source XSLT 2.0 processor written in Java [<xref
773                  linkend="Saxon"/>]. Although XSLT stylesheets may be cached for performance, the
774               caching of source XML documents is typically not possible. A new TinyTree object to
775               represent the XML source document is thus commonly constructed with each new query so
776               that the overall performance of simple queries on large source XML documents is
777               highly dependent on TinyTree build time. Indeed, in a study of Saxon-SA, the
778               commercial version of Saxon, query time was shown to be dominated by TinyTree build
779               time [<xref linkend="Kay08"/>]. Similar performance results are demonstrable for the
780               Saxon-B XSLT processor as well. </para>
781            <para> The Saxon-B processor studied is a pure Java solution, converting a SAX (Simple
782               API for XML) event stream into the TinyTree Java object using the efficient Ælfred
783               XML parser [<xref linkend="AElfred"/>]. The TinyTree structure is itself an
784               array-based structure well suited to the ASM concept. It consists of six
785               parallel arrays of integers indexed on node number and containing one entry for each
786               node in the source document, with the exception of attribute and namespace nodes
787                  [<xref linkend="Saxon"/>]. Four of the arrays respectively provide node kind, name
788               code, depth, and next sibling information for each node, while the two others are
789               overloaded for different purposes based on node kind value. For example, in the
790               context of a text node, one of the overloaded arrays holds the text buffer offset
791               value whereas the other holds the text buffer length value. Attributes and namespaces
792               are represented using similar parallel arrays of values. The stored TinyTree values
793               are primarily primitive Java types; however, object types such as Java Strings and
794               Java StringBuffers are also used to hold attribute values and comment values
795               respectively. </para>
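            <para>The parallel-array layout described above can be illustrated with a minimal
               sketch. It is written in Python for brevity rather than Java, and the class name,
               field names, node-kind codes and the use of -1 for "no following sibling" are all
               assumptions made for illustration; they are not taken from the Saxon-B source.</para>
            <programlisting xml:space="preserve"><![CDATA[
```python
from array import array

# Hypothetical node-kind codes; Saxon's actual constants may differ.
ELEMENT, TEXT = 1, 3


class TinyTreeSketch:
    """Structure-of-arrays tree: each node owns one slot in six parallel int arrays."""

    def __init__(self):
        self.node_kind = array("i")
        self.name_code = array("i")
        self.depth = array("i")
        self.next_sibling = array("i")  # -1 means "no following sibling" (assumed)
        self.alpha = array("i")         # overloaded: text buffer offset for text nodes
        self.beta = array("i")          # overloaded: text buffer length for text nodes
        self.text_buffer = []           # characters of all text nodes, in document order
        self._open = []                 # most recent node seen at each depth

    def add_node(self, kind, name_code, depth, alpha=-1, beta=-1):
        """Append a node; nodes must arrive in document order."""
        node = len(self.node_kind)
        for column, value in ((self.node_kind, kind), (self.name_code, name_code),
                              (self.depth, depth), (self.next_sibling, -1),
                              (self.alpha, alpha), (self.beta, beta)):
            column.append(value)
        # Nodes deeper than the new one can no longer receive siblings.
        del self._open[depth + 1:]
        if len(self._open) > depth:
            self.next_sibling[self._open[depth]] = node
            self._open[depth] = node
        else:
            self._open.append(node)
        return node

    def add_text(self, text, depth):
        """Append a text node whose overloaded slots hold offset and length."""
        offset = len(self.text_buffer)
        self.text_buffer.extend(text)
        return self.add_node(TEXT, -1, depth, alpha=offset, beta=len(text))

    def text_of(self, node):
        start = self.alpha[node]
        return "".join(self.text_buffer[start:start + self.beta[node]])


def build_demo():
    """Build the tree for <doc><a>hi</a><b/></doc> using made-up namecodes."""
    tree = TinyTreeSketch()
    root = tree.add_node(ELEMENT, 100, 0)
    a = tree.add_node(ELEMENT, 101, 1)
    t = tree.add_text("hi", 2)
    b = tree.add_node(ELEMENT, 102, 1)
    return tree, (root, a, t, b)
```
]]></programlisting>
            <para>Because every field lives in its own contiguous array of integers, a native
               producer could in principle fill each array in bulk, which is the property that makes
               the structure a natural fit for the ASM concept.</para>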
796            <para> In addition to the TinyTree object, Saxon-B maintains a NamePool object which
797               represents a collection of XML name triplets. Each triplet is composed of a namespace
798               URI, a namespace prefix and a local name, and is encoded as an integer value known as a
799               namecode. Namecodes permit efficient name search and look-up using integer
800               comparison. Namecodes may also be subsequently decoded to recover namespace and local
801               name information. </para>
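            <para>The idea of interning name triplets as integers can be sketched as follows. This
               is a simplified illustration in Python: the class and method names are hypothetical,
               and the sequential numbering used here is an assumption that does not reproduce the
               bit layout of real Saxon namecodes.</para>
            <programlisting xml:space="preserve"><![CDATA[
```python
class NamePoolSketch:
    """Interns (namespace URI, prefix, local name) triplets as integer namecodes."""

    def __init__(self):
        self._codes = {}     # triplet -> namecode
        self._triplets = []  # namecode -> triplet

    def allocate(self, uri, prefix, local):
        """Return the namecode for a triplet, assigning a new one on first use."""
        triplet = (uri, prefix, local)
        code = self._codes.get(triplet)
        if code is None:
            code = len(self._triplets)
            self._codes[triplet] = code
            self._triplets.append(triplet)
        return code

    def decode(self, code):
        """Recover the (URI, prefix, local name) triplet from a namecode."""
        return self._triplets[code]
```
]]></programlisting>
            <para>Once names are interned, testing whether two nodes carry the same name reduces to
               comparing two integers.</para>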
802            <para> Using the Parabix ILAX interface, a high-performance reimplementation of TinyTree
803               and NamePool data structures was built to compare with the Saxon-B implementation. In
804               fact, two functionally equivalent versions of the ASM Java class were constructed. An
805               initial version was based on a set of primitive Java arrays created
806               and allocated in the Java heap space via JNI New&lt;PrimitiveType&gt;Array
807               method calls. In this version, the JVM garbage collector is aware of all memory
808               allocated in the native code. However, in this approach, large array copy operations
809               limited overall performance to approximately a 2X gain over the Saxon-B build time. </para>
810            <para>To further address the performance penalty imposed by copying large array values,
811               a second version of the ASM Java object was constructed based on natively backed
812               Direct Memory Byte Buffers [<xref linkend="JNI"/>]. In this version the JVM garbage
813               collector is unaware of any native memory resources backing the Direct Memory Byte
814               Buffers. Large JNI-based copy operations are avoided; however, system memory must be
815               explicitly deallocated via a Java native method call. Using this approach, our
816               preliminary results show an approximate total 2.5X gain over Saxon-B build time.
817            </para>
818         </section>
819      </section>
820   </section>
821
822
823   <section>
824      <title>Compiler Technology</title>
825
826      <para> An important focus of our recent work is on the development of compiler technology to
827         automatically generate the low-level SIMD code necessary to implement bit stream processing
828         given suitable high-level specifications. This has several potential benefits. First, it
829         can eliminate the tedious and error-prone programming of bit stream operations in terms of
830         register-at-a-time SIMD operations. Second, compilation technology can automatically employ
831         a variety of performance improvement techniques that are difficult to apply manually. These
832         include algorithms for instruction scheduling and register allocation as well as
833         optimization techniques for common subexpression elimination and register
834         rematerialization among others. Third, compiler technology makes it easier to make changes
835         to the low-level code for reasons of perfective or adaptive maintenance.</para>
836
837      <para>Beyond these reasons, compiler technology also offers the opportunity for retargeting
838         the generation of code to accommodate different processor architectures and API
839         requirements. Strategies for efficient parallel bit stream code can vary considerably
840         depending on processor resources such as the number of registers available, the particular
841         instruction set architecture supported, the size of L1 and L2 data caches, the number of
842         available cores and so on. Separate implementation of custom code for each processor
843         architecture would thus be likely to be prohibitively expensive, prone to errors and
844         inconsistencies and difficult to maintain. Using compilation technology, however, the idea
845         would be to implement a variety of processor-specific back-ends all using a common front
846         end based on parallel bit streams. </para>
847
848      <section>
849         <title>Character Class Compiler</title>
850
851         <para>The first compiler component that we have implemented is a character class compiler,
852            capable of generating all the bit stream logic necessary to produce a set of lexical
853            item streams each corresponding to some particular set of characters to be recognized.
854            By taking advantage of common patterns between characters within classes, and special
855            optimization logic for recognizing character-class ranges, our existing compiler is able
856            to generate well-optimized code for complex sets of character classes involving numbers
857            of special characters as well as characters within specific sets of ranges. </para>
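         <para>The kind of logic such a compiler emits can be sketched by hand. The Python
            illustration below is not output of the actual compiler: it models each bit stream as
            an unbounded integer, assumes that bit i of a stream corresponds to input position i
            and that basis stream k holds bit k of each byte with k = 0 the least significant
            (a numbering chosen here for convenience), and uses hypothetical function names.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def basis_streams(data):
    """Transpose bytes into 8 bit streams: bit i of stream k is bit k of data[i]."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                streams[k] |= 1 << pos
    return streams


def char_stream(basis, length, char):
    """Mark positions equal to `char`: AND of each basis stream or its complement."""
    mask = (1 << length) - 1
    code = ord(char)
    result = mask
    for k in range(8):
        result &= basis[k] if (code >> k) & 1 else ~basis[k]
    return result & mask


def digit_stream(basis, length):
    """Mark positions in the range '0'..'9' (0x30..0x39) with range-specific logic."""
    mask = (1 << length) - 1
    b = basis
    high_is_3 = ~b[7] & ~b[6] & b[5] & b[4]   # high nibble equals 0011
    low_le_9 = ~(b[3] & (b[2] | b[1]))        # low nibble is not 1010..1111
    return high_is_3 & low_le_9 & mask


def show(stream, length):
    """Render a stream with position 0 on the left."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(length))
```
]]></programlisting>
         <para>The range test needs only a handful of Boolean operations regardless of how many
            characters the range contains, which is the sort of saving that range-aware
            optimization is intended to capture.</para>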
858
859      </section>
860      <section>
861         <title>Regular Expression Compilation</title>
862
863         <para>Based on the character class compiler, we are currently investigating the
864            construction of a regular expression compiler that can implement bit-stream based
865            parallel regular-expression matching similar to that described previously for parallel
866            parsing by bitstream addition. This compiler works with the assumption that bitstream
867            regular-expression definitions are deterministic; no backtracking is permitted with the
868            parallel bit stream representation. In XML applications, this compiler is primarily
869            intended to enforce regular-expression constraints on string datatype specifications
870            found in XML schema. </para>
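         <para>As a rough illustration of matching by bitstream addition, the following Python
            sketch matches the pattern of one or more digits followed by a given terminator
            character. Streams are modeled as unbounded integers with bit 0 as the first character,
            so that carries propagate forward through the text; the function names are
            hypothetical and the sketch is not the compiler described here.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def class_stream(text, chars):
    """Bit i is set when text[i] is one of `chars` (bit 0 is the first character)."""
    return sum(1 << i for i, ch in enumerate(text) if ch in chars)


def scan_thru(cursors, cls):
    """Advance each cursor past the run of `cls` positions it sits on, in one addition."""
    return (cursors + cls) & ~cls


def match_digits_then(text, terminator):
    """Return positions of terminators that immediately follow a run of digits."""
    digits = class_stream(text, "0123456789")
    term = class_stream(text, terminator)
    starts = digits & ~(digits << 1)        # first position of every digit run
    after_run = scan_thru(starts, digits)   # position just past every run
    hits = after_run & term                 # runs directly followed by the terminator
    return [i for i in range(len(text)) if (hits >> i) & 1]
```
]]></programlisting>
         <para>All digit runs in the text are scanned by the same single addition, and no cursor
            ever moves backward, which reflects the deterministic, non-backtracking assumption
            stated above.</para>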
871
872      </section>
873
874      <section>
875         <title>Unbounded Bit Stream Compilation</title>
876
877         <para>The Catalog of XML Bit Streams presented earlier consists of a set of abstract,
878            unbounded bit streams, each in one-to-one correspondence with input bytes of a text
879            file. Determining how these bit streams are implemented using fixed-width SIMD
880            registers, and possibly processed in fixed-length buffers that represent some multiple
881            of the register width, is a source of considerable programming complexity. The general
882            goal of our compilation strategy in this case is to allow operations to be programmed in
883            terms of unbounded bit streams and then automatically reduced to efficient low-level
884            code with the application of a systematic code generation strategy for handling block
885            and buffer boundary crossing. This is work currently in progress. </para>
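         <para>The boundary-crossing problem can be made concrete with a small Python sketch that
            adds two bit streams one block at a time and passes the carry from each block to the
            next. The 128-bit block width and all function names are assumptions made for
            illustration; this is one possible reduction, not the code generation strategy under
            development.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
BLOCK_BITS = 128                      # assumed SIMD register width
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def to_blocks(stream, n_blocks):
    """Split an unbounded stream (a Python int) into fixed-width blocks, lowest first."""
    return [(stream >> (k * BLOCK_BITS)) & BLOCK_MASK for k in range(n_blocks)]


def from_blocks(blocks):
    """Reassemble fixed-width blocks into a single unbounded stream."""
    return sum(block << (k * BLOCK_BITS) for k, block in enumerate(blocks))


def add_blockwise(a_blocks, b_blocks):
    """Add two streams block by block, carrying across each block boundary."""
    carry = 0
    out = []
    for a, b in zip(a_blocks, b_blocks):
        total = a + b + carry
        out.append(total & BLOCK_MASK)
        carry = total >> BLOCK_BITS   # 0 or 1, consumed by the next block
    return out, carry
```
]]></programlisting>
         <para>A run of marked positions that straddles a block boundary is handled correctly only
            because the carry out of one block is fed into the next; the final carry must likewise
            be saved when a buffer ends so that it can seed the first block of the following
            buffer.</para>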
886
887      </section>
888   </section>
889
890   <section>
891      <title>Conclusion</title>
892      <para>Parallel bit stream technology offers the opportunity to dramatically speed up the core
893         XML processing components used to implement virtually any XML API. Character validation and
894         transcoding, whitespace processing, and parsing up to and including the full validation of tag
895         syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the
896         positions of all element names, attribute names and attribute values can also be produced,
897         followed by fast bit scan operations to generate position and length values. Beyond bit
898         streams, byte-oriented SIMD processing of names and numerals can also accelerate
899         performance beyond sequential byte-at-a-time methods. </para>
900      <para>Advances in processor architecture are likely to further amplify the performance of
901         parallel bit stream technology over traditional byte-at-a-time processing over the next
902         decade. Improvements to SIMD register width, register complement and operation format can
903         all result in further gains. New SIMD instruction set features such as inductive doubling
904         support, parallel extract and deposit instructions, bit interleaving and scatter/gather
905         capabilities should also result in significant speed-ups. Leveraging the intraregister
906         parallelism of parallel bit stream technology within SIMD registers to take advantage of intrachip
907         parallelism on multicore processors should accelerate processing further. </para>
908      <para>Technology transfer using a patent-based open-source business model is a further goal of
909         our work with a view to widespread deployment of parallel bit stream technology in XML
910         processing stacks implementing a variety of APIs. The feasibility of substantial
911         performance improvement in replacement of technology implementing existing APIs has been
912         demonstrated even in complex software architectures involving delivery of performance
913         benefits across the JNI boundary. We are seeking to accelerate these deployment efforts
914         both through the development of compiler technology to reliably apply these methods to a
915         variety of architectures and through the identification of interested collaborators using open-source
916         or commercial models. </para>
917   </section>
918   
919   <section>
920      <title>Acknowledgments</title>
921      <para>This work is supported in part by research grants and scholarships from the Natural
922         Sciences and Engineering Research Council of Canada, the Mathematics of Information
923         Technology and Complex Systems Network and the British Columbia Innovation Council. </para>
924      <para>We thank our colleague Dan Lin (Linda) for her work in high-performance symbol table
925         processing. </para>
926   </section>
927   
928   <bibliography>
929      <title>Bibliography</title>
930      <bibliomixed xml:id="XMLChip09" xreflabel="Leventhal and Lemoine 2009">Leventhal, Michael and
931         Eric Lemoine. 2009. The XML chip at 6 years. Proceedings of International Symposium on
932         Processing XML Efficiently 2009, Montréal.</bibliomixed>
933      <bibliomixed xml:id="Datapower09" xreflabel="Salz, Achilles and Maze 2009">Salz, Richard,
934         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
935         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
936         Efficiently 2009, Montréal.</bibliomixed>
937      <bibliomixed xml:id="PPoPP08" xreflabel="Cameron 2007">Cameron, Robert D. 2007. A Case Study
938         in SIMD Text Processing with Parallel Bit Streams: UTF-8 to UTF-16 Transcoding. Proceedings
939         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
940         Lake City, Utah. On the Web at <link>http://research.ihost.com/ppopp08/</link>.</bibliomixed>
941      <bibliomixed xml:id="CASCON08" xreflabel="Cameron, Herdy and Lin 2008">Cameron, Robert D.,
942         Kenneth S. Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
943         Technology. Proceedings of CASCON 2008,
944         Toronto.</bibliomixed>
945      <bibliomixed xml:id="SVGOpen08" xreflabel="Herdy, Burggraf and Cameron 2008">Herdy, Kenneth
946         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
947         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
948         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
949         Nuremberg. On the Web at
950            <link>http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/</link>.</bibliomixed>
951      <bibliomixed xml:id="Ross06" xreflabel="Ross 2006">Ross, Kenneth A. 2006. Efficient hash
952         probes on modern processors. Proceedings of ICDE 2006, Atlanta. On the Web at
953            <link>www.cs.columbia.edu/~kar/pubsk/icde2007.pdf</link>.</bibliomixed>
954      <bibliomixed xml:id="ASPLOS09" xreflabel="Cameron and Lin 2009">Cameron, Robert D. and Dan
955         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
956         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</bibliomixed>
957      <bibliomixed xml:id="Wu08" xreflabel="Wu et al. 2008">Wu, Yu, Qi Zhang, Zhiqiang Yu and
958         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
959         Proceedings of Balisage 2008, Montréal. On the Web at
960            <link>http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html</link>.</bibliomixed>
961      <bibliomixed xml:id="u8u16" xreflabel="Cameron 2008">u8u16 - A High-Speed UTF-8 to UTF-16
962         Transcoder Using Parallel Bit Streams. Technical Report 2007-18, School of Computing
963         Science, Simon Fraser University, June 21 2007.</bibliomixed>
964      <bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth
965         Edition) W3C Recommendation 26 November 2008. On the Web at
966            <link>http://www.w3.org/TR/REC-xml/</link>.</bibliomixed>
967      <bibliomixed xml:id="Unicode" xreflabel="Unicode">The Unicode Consortium. 2009. On the Web at
968            <link>http://unicode.org/</link>.</bibliomixed>
969      <bibliomixed xml:id="Pex06" xreflabel="Hilewitz and Lee 2006"> Hilewitz, Y. and Ruby B. Lee. 2006.
970         Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions.
971         Proceedings of the IEEE 17th International Conference on Application-Specific Systems,
972         Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</bibliomixed>
973      <bibliomixed xml:id="InfoSet" xreflabel="XML Infoset">XML Information Set (Second Edition) W3C
974         Recommendation 4 February 2004. On the Web at
975         <link>http://www.w3.org/TR/xml-infoset/</link>.</bibliomixed>
976      <bibliomixed xml:id="Saxon" xreflabel="Saxon">SAXON The XSLT and XQuery Processor. On the Web
977         at <link>http://saxon.sourceforge.net/</link>.</bibliomixed>
978      <bibliomixed xml:id="Kay08" xreflabel="Kay 2008"> Kay, Michael Y. 2008. Ten Reasons Why Saxon
979         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</bibliomixed>
980      <bibliomixed xml:id="AElfred" xreflabel="Ælfred"> The Ælfred XML Parser. On the Web at
981            <link>http://saxon.sourceforge.net/aelfred.html</link>.</bibliomixed>
982      <bibliomixed xml:id="JNI" xreflabel="Hitchens 2002">Hitchens, Ron. Java NIO. O'Reilly, 2002.</bibliomixed>
983      <bibliomixed xml:id="Expat" xreflabel="Expat">The Expat XML Parser.
984            <link>http://expat.sourceforge.net/</link>.</bibliomixed>
985   </bibliography>
986
987</article>