source: docs/Balisage09/Bal2009came0601.xml @ 283

Last change on this file since 283 was 283, checked in by ksherdy, 10 years ago

Update Deletion Streams, Error Streams, Lexical Items Streams, other minor changes. Requires table generation.

1<?xml version="1.0" encoding="UTF-8"?>
2<!-- MODIFIED DTD LOCATION -->
3<!DOCTYPE article SYSTEM "balisage-1-1.dtd">
4<article xmlns="http://docbook.org/ns/docbook" version="5.0-subset Balisage-1.1"
5   xml:id="HR-23632987-8973">
6   <title>Parallel Bit Stream Technology as a Foundation for XML Parsing Performance</title>
7   <info>
8      <confgroup>
9         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
10            Space, Time, or Bandwidth</conftitle>
11         <confdates>August 10 2009</confdates>
12      </confgroup>
13      <abstract>
14         <para>By first transforming the octets (bytes) of XML texts into eight parallel bit
15            streams, the SIMD features of commodity processors can be exploited for parallel
16            processing of blocks of 128 input bytes at a time. Established transcoding and parsing
17            techniques are reviewed followed by new techniques including parsing with bitstream
18            addition. Further opportunities are discussed in light of expected advances in CPU
19            architecture and compiler technology. Implications for various APIs and information
20            models are presented, as well as opportunities for collaborative open-source
21         development.</para>
22      </abstract>
23      <author>
24         <personname>
25            <firstname>Rob</firstname>
26            <surname>Cameron</surname>
27         </personname>
28         <personblurb>
29            <para>Dr. Rob Cameron is Professor and Director of Computing Science at Simon Fraser
30               University. With a broad spectrum of research interests related to programming
31               languages, software engineering and sociotechnical design of public computing
32               infrastructure, he has recently been focusing on high performance text processing
33               using parallel bit stream technology and its applications to XML. He is also a
34               patentleft evangelist, advocating university-based technology transfer models
35               dedicated to free use in open source. </para>
36
37         </personblurb>
38         <affiliation>
39            <jobtitle>Professor of Computing Science</jobtitle>
40            <orgname>Simon Fraser University</orgname>
41         </affiliation>
42         <email>cameron@cs.sfu.ca</email>
43      </author>
44      <author>
45         <personname>
46            <firstname>Ken</firstname>
47            <surname>Herdy</surname>
48         </personname>
49         <personblurb>
50            <para> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
51               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
52               of Science in Computing Science with a Certificate in Spatial Information Systems at
53               Simon Fraser University in 2005. </para>
54            <para> Ken is currently pursuing graduate studies in Computing Science at Simon Fraser
55               University with industrial scholarship support from the Natural Sciences and
56               Engineering Research Council of Canada, the Mathematics of Information Technology and
57               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
58               of the principal techniques that may be used to improve XML processing performance in
59               the context of the Geography Markup Language (GML). </para>
60
61         </personblurb>
62         <affiliation>
63            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
64            <orgname>Simon Fraser University </orgname>
65         </affiliation>
66         <email>ksherdy@cs.sfu.ca</email>
67      </author>
68      <author>
69         <personname>
70            <firstname>Ehsan</firstname>
71            <surname>Amiri</surname>
72         </personname>
73         <personblurb>
74            <para>Ehsan Amiri is a PhD student of Computer Science at Simon Fraser University.
75               Before that he studied at Sharif University of Technology, Tehran, Iran. While his
76               graduate research has been focused on theoretical problems like fingerprinting, Ehsan
77               has worked on some software projects like development of a multi-node firewall as
78               well. More recently he has been developing compiler technology for automatic
79               generation of bit stream processing code. </para>
80
81         </personblurb>
82         <affiliation>
83            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
84            <orgname>Simon Fraser University</orgname>
85         </affiliation>
86         <email>eamiri@cs.sfu.ca</email>
87      </author>
88      <legalnotice>
89         <para>Copyright &#x000A9; 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
90            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
91            Works 2.5 Canada License.</para>
92      </legalnotice>
93      <keywordset role="author">
94         <keyword/>
95         <keyword/>
96         <keyword/>
97      </keywordset>
98   </info>
99   <section>
100      <title>Introduction</title>
101      <para> While particular XML applications may benefit from special-purpose hardware such as XML
102         chips [<xref linkend="XMLChip09"/>] or appliances [<xref linkend="Datapower09"/>], the bulk
103         of the world's XML processing workload will continue to be handled by XML software stacks
104         on commodity processors. Exploiting the SIMD capabilities of such processors, for example the
105         SSE instructions of x86 chips, parallel bit stream technology offers the potential of
106         dramatic improvement over byte-at-a-time processing for a variety of XML processing tasks.
107         Character set issues such as Unicode validation and transcoding [<xref linkend="PPoPP08"
108         />], normalization of line breaks and white space and XML character validation can be
109         handled fully in parallel using this representation. Lexical item streams, such as the bit
110         stream marking the positions of opening angle brackets, can also be formed in parallel.
111         Bit-scan instructions of commodity processors may then be used on lexical item streams to
112         implement rapid single-instruction scanning across variable-length multi-byte text blocks
113         as in the Parabix XML parser [<xref linkend="CASCON08"/>]. Overall, these techniques may be
114         combined to yield end-to-end performance that may be 1.5X to 15X faster than alternatives
115            [<xref linkend="SVGOpen08"/>].</para>
116      <para>Continued research in parallel bit stream techniques as well as more conventional
117         application of SIMD techniques in XML processing offers further prospects for improvement
118         of core XML components as well as for tackling performance-critical tasks further up the
119         stack. A newly prototyped technique for parallel tag parsing using bitstream addition is
120         expected to improve parsing performance even beyond that achieved using sequential bit
121         scans. Several techniques for improved symbol table performance are being investigated,
122         including parallel hash value calculation and length-based sorting using the cheap length
123         determination afforded by bit scans. To deliver the benefits of parallel bit stream
124         technology to the Java world, we are developing Array Set Model (ASM) representations of
125         XML Infoset and other XML information models for efficient transmission across the JNI
126         boundary.</para>
127
128      <para>Amplifying these software advances, continuing hardware advances in commodity processors
129         increase the relative advantage of parallel bit stream techniques over traditional
130         byte-at-a-time processing. For example, the Intel Core architecture improved SSE processing
131         to give superscalar execution of bitwise logic operations (3 instructions per cycle vs. 1
132         in Pentium 4). Upcoming 256-bit AVX technology extends the register set and replaces
133         destructive two-operand instructions with a nondestructive three-operand form. General
134         purpose programming on graphics processing units (GPGPU), such as the upcoming 512-bit
135         Larrabee processor may also be useful for XML applications using parallel bit streams. New
136         instruction set architectures may also offer dramatic improvements in core algorithms.
137         Using the relatively simple extensions to support the principle of inductive doubling, a 3X
138         improvement in several core parallel bit stream algorithms may be achieved [<xref
139            linkend="ASPLOS09"/>]. Other possibilities include direct implementation of parallel
140         extract and parallel deposit (pex/pdep) instructions [<xref linkend="Pex06"/>], and
141         bit-level interleave operations as in Larrabee, each of which would have important
142         application to parallel bit stream processing.</para>
143
144      <para>Further prospects for XML performance improvement arise from leveraging the
145         intraregister parallelism of parallel bit stream technology to exploit the interchip
146         parallelism of multicore computing. Parallel bit stream techniques can support multicore
147         parallelism in both data partitioning and task partitioning models. For example, the
148         datasection partitioning approach of Wu, Zhang, Yu and Li may be used to partition blocks
149         for speculative parallel parsing on separate cores followed by a postprocessing step to
150         join partial S-trees [<xref linkend="Wu08"/>].</para>
151
152      <para>In our view, the established and expected performance advantages of parallel bit stream
153         technology over traditional byte-at-a-time processing are so compelling that parallel bit
154         stream technology should ultimately form the foundation of every high-performance XML
155         software stack. We envision a common high-performance XML kernel that may be customized to
156         a variety of processor architectures and that supports a wide range of existing and new XML
157         APIs. Widespread deployment of this technology should greatly benefit the XML community in
158         addressing both the deserved and undeserved criticism of XML on performance grounds. A
159         further benefit of improved performance is a substantial greening of XML technologies.</para>
160
161      <para>To complement our research program investigating fundamental algorithms and issues in
162         high-performance XML processing, our work also involves development of open source software
163         implementing these algorithms, with a goal of full conformance to relevant specifications.
164         From the research perspective, this approach is valuable in ensuring that the full
165         complexity of required XML processing is addressed in reporting and assessing processing
166         results. However, our goal is also to use this open source software as a basis of
167         technology transfer. A Simon Fraser University spin-off company, called International
168         Characters, Inc., has been created to commercialize the results of this work using a
169         patent-based open source model.</para>
170
171      <para>To date, we have not been successful in establishing a broader community of
172         participation with our open source code base. Within open-source communities, there is
173         often a general antipathy towards software patents; this may limit engagement with our
174         technology, even though it has been dedicated for free use in open source. </para>
175
176      <para>A further complication is the inherent difficulty of SIMD programming in general, and
177         parallel bit stream programming in particular. Considerable work is required with each new
178         algorithmic technique being investigated as well as in retargeting our techniques for each
179         new development in SIMD and multicore processor technologies. To address these concerns, we
180         have increasingly shifted the emphasis of our research program towards compiler technology
181         capable of generating parallel bit stream code from higher-level specifications.</para>
182   </section>
183
184   <section>
185      <title>A Catalog of Parallel Bit Streams for XML</title>
186      <section>
187         <title>Introduction</title>
188         <para>In this section, we introduce the fundamental concepts of parallel bit stream
189            technology and present a comprehensive catalog of parallel bit streams for use in XML
190            processing. In presenting this catalog, the focus is on the specification of the bit
191            streams as data streams in one-to-one correspondence with the character code units of an
192            input XML stream. The goal is to define these bit streams in the abstract without
193            initially considering memory layouts, register widths or other issues related to
194            particular target architectures. In cataloging these techniques, we also hope to convey
195            a sense of the breadth of applications of parallel bit stream technology to XML
196            processing tasks. </para>
197      </section>
198
199      <section>
200         <title>Basis Bit Streams</title>
201         <para>Given a byte-oriented text stream represented in UTF-8, for example, we define a
202            transform representation of this text consisting of a set of eight parallel bit streams
203            for the individual bits of each byte. Thus, the <code>Bit0</code> stream is the stream
204            of bits consisting of bit 0 of each byte in the input byte stream, <code>Bit1</code> is
205            the bit stream consisting of bit 1 of each byte in the input stream, and so on, with bit 0
206            denoting the most significant bit. The streams <code>Bit0</code> through <code>Bit7</code> are known as the <emphasis>basis
207               streams</emphasis> of the parallel bit stream representation. The following table
208            shows an example XML character stream together with its representation as a set of 8
209            basis streams. <table>
210               <caption>
211                  <para>XML Character Stream Transposition.</para>
212               </caption>
213               <colgroup>
214                  <col align="left" valign="top"/>
215                  <col align="left" valign="top"/>
216                  <col align="left" valign="top"/>
217                  <col align="left" valign="top"/>
218                  <col align="left" valign="top"/>
219                  <col align="left" valign="top"/>
220               </colgroup>
221               <tbody>
222                  <tr valign="top">
223                     <td>XML</td>
224                     <td>
225                        <code>&lt;</code>
226                     </td>
227                     <td>
228                        <code>t</code>
229                     </td>
230                     <td>
231                        <code>a</code>
232                     </td>
233                     <td>
234                        <code>g</code>
235                     </td>
236                     <td>
237                        <code>/</code>
238                     </td>
239                     <td>
240                        <code>&gt;</code>
241                     </td>
242                  </tr>
243                  <tr valign="top">
244                     <td>ASCII</td>
245                     <td>
246                        <code>00111100</code>
247                     </td>
248                     <td>
249                        <code>01110100</code>
250                     </td>
251                     <td>
252                        <code>01100001</code>
253                     </td>
254                     <td>
255                        <code>01100111</code>
256                     </td>
257                     <td>
258                        <code>00101111</code>
259                     </td>
260                     <td>
261                        <code>00111110</code>
262                     </td>
263                  </tr>
264                  <tr valign="top">
265                     <td>Bit0</td>
266                     <td>
267                        <code>0</code>
268                     </td>
269                     <td>
270                        <code>0</code>
271                     </td>
272                     <td>
273                        <code>0</code>
274                     </td>
275                     <td>
276                        <code>0</code>
277                     </td>
278                     <td>
279                        <code>0</code>
280                     </td>
281                     <td>
282                        <code>0</code>
283                     </td>
284                  </tr>
285                  <tr valign="top">
286                     <td>Bit1</td>
287                     <td>
288                        <code>0</code>
289                     </td>
290                     <td>
291                        <code>1</code>
292                     </td>
293                     <td>
294                        <code>1</code>
295                     </td>
296                     <td>
297                        <code>1</code>
298                     </td>
299                     <td>
300                        <code>0</code>
301                     </td>
302                     <td>
303                        <code>0</code>
304                     </td>
305                  </tr>
306                  <tr valign="top">
307                     <td>Bit2</td>
308                     <td>
309                        <code>1</code>
310                     </td>
311                     <td>
312                        <code>1</code>
313                     </td>
314                     <td>
315                        <code>1</code>
316                     </td>
317                     <td>
318                        <code>1</code>
319                     </td>
320                     <td>
321                        <code>1</code>
322                     </td>
323                     <td>
324                        <code>1</code>
325                     </td>
326                  </tr>
327                  <tr valign="top">
328                     <td>Bit3</td>
329                     <td>
330                        <code>1</code>
331                     </td>
332                     <td>
333                        <code>1</code>
334                     </td>
335                     <td>
336                        <code>0</code>
337                     </td>
338                     <td>
339                        <code>0</code>
340                     </td>
341                     <td>
342                        <code>0</code>
343                     </td>
344                     <td>
345                        <code>1</code>
346                     </td>
347                  </tr>
348                  <tr valign="top">
349                     <td>Bit4</td>
350                     <td>
351                        <code>1</code>
352                     </td>
353                     <td>
354                        <code>0</code>
355                     </td>
356                     <td>
357                        <code>0</code>
358                     </td>
359                     <td>
360                        <code>0</code>
361                     </td>
362                     <td>
363                        <code>1</code>
364                     </td>
365                     <td>
366                        <code>1</code>
367                     </td>
368                  </tr>
369                  <tr valign="top">
370                     <td>Bit5</td>
371                     <td>
372                        <code>1</code>
373                     </td>
374                     <td>
375                        <code>1</code>
376                     </td>
377                     <td>
378                        <code>0</code>
379                     </td>
380                     <td>
381                        <code>1</code>
382                     </td>
383                     <td>
384                        <code>1</code>
385                     </td>
386                     <td>
387                        <code>1</code>
388                     </td>
389                  </tr>
390                  <tr valign="top">
391                     <td>Bit6</td>
392                     <td>
393                        <code>0</code>
394                     </td>
395                     <td>
396                        <code>0</code>
397                     </td>
398                     <td>
399                        <code>0</code>
400                     </td>
401                     <td>
402                        <code>1</code>
403                     </td>
404                     <td>
405                        <code>1</code>
406                     </td>
407                     <td>
408                        <code>1</code>
409                     </td>
410                  </tr>
411                  <tr valign="top">
412                     <td>Bit7</td>
413                     <td>
414                        <code>0</code>
415                     </td>
416                     <td>
417                        <code>0</code>
418                     </td>
419                     <td>
420                        <code>1</code>
421                     </td>
422                     <td>
423                        <code>1</code>
424                     </td>
425                     <td>
426                        <code>1</code>
427                     </td>
428                     <td>
429                        <code>0</code>
430                     </td>
431                  </tr>
432               </tbody>
433            </table>
434         </para>
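         <para>As a minimal sketch of this definition, the following Python fragment derives the
            eight basis streams for the example text in the table above. The function name
            <code>transpose</code> is hypothetical, the streams are shown as strings of 0 and 1
            characters for readability, and the byte-at-a-time loop illustrates only the result of
            transposition, not the SIMD algorithms discussed in this paper.</para>
         <programlisting><![CDATA[
```python
def transpose(data):
    """Return the eight basis bit streams, Bit0 (most significant) through Bit7."""
    return [
        "".join(str((byte >> (7 - k)) & 1) for byte in data)
        for k in range(8)
    ]


basis = transpose(b"<tag/>")
for k, stream in enumerate(basis):
    print(f"Bit{k}: {stream}")
```
]]></programlisting>
         <para>Each printed stream has one bit per input byte, so the six-character example yields
            eight streams of six bits each, matching the rows of the table.</para>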
435         <para> Depending on the features of a particular processor architecture, there are a number
436            of algorithms for transposition to parallel bit stream form. Several of these algorithms
437            employ a three-stage structure. In the first stage, the input byte stream is divided
438            into a pair of half-length streams consisting of four bits for each byte, for example,
439            one stream for the high nybble of each byte and another for the low nybble of each byte.
440            In the second stage, these streams of four bits per byte are each divided into streams
441            consisting of two bits per original byte, for example streams for the
442            <code>Bit0/Bit1</code>, <code>Bit2/Bit3</code>, <code>Bit4/Bit5</code>, and
443               <code>Bit6/Bit7</code> pairs. In the final stage, the streams are further subdivided
444            into the individual bit streams. </para>
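         <para>The logical structure of the three stages described above can be sketched in Python
            as follows. This is an illustration under stated assumptions rather than any of the
            actual algorithms: the helper names <code>split</code> and
            <code>transpose_three_stage</code> are hypothetical, and each stream is modelled as a
            plain list of small integers instead of packed SIMD registers.</para>
         <programlisting><![CDATA[
```python
def split(stream, width):
    """Split a stream of width-bit values into streams of their high and low halves."""
    half = width // 2
    mask = (1 << half) - 1
    high = [value >> half for value in stream]
    low = [value & mask for value in stream]
    return high, low


def transpose_three_stage(data):
    """Bytes -> nybble streams -> bit-pair streams -> individual bit streams."""
    streams = [list(data)]            # a single stream of 8-bit values
    for width in (8, 4, 2):           # one pass per stage
        halves = []
        for stream in streams:
            halves.extend(split(stream, width))
        streams = halves
    return ["".join(str(bit) for bit in stream) for stream in streams]


for k, stream in enumerate(transpose_three_stage(b"<tag/>")):
    print(f"Bit{k}: {stream}")
```
]]></programlisting>
         <para>After the first pass there are two streams (high and low nybbles), after the second
            there are four (the bit pairs), and after the third there are the eight basis streams
            in order from <code>Bit0</code> to <code>Bit7</code>.</para>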
445         <para> Using SIMD capabilities, this process is quite efficient, with an amortized cost of
446            1.1 CPU cycles per input byte on Intel Core 2 with SSE, or 0.6 CPU cycles per input byte
447            on PowerPC G4 with AltiVec. With future advances in processor technology, this
448            transposition overhead is expected to decrease, possibly by taking advantage of upcoming
449            parallel extract (pex) instructions on Intel processors. In the ideal case, only 24
450            instructions would be needed to transpose a block of 128 input bytes held in 128-bit SSE
451            registers, given the inductive doubling instruction set architecture, representing an
452            overhead of less than 0.2 instructions per input byte. </para>
453      </section>
454
455      <section>
456         <title>General Streams</title>
457         <para>This section describes the bit streams which support operations that are common to
458            many XML processing tasks.</para>
459
460         <section>
461            <title>Deletion Mask Streams</title>
462            <para>The DelMask (deletion mask) stream marks character code unit positions for
463               deletion. Since the deletion operation arises in many stages of XML processing,
464               positions are initially marked for deletion, and then subsequently deleted in
465               parallel, using a bitwise ORing of a number of deletion masks. A single invocation of
466               a SIMD based parallel deletion algorithm may perform deletions accumulated across a
467               number of XML processing stages. Several algorithms to delete bits at positions
468               marked by DelMask are possible [<xref linkend="u8u16"/>]. </para>
469            <para> As an example, deletion arises in the replacement of predefined entities, such as
470               in the replacement of the &amp;amp; entity with the &amp; character. Further
471               deletion masks, such as masks resulting from UTF-8 to UTF-16 transcoding, XML
472               end-of-line handling, and CDATA section delimiter processing, may then be ORed
473               with the predefined entity deletion mask for accumulation.</para>
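            <para>The accumulation of deletion masks can be illustrated with a small Python sketch.
               All function names here are hypothetical, masks are modelled as Python integers in
               which bit <emphasis>i</emphasis> corresponds to character position
               <emphasis>i</emphasis>, and the masks are built with simple sequential loops for
               clarity, whereas a bit stream implementation would derive them in parallel. The
               sample text is likewise an assumption chosen for the illustration.</para>
            <programlisting><![CDATA[
```python
def amp_entity_mask(text):
    """Mark the 'amp;' that follows each '&' of '&amp;'; the '&' itself is kept."""
    mask = 0
    i = text.find("&amp;")
    while i != -1:
        mask |= 0b1111 << (i + 1)
        i = text.find("&amp;", i + 5)
    return mask


def crlf_mask(text):
    """Mark the CR of each CR LF pair so that only the LF survives."""
    mask = 0
    for i in range(len(text) - 1):
        if text[i] == "\r" and text[i + 1] == "\n":
            mask |= 1 << i
    return mask


def delete_marked(text, del_mask):
    """Remove every character whose position is set in del_mask."""
    return "".join(ch for i, ch in enumerate(text) if not (del_mask >> i) & 1)


text = "fish &amp; chips\r\nsalt &amp; vinegar"
del_mask = amp_entity_mask(text) | crlf_mask(text)   # accumulate the masks
result = delete_marked(text, del_mask)               # one deletion pass
print(repr(result))
```
]]></programlisting>
            <para>The two masks come from unrelated processing stages, yet a single call to
               <code>delete_marked</code> applies both, which is the point of accumulating masks
               before deleting.</para>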
474            <para>The following table provides an example of generating a DelMask in the context of
475               bit stream based parsing of well-formed character references and predefined entities.
476               Character reference and predefined entity bit stream definitions are provided below.<!-- PARABIX2_1  -->
477               <!--
478            <table>
479               <caption>
480                  <para>DelMask Stream Generation</para>
481               </caption>
482            </table>
483            -->
484            </para>
485         </section>
486
487         <section>
488            <title>Error Flag Streams </title>
489            <para>Error flag streams indicate the character code unit positions of syntactic
490               errors. XML processing tasks that benefit from the marking of error positions
491               include UTF-8 character sequence validation and XML parsing [<xref linkend="u8u16"
492               />].</para>
493            <para>The following table provides an example of using bit streams to parse character
494               references and predefined entities which fail to meet the XML 1.0 well-formedness
495               constraints. This results in the generation of an error flag stream.<!-- PARABIX2_2  -->
496               <!-- REPLACE
497               <table>
498               <caption>
499                  <para>Error Flag Stream Generation</para>
500                  </caption>
501                  <colgroup>
502                     <col align="left" valign="top"/>
503                  </colgroup>
504                  <tbody>
505                     <tr valign="top">
506                        <td>XML</td>
507                        <td>
508                           <code>Well Formed &amp;lt; Erroneous &amp;gt!</code>
509                        </td>
510                     </tr>
511                     <tr valign="top">
512                        <td>RefStart</td>
513                        <td>
514                           <code></code>
515                        </td>
516                     </tr>
517                     <tr valign="top">
518                        <td>RefEnd</td>
519                        <td>
520                           <code></code>
521                        </td>
522                     </tr>
523                     <tr valign="top">
524                        <td>RefError</td>
525                        <td>
526                           <code></code>
527                        </td>
528                     </tr>
529                  </tbody>
530               </table>
531               -->
532            </para>
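            <para>One way an error flag stream for references might be computed is sketched below in
               Python. This is a simplified illustration under several assumptions, not the Parabix
               implementation: streams are Python integers with bit <emphasis>i</emphasis> standing
               for character position <emphasis>i</emphasis>, the helper names
               <code>char_class</code> and <code>ref_error</code> are hypothetical, reference names
               are taken to be runs of lowercase ASCII letters, and only the closing semicolon is
               checked. The scan through each name uses integer addition, so that the carry moves a
               cursor past a whole run of name characters in one operation.</para>
            <programlisting><![CDATA[
```python
def char_class(text, chars):
    """Bit i is set when text[i] is one of chars (bit 0 = first character)."""
    return sum(1 << i for i, ch in enumerate(text) if ch in chars)


def ref_error(text):
    """Flag the position where a reference should have ended with ';' but did not."""
    ref_start = char_class(text, "&")
    name = char_class(text, "abcdefghijklmnopqrstuvwxyz")
    semi = char_class(text, ";")
    cursor = ref_start << 1                 # first position after each '&'
    name_end = (cursor + name) & ~name      # carry each cursor through its name run
    return name_end & ~semi                 # a reference must end with ';'


text = "Well Formed &lt; Erroneous &gt!"
errors = ref_error(text)
print(text)
print("".join("1" if (errors >> i) & 1 else "." for i in range(len(text))))
```
]]></programlisting>
            <para>For this input the only flagged position is that of the <code>!</code> character,
               where a semicolon was expected.</para>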
533
534         </section>
535
536      </section>
537
538      <section>
539         <title>Lexical Item Streams</title>
540         <para>Lexical item streams differ from traditional streams of tokens in that they are bit
541            streams that mark the positions of tokens, whitespace or delimiters. Additional bit
542            streams, such as the reference streams and callout streams, are subsequently constructed
543            based on the information held within the set of lexical item streams. Differentiation
544            between the actual tokens that may occur at a particular point (e.g., the different XML
545            tokens that begin with “&lt;”) may be performed using multicharacter recognizers on the
546            bytestream representation [<xref linkend="CASCON08"/>].</para>
547         <para>A key role of lexical item streams in XML parsing is to facilitate fast scanning
548            operations. For example, a LeftAngle lexical item stream may be formed to identify those
549            character code unit positions at which a “&lt;” character occurs. Hardware register
550            bit scan operations may then be used by the XML parser on the LeftAngle stream to
551            efficiently identify the position of the next “&lt;”. Based on the capabilities of
552            current commodity processors, a single register bit scan operation may effectively scan
553            up to 64 byte positions with a single instruction.</para>
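         <para>The scanning step can be sketched in Python as follows. The names
            <code>langle_stream</code> and <code>scan_to_next</code> are hypothetical, and an
            unbounded Python integer stands in for the 64-bit registers on which a hardware
            bit scan would operate. Bit <emphasis>i</emphasis> of the stream corresponds to
            character position <emphasis>i</emphasis>.</para>
         <programlisting><![CDATA[
```python
def langle_stream(text):
    """Bit i is set when text[i] is a left angle bracket."""
    return sum(1 << i for i, ch in enumerate(text) if ch == "<")


def scan_to_next(stream, pos):
    """Return the first marked position at or after pos, or -1 if none remains."""
    remaining = stream >> pos
    if remaining == 0:
        return -1
    lowest = remaining & -remaining         # isolate the lowest set bit
    return pos + lowest.bit_length() - 1    # its index, as a forward bit scan reports


text = "<doc>some text<b>bold</b></doc>"
stream = langle_stream(text)
positions = []
pos = scan_to_next(stream, 0)
while pos != -1:
    positions.append(pos)
    pos = scan_to_next(stream, pos + 1)
print(positions)
```
]]></programlisting>
         <para>Each call to <code>scan_to_next</code> skips all intervening text in one step,
            regardless of how many characters separate consecutive markup positions.</para>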
554         <para>Overall, computing the full set of lexical item streams
555            requires approximately 1.0 CPU cycles per byte when implemented for 128 positions at a
556            time using 128-bit SSE registers on Intel Core 2 processors [<xref linkend="CASCON08"/>].
557            The following table describes the core lexical item streams defined by the Parabix XML
558            parser.</para>
559         <para>
560            <table>
561               <caption>
562                  <para>Lexical item stream descriptions.</para>
563               </caption>
564               <tbody>
565                  <tr>
566                     <td align="left"> LAngle </td>
567                     <td align="left"> Marks the position of any left angle bracket character.</td>
568                  </tr>
569                  <tr>
570                     <td align="left"> RAngle </td>
571                     <td align="left"> Marks the position of any right angle bracket character.</td>
572                  </tr>
573                  <tr>
574                     <td align="left"> LBracket </td>
575                     <td align="left"> Marks the position of any left square bracket character.</td>
576                  </tr>
577                  <tr>
578                     <td align="left"> RBracket </td>
579                     <td align="left"> Marks the position of any right square bracket
580                     character.</td>
581                  </tr>
582                  <tr>
583                     <td align="left"> Exclam </td>
584                     <td align="left"> Marks the position of any exclamation mark character.</td>
585                  </tr>
586                  <tr>
587                     <td align="left"> QMark </td>
588                     <td align="left"> Marks the position of any question mark character.</td>
589                  </tr>
590                  <tr>
591                     <td align="left"> Hyphen </td>
592                     <td align="left"> Marks the position of any hyphen character.</td>
593                  </tr>
594                  <tr>
595                     <td align="left"> Equals </td>
596                     <td align="left"> Marks the position of any equal sign character.</td>
597                  </tr>
598                  <tr>
599                     <td align="left"> SQuote </td>
600                     <td align="left"> Marks the position of any single quote character.</td>
601                  </tr>
602                  <tr>
603                     <td align="left"> DQuote </td>
604                     <td align="left"> Marks the position of any double quote character.</td>
605                  </tr>
606                  <tr>
607                     <td align="left"> Slash </td>
608                     <td align="left"> Marks the position of any forward slash character.</td>
609                  </tr>
610                  <tr>
611                     <td align="left"> NameScan </td>
612                     <td align="left"> Marks the position of any XML name character.</td>
613                  </tr>
614                  <tr>
615                     <td align="left"> WS </td>
616                     <td align="left"> Marks the position of any XML 1.0 whitespace character.</td>
617                  </tr>
618                  <tr>
619                     <td align="left"> PI_start </td>
620                     <td align="left"> Marks the position of the start of any processing instruction
621                        at the '?' character position.</td>
622                  </tr>
623                  <tr>
624                     <td align="left"> PI_end </td>
625                     <td align="left"> Marks the position of the end of any processing instruction
626                        at the '>' character position.</td>
627                  </tr>
628                  <tr>
629                     <td align="left"> CtCD_start </td>
630                     <td align="left"> Marks the position of the start of any comment or CDATA
631                        section at the '!' character position.</td>
632                  </tr>
633                  <tr>
634                     <td align="left"> EndTag_start </td>
635                     <td align="left"> Marks the position of any end tag at the '/' character
636                        position.</td>
637                  </tr>
638                  <tr>
639                     <td align="left"> CD_end </td>
640                     <td align="left"> Marks the position of the end of any CDATA section at the '>'
641                        character position. </td>
642                  </tr>
643                  <tr>
644                     <td align="left"> DoubleHyphen </td>
645                     <td align="left"> Marks the position of any double hyphen sequence.</td>
646                  </tr>
647                  <tr>
648                     <td align="left"> RefStart </td>
649                     <td align="left"> Marks the position of any ampersand character.</td>
650                  </tr>
651                  <tr>
652                     <td align="left"> Hash </td>
653                     <td align="left"> Marks the position of any hash character.</td>
654                  </tr>
655                  <tr>
656                     <td align="left"> x </td>
657                     <td align="left"> Marks the position of any 'x' character.</td>
658                  </tr>
659                  <tr>
660                     <td align="left"> Digit </td>
661                     <td align="left"> Marks the position of any digit character.</td>
662                  </tr>
663                  <tr>
664                     <td align="left"> Hex </td>
665                     <td align="left"> Marks the position of any hexadecimal digit character.</td>
666                  </tr>
667                  <tr>
668                     <td align="left"> Semicolon </td>
669                     <td align="left"> Marks the position of any semicolon character.</td>
670                  </tr>
671               </tbody>
672            </table>
673         </para>
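         <para>As a minimal sketch of how some of these streams relate to one another, the
            following Python fragment models each bit stream as an integer in which bit i
            corresponds to input position i. The helper names (<code>char_stream</code>,
            <code>advance</code>, <code>show</code>) and the derivation of the composite streams
            from the single-character streams are illustrative assumptions, not the Parabix
            implementation, which computes such streams with SIMD operations on blocks of bit
            stream data.</para>
         <programlisting><![CDATA[
```python
def char_stream(data: bytes, chars: bytes) -> int:
    """Return a bit stream with bit i set when data[i] is one of chars."""
    stream = 0
    for i, byte in enumerate(data):
        if byte in chars:
            stream |= 1 << i
    return stream


def advance(stream: int) -> int:
    """Move every marker one position forward in the text."""
    return stream << 1


def show(stream: int, length: int) -> str:
    """Render a stream left to right, one symbol per input byte."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(length))


data = b"<?pi x?><a b='1'/><!-- c --></a>"

# Single-character lexical item streams.
LAngle = char_stream(data, b"<")
RAngle = char_stream(data, b">")
QMark = char_stream(data, b"?")
Exclam = char_stream(data, b"!")
Slash = char_stream(data, b"/")

# Assumed derivations: the character right after '<' selects the item kind,
# and a '>' right after '?' closes a processing instruction.
PI_start = advance(LAngle) & QMark
CtCD_start = advance(LAngle) & Exclam
EndTag_start = advance(LAngle) & Slash
PI_end = advance(QMark) & RAngle

print(data.decode("ascii"))
for name, stream in [("PI_start", PI_start), ("PI_end", PI_end),
                     ("CtCD_start", CtCD_start), ("EndTag_start", EndTag_start)]:
    print(show(stream, len(data)), name)
```
]]></programlisting>
         <para>In this sketch the slash of the empty-element tag is not marked in
            <code>EndTag_start</code>, because no left angle bracket immediately precedes it. The
            sketch ignores context, so a sequence such as '?>' inside a comment would also be
            marked; a real parser has to resolve such cases separately.</para>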
674         <para>The following table illustrates a number of the lexical item streams.
675            <!--
676            <table>
677               <caption>
678                  <para>Lexical Item Streams</para>
679               </caption>
680           
681               
682               <colgroup>
683                  <col align="left" valign="top"/>
684               </colgroup>
685
686               </tbody>
687            </table>
688               -->
689         </para>
690      </section>
691
692      <section>
693         <title>UTF-8 Byte Classification, Scope and Validation Streams</title>
694         <para> An XML parser must accept the UTF-8 encoding of Unicode [<xref linkend="XML10"/>].
695            It is a fatal error if an XML document determined to be in UTF-8 contains byte sequences
696            that are not legal in that encoding. UTF-8 byte classification, scope and error flag bit
697            streams are defined to validate UTF-8 byte sequences and support transcoding to UTF-16
698            if desired.</para>
699
700         <section>
701            <title>UTF-8 Byte Classification Streams</title>
702            <para>UTF-8 byte classification bit streams classify UTF-8 bytes based on their role in
703               forming single and multibyte sequences. The u8Prefix and u8Suffix bit streams
704               identify bytes that represent, respectively, prefix or suffix bytes of multibyte
705               sequences. The u8UniByte bit stream identifies those bytes that may be considered
706               single-byte sequences. The u8Prefix2, u8Prefix3, and u8Prefix4 bit streams refine
707               u8Prefix, indicating prefixes of two-, three-, and four-byte sequences, respectively.</para>
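            <para>The classification can be sketched byte by byte as follows. This is only an
               illustration under stated assumptions: the function name <code>classify</code> is
               hypothetical, the byte ranges are the usual UTF-8 lead and continuation byte
               patterns, and each stream is rendered as a string of 0 and 1 characters rather
               than computed in parallel from the basis bit streams.</para>
            <programlisting><![CDATA[
```python
def classify(data: bytes) -> dict:
    """Return the UTF-8 byte classification streams as strings of '0'/'1'."""
    tests = {
        "u8UniByte": lambda b: b < 0x80,           # 0xxxxxxx
        "u8Suffix":  lambda b: 0x80 <= b < 0xC0,   # 10xxxxxx
        "u8Prefix":  lambda b: b >= 0xC0,          # 11xxxxxx
        "u8Prefix2": lambda b: 0xC0 <= b < 0xE0,   # 110xxxxx
        "u8Prefix3": lambda b: 0xE0 <= b < 0xF0,   # 1110xxxx
        "u8Prefix4": lambda b: b >= 0xF0,          # 1111xxxx (assumed range)
    }
    return {name: "".join("1" if test(b) else "0" for b in data)
            for name, test in tests.items()}


# The 38 sample bytes used in the scope stream table of this paper.
sample = bytes.fromhex(
    "41 20 54 65 78 74 20 69 6E 20 46 61 72 73 69 3A "
    "D9 89 D8 B3 D8 B1 D8 A7 D9 81 20 D9 86 D8 AA D9 85 20 D9 83 D9 89"
)

streams = classify(sample)
for name, bits in streams.items():
    print(f"{name:10s} {bits}")
```
]]></programlisting>
            <para>Bytes that can never appear in well-formed UTF-8 still receive a
               classification here; rejecting them is left to the validation step.</para>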
708         </section>
709
710         <section>
711            <title>UTF-8 Scope Streams</title>
712            <para> Scope streams represent expectations established by prefix bytes. For example,
713               bit stream u8Scope22 represents the positions at which a second byte of a two-byte
714               sequence is expected based on the occurrence of a two-byte prefix in the immediately
715               preceding position. The u8Scope32, u8Scope33, u8Scope42, u8Scope43, and u8Scope44
716               bit streams complete the set of UTF-8 scope streams.</para>
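            <para>A minimal sketch of the scope streams follows, assuming each is obtained by
               advancing a prefix stream, or an earlier scope stream, by one position. Bit
               streams are modelled as Python integers with bit i standing for byte position i.
               The helper names <code>prefix_streams</code>, <code>scope_streams</code> and
               <code>show</code> are hypothetical.</para>
            <programlisting><![CDATA[
```python
def prefix_streams(data: bytes):
    """Return (u8Prefix2, u8Prefix3, u8Prefix4) as integer bit streams."""
    p2 = p3 = p4 = 0
    for i, byte in enumerate(data):
        if 0xC0 <= byte < 0xE0:
            p2 |= 1 << i
        elif 0xE0 <= byte < 0xF0:
            p3 |= 1 << i
        elif byte >= 0xF0:
            p4 |= 1 << i
    return p2, p3, p4


def scope_streams(data: bytes) -> dict:
    """Derive the six scope streams by shifting the prefix streams forward."""
    p2, p3, p4 = prefix_streams(data)
    scopes = {}
    scopes["u8Scope22"] = p2 << 1
    scopes["u8Scope32"] = p3 << 1
    scopes["u8Scope33"] = scopes["u8Scope32"] << 1
    scopes["u8Scope42"] = p4 << 1
    scopes["u8Scope43"] = scopes["u8Scope42"] << 1
    scopes["u8Scope44"] = scopes["u8Scope43"] << 1
    return scopes


def show(stream: int, length: int) -> str:
    """Render a stream left to right, one digit per input byte."""
    return "".join("1" if (stream >> i) & 1 else "0" for i in range(length))


# One character each of one, two, three and four bytes:
# 61 | C3 A9 | E2 82 AC | F0 9F 98 80
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
scopes = scope_streams(data)
for name, stream in scopes.items():
    print(f"{name} {show(stream, len(data))}")
```
]]></programlisting>
            <para>Each scope stream marks where a suffix byte is expected, whether or not one is
               actually present at that position.</para>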
717            <para> The following examples demonstrate the UTF-8 character encoding validation
718               process using parallel bit stream techniques. The result of this validation process
719               is an error flag stream marking those positions at which errors are detected.</para>
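            <para>One part of this validation can be sketched as a comparison between the
               positions at which suffix bytes are expected and the positions at which they
               occur. The fragment below is an assumption-laden illustration, not the Parabix
               code: the function names are hypothetical, and it detects only scope mismatches.
               Complete UTF-8 validation must also reject overlong forms, encoded surrogates and
               code points beyond U+10FFFF, which this sketch does not attempt.</para>
            <programlisting><![CDATA[
```python
def utf8_scope_errors(data: bytes) -> int:
    """Return an error flag stream (bit i = position i) for scope mismatches."""
    prefix2 = prefix3 = prefix4 = suffix = 0
    for i, byte in enumerate(data):
        bit = 1 << i
        if 0x80 <= byte < 0xC0:
            suffix |= bit
        elif 0xC0 <= byte < 0xE0:
            prefix2 |= bit
        elif 0xE0 <= byte < 0xF0:
            prefix3 |= bit
        elif byte >= 0xF0:
            prefix4 |= bit

    # Scope streams: positions at which a suffix byte is expected.
    scope22 = prefix2 << 1
    scope32 = prefix3 << 1
    scope33 = scope32 << 1
    scope42 = prefix4 << 1
    scope43 = scope42 << 1
    scope44 = scope43 << 1
    any_scope = scope22 | scope32 | scope33 | scope42 | scope43 | scope44

    # An error is a suffix where none is expected, or an expected suffix
    # that is missing. Bits at or beyond len(data) flag a truncated sequence.
    return any_scope ^ suffix


def error_positions(data: bytes) -> list:
    """List the positions flagged in the error stream."""
    errors = utf8_scope_errors(data)
    return [i for i in range(errors.bit_length()) if (errors >> i) & 1]


valid = bytes.fromhex("3A D9 89 D8 B3 20 D9 86")
broken = b"a\xc3b\xa9"     # prefix without suffix, then a stray suffix
truncated = b"a\xc3"       # input ends inside a two-byte sequence

print(error_positions(valid))
print(error_positions(broken))
print(error_positions(truncated))
```
]]></programlisting>
            <para>The exclusive-or flags both kinds of mismatch with a single operation, which
               suits a bit stream setting where every position is tested at once.</para>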
720            <para>
721               <table>
722                  <caption>
723                     <para>UTF-8 Scope Streams</para>
724                  </caption>
725                  <colgroup>
726                     <col align="left" valign="top"/>
727                     <col align="left" valign="top"/>
728                     <col align="left" valign="top"/>
729                     <col align="left" valign="top"/>
730                     <col align="left" valign="top"/>
731                     <col align="left" valign="top"/>
732                     <col align="left" valign="top"/>
733                     <col align="left" valign="top"/>
734                     <col align="left" valign="top"/>
735                     <col align="left" valign="top"/>
736                     <col align="left" valign="top"/>
737                     <col align="left" valign="top"/>
738                     <col align="left" valign="top"/>
739                     <col align="left" valign="top"/>
740                     <col align="left" valign="top"/>
741                     <col align="left" valign="top"/>
742                     <col align="left" valign="top"/>
743                     <col align="left" valign="top"/>
744                     <col align="left" valign="top"/>
745                     <col align="left" valign="top"/>
746                     <col align="left" valign="top"/>
747                     <col align="left" valign="top"/>
748                     <col align="left" valign="top"/>
749                     <col align="left" valign="top"/>
750                     <col align="left" valign="top"/>
751                     <col align="left" valign="top"/>
752                     <col align="left" valign="top"/>
753                     <col align="left" valign="top"/>
754                  </colgroup>
755                  <tbody>
756                     <tr valign="top">
757                        <td>Input Data</td>
758                        <td colspan="1">
759                           <code>A</code>
760                        </td>
761                        <td colspan="1">
762                           <code> </code>
763                        </td>
764                        <td colspan="1">
765                           <code>T</code>
766                        </td>
767                        <td colspan="1">
768                           <code>e</code>
769                        </td>
770                        <td colspan="1">
771                           <code>x</code>
772                        </td>
773                        <td colspan="1">
774                           <code>t</code>
775                        </td>
776                        <td colspan="1">
777                           <code> </code>
778                        </td>
779                        <td colspan="1">
780                           <code>i</code>
781                        </td>
782                        <td colspan="1">
783                           <code>n</code>
784                        </td>
785                        <td colspan="1">
786                           <code> </code>
787                        </td>
788                        <td colspan="1">
789                           <code>F</code>
790                        </td>
791                        <td colspan="1">
792                           <code>a</code>
793                        </td>
794                        <td colspan="1">
795                           <code>r</code>
796                        </td>
797                        <td colspan="1">
798                           <code>s</code>
799                        </td>
800                        <td colspan="1">
801                           <code>i</code>
802                        </td>
803                        <td colspan="1">
804                           <code>:</code>
805                        </td>
806                        <td colspan="2">
807                           <code>ى</code>
808                        </td>
809                        <td colspan="2">
810                           <code>س</code>
811                        </td>
812                        <td colspan="2">
813                           <code>ر</code>
814                        </td>
815                        <td colspan="2">
816                           <code>ا</code>
817                        </td>
818                        <td colspan="2">
819                           <code>ف</code>
820                        </td>
821                        <td colspan="1">
822                           <code> </code>
823                        </td>
824                        <td colspan="2">
825                           <code>ن</code>
826                        </td>
827                        <td colspan="2">
828                           <code>ت</code>
829                        </td>
830                        <td colspan="2">
831                           <code>م</code>
832                        </td>
833                        <td colspan="1">
834                           <code> </code>
835                        </td>
836                        <td colspan="2">
837                           <code>ك</code>
838                        </td>
839                        <td colspan="2">
840                           <code>ى</code>
841                        </td>
842                     </tr>
843                     <tr valign="top">
844                        <td>UTF-8</td>
845                        <td>
846                           <code>41</code>
847                        </td>
848                        <td>
849                           <code>20</code>
850                        </td>
851                        <td>
852                           <code>54</code>
853                        </td>
854                        <td>
855                           <code>65</code>
856                        </td>
857                        <td>
858                           <code>78</code>
859                        </td>
860                        <td>
861                           <code>74</code>
862                        </td>
863                        <td>
864                           <code>20</code>
865                        </td>
866                        <td>
867                           <code>69</code>
868                        </td>
869                        <td>
870                           <code>6E</code>
871                        </td>
872                        <td>
873                           <code>20</code>
874                        </td>
875                        <td>
876                           <code>46</code>
877                        </td>
878                        <td>
879                           <code>61</code>
880                        </td>
881                        <td>
882                           <code>72</code>
883                        </td>
884                        <td>
885                           <code>73</code>
886                        </td>
887                        <td>
888                           <code>69</code>
889                        </td>
890                        <td>
891                           <code>3A</code>
892                        </td>
893                        <td>
894                           <code>D9</code>
895                        </td>
896                        <td>
897                           <code>89</code>
898                        </td>
899                        <td>
900                           <code>D8</code>
901                        </td>
902                        <td>
903                           <code>B3</code>
904                        </td>
905                        <td>
906                           <code>D8</code>
907                        </td>
908                        <td>
909                           <code>B1</code>
910                        </td>
911                        <td>
912                           <code>D8</code>
913                        </td>
914                        <td>
915                           <code>A7</code>
916                        </td>
917                        <td>
918                           <code>D9</code>
919                        </td>
920                        <td>
921                           <code>81</code>
922                        </td>
923                        <td>
924                           <code>20</code>
925                        </td>
926                        <td>
927                           <code>D9</code>
928                        </td>
929                        <td>
930                           <code>86</code>
931                        </td>
932                        <td>
933                           <code>D8</code>
934                        </td>
935                        <td>
936                           <code>AA</code>
937                        </td>
938                        <td>
939                           <code>D9</code>
940                        </td>
941                        <td>
942                           <code>85</code>
943                        </td>
944                        <td>
945                           <code>20</code>
946                        </td>
947                        <td>
948                           <code>D9</code>
949                        </td>
950                        <td>
951                           <code>83</code>
952                        </td>
953                        <td>
954                           <code>D9</code>
955                        </td>
956                        <td>
957                           <code>89</code>
958                        </td>
959                     </tr>
960                     <tr valign="top">
961                        <td>u8UniByte</td>
962                        <td>
963                           <code>1</code>
964                        </td>
965                        <td>
966                           <code>1</code>
967                        </td>
968                        <td>
969                           <code>1</code>
970                        </td>
971                        <td>
972                           <code>1</code>
973                        </td>
974                        <td>
975                           <code>1</code>
976                        </td>
977                        <td>
978                           <code>1</code>
979                        </td>
980                        <td>
981                           <code>1</code>
982                        </td>
983                        <td>
984                           <code>1</code>
985                        </td>
986                        <td>
987                           <code>1</code>
988                        </td>
989                        <td>
990                           <code>1</code>
991                        </td>
992                        <td>
993                           <code>1</code>
994                        </td>
995                        <td>
996                           <code>1</code>
997                        </td>
998                        <td>
999                           <code>1</code>
1000                        </td>
1001                        <td>
1002                           <code>1</code>
1003                        </td>
1004                        <td>
1005                           <code>1</code>
1006                        </td>
1007                        <td>
1008                           <code>1</code>
1009                        </td>
1010                        <td>
1011                           <code>0</code>
1012                        </td>
1013                        <td>
1014                           <code>0</code>
1015                        </td>
1016                        <td>
1017                           <code>0</code>
1018                        </td>
1019                        <td>
1020                           <code>0</code>
1021                        </td>
1022                        <td>
1023                           <code>0</code>
1024                        </td>
1025                        <td>
1026                           <code>0</code>
1027                        </td>
1028                        <td>
1029                           <code>0</code>
1030                        </td>
1031                        <td>
1032                           <code>0</code>
1033                        </td>
1034                        <td>
1035                           <code>0</code>
1036                        </td>
1037                        <td>
1038                           <code>0</code>
1039                        </td>
1040                        <td>
1041                           <code>1</code>
1042                        </td>
1043                        <td>
1044                           <code>0</code>
1045                        </td>
1046                        <td>
1047                           <code>0</code>
1048                        </td>
1049                        <td>
1050                           <code>0</code>
1051                        </td>
1052                        <td>
1053                           <code>0</code>
1054                        </td>
1055                        <td>
1056                           <code>0</code>
1057                        </td>
1058                        <td>
1059                           <code>0</code>
1060                        </td>
1061                        <td>
1062                           <code>1</code>
1063                        </td>
1064                        <td>
1065                           <code>0</code>
1066                        </td>
1067                        <td>
1068                           <code>0</code>
1069                        </td>
1070                        <td>
1071                           <code>0</code>
1072                        </td>
1073                        <td>
1074                           <code>0</code>
1075                        </td>
1076                     </tr>
1077                     <tr valign="top">
1078                        <td>u8Prefix</td>
1079                        <td>
1080                           <code>0</code>
1081                        </td>
1082                        <td>
1083                           <code>0</code>
1084                        </td>
1085                        <td>
1086                           <code>0</code>
1087                        </td>
1088                        <td>
1089                           <code>0</code>
1090                        </td>
1091                        <td>
1092                           <code>0</code>
1093                        </td>
1094                        <td>
1095                           <code>0</code>
1096                        </td>
1097                        <td>
1098                           <code>0</code>
1099                        </td>
1100                        <td>
1101                           <code>0</code>
1102                        </td>
1103                        <td>
1104                           <code>0</code>
1105                        </td>
1106                        <td>
1107                           <code>0</code>
1108                        </td>
1109                        <td>
1110                           <code>0</code>
1111                        </td>
1112                        <td>
1113                           <code>0</code>
1114                        </td>
1115                        <td>
1116                           <code>0</code>
1117                        </td>
1118                        <td>
1119                           <code>0</code>
1120                        </td>
1121                        <td>
1122                           <code>0</code>
1123                        </td>
1124                        <td>
1125                           <code>0</code>
1126                        </td>
1127                        <td>
1128                           <code>1</code>
1129                        </td>
1130                        <td>
1131                           <code>0</code>
1132                        </td>
1133                        <td>
1134                           <code>1</code>
1135                        </td>
1136                        <td>
1137                           <code>0</code>
1138                        </td>
1139                        <td>
1140                           <code>1</code>
1141                        </td>
1142                        <td>
1143                           <code>0</code>
1144                        </td>
1145                        <td>
1146                           <code>1</code>
1147                        </td>
1148                        <td>
1149                           <code>0</code>
1150                        </td>
1151                        <td>
1152                           <code>1</code>
1153                        </td>
1154                        <td>
1155                           <code>0</code>
1156                        </td>
1157                        <td>
1158                           <code>0</code>
1159                        </td>
1160                        <td>
1161                           <code>1</code>
1162                        </td>
1163                        <td>
1164                           <code>0</code>
1165                        </td>
1166                        <td>
1167                           <code>1</code>
1168                        </td>
1169                        <td>
1170                           <code>0</code>
1171                        </td>
1172                        <td>
1173                           <code>1</code>
1174                        </td>
1175                        <td>
1176                           <code>0</code>
1177                        </td>
1178                        <td>
1179                           <code>0</code>
1180                        </td>
1181                        <td>
1182                           <code>1</code>
1183                        </td>
1184                        <td>
1185                           <code>0</code>
1186                        </td>
1187                        <td>
1188                           <code>1</code>
1189                        </td>
1190                        <td>
1191                           <code>0</code>
1192                        </td>
1193                     </tr>
1194                     <tr valign="top">
1195                        <td>u8Suffix</td>
1196                        <td>
1197                           <code>0</code>
1198                        </td>
1199                        <td>
1200                           <code>0</code>
1201                        </td>
1202                        <td>
1203                           <code>0</code>
1204                        </td>
1205                        <td>
1206                           <code>0</code>
1207                        </td>
1208                        <td>
1209                           <code>0</code>
1210                        </td>
1211                        <td>
1212                           <code>0</code>
1213                        </td>
1214                        <td>
1215                           <code>0</code>
1216                        </td>
1217                        <td>
1218                           <code>0</code>
1219                        </td>
1220                        <td>
1221                           <code>0</code>
1222                        </td>
1223                        <td>
1224                           <code>0</code>
1225                        </td>
1226                        <td>
1227                           <code>0</code>
1228                        </td>
1229                        <td>
1230                           <code>0</code>
1231                        </td>
1232                        <td>
1233                           <code>0</code>
1234                        </td>
1235                        <td>
1236                           <code>0</code>
1237                        </td>
1238                        <td>
1239                           <code>0</code>
1240                        </td>
1241                        <td>
1242                           <code>0</code>
1243                        </td>
1244                        <td>
1245                           <code>0</code>
1246                        </td>
1247                        <td>
1248                           <code>1</code>
1249                        </td>
1250                        <td>
1251                           <code>0</code>
1252                        </td>
1253                        <td>
1254                           <code>1</code>
1255                        </td>
1256                        <td>
1257                           <code>0</code>
1258                        </td>
1259                        <td>
1260                           <code>1</code>
1261                        </td>
1262                        <td>
1263                           <code>0</code>
1264                        </td>
1265                        <td>
1266                           <code>1</code>
1267                        </td>
1268                        <td>
1269                           <code>0</code>
1270                        </td>
1271                        <td>
1272                           <code>1</code>
1273                        </td>
1274                        <td>
1275                           <code>0</code>
1276                        </td>
1277                        <td>
1278                           <code>0</code>
1279                        </td>
1280                        <td>
1281                           <code>1</code>
1282                        </td>
1283                        <td>
1284                           <code>0</code>
1285                        </td>
1286                        <td>
1287                           <code>1</code>
1288                        </td>
1289                        <td>
1290                           <code>0</code>
1291                        </td>
1292                        <td>
1293                           <code>1</code>
1294                        </td>
1295                        <td>
1296                           <code>0</code>
1297                        </td>
1298                        <td>
1299                           <code>0</code>
1300                        </td>
1301                        <td>
1302                           <code>1</code>
1303                        </td>
1304                        <td>
1305                           <code>0</code>
1306                        </td>
1307                        <td>
1308                           <code>1</code>
1309                        </td>
1310                     </tr>
1311                     <tr valign="top">
1312                        <td>u8Prefix2</td>
1313                        <td>
1314                           <code>0</code>
1315                        </td>
1316                        <td>
1317                           <code>0</code>
1318                        </td>
1319                        <td>
1320                           <code>0</code>
1321                        </td>
1322                        <td>
1323                           <code>0</code>
1324                        </td>
1325                        <td>
1326                           <code>0</code>
1327                        </td>
1328                        <td>
1329                           <code>0</code>
1330                        </td>
1331                        <td>
1332                           <code>0</code>
1333                        </td>
1334                        <td>
1335                           <code>0</code>
1336                        </td>
1337                        <td>
1338                           <code>0</code>
1339                        </td>
1340                        <td>
1341                           <code>0</code>
1342                        </td>
1343                        <td>
1344                           <code>0</code>
1345                        </td>
1346                        <td>
1347                           <code>0</code>
1348                        </td>
1349                        <td>
1350                           <code>0</code>
1351                        </td>
1352                        <td>
1353                           <code>0</code>
1354                        </td>
1355                        <td>
1356                           <code>0</code>
1357                        </td>
1358                        <td>
1359                           <code>0</code>
1360                        </td>
1361                        <td>
1362                           <code>1</code>
1363                        </td>
1364                        <td>
1365                           <code>0</code>
1366                        </td>
1367                        <td>
1368                           <code>1</code>
1369                        </td>
1370                        <td>
1371                           <code>0</code>
1372                        </td>
1373                        <td>
1374                           <code>1</code>
1375                        </td>
1376                        <td>
1377                           <code>0</code>
1378                        </td>
1379                        <td>
1380                           <code>1</code>
1381                        </td>
1382                        <td>
1383                           <code>0</code>
1384                        </td>
1385                        <td>
1386                           <code>1</code>
1387                        </td>
1388                        <td>
1389                           <code>0</code>
1390                        </td>
1391                        <td>
1392                           <code>0</code>
1393                        </td>
1394                        <td>
1395                           <code>1</code>
1396                        </td>
1397                        <td>
1398                           <code>0</code>
1399                        </td>
1400                        <td>
1401                           <code>1</code>
1402                        </td>
1403                        <td>
1404                           <code>0</code>
1405                        </td>
1406                        <td>
1407                           <code>1</code>
1408                        </td>
1409                        <td>
1410                           <code>0</code>
1411                        </td>
1412                        <td>
1413                           <code>0</code>
1414                        </td>
1415                        <td>
1416                           <code>1</code>
1417                        </td>
1418                        <td>
1419                           <code>0</code>
1420                        </td>
1421                        <td>
1422                           <code>1</code>
1423                        </td>
1424                        <td>
1425                           <code>0</code>
1426                        </td>
1427                     </tr>
1428                     <tr valign="top">
1429                        <td>u8Scope22</td>
1430                        <td>
1431                           <code>0</code>
1432                        </td>
1433                        <td>
1434                           <code>0</code>
1435                        </td>
1436                        <td>
1437                           <code>0</code>
1438                        </td>
1439                        <td>
1440                           <code>0</code>
1441                        </td>
1442                        <td>
1443                           <code>0</code>
1444                        </td>
1445                        <td>
1446                           <code>0</code>
1447                        </td>
1448                        <td>
1449                           <code>0</code>
1450                        </td>
1451                        <td>
1452                           <code>0</code>
1453                        </td>
1454                        <td>
1455                           <code>0</code>
1456                        </td>
1457                        <td>
1458                           <code>0</code>
1459                        </td>
1460                        <td>
1461                           <code>0</code>
1462                        </td>
1463                        <td>
1464                           <code>0</code>
1465                        </td>
1466                        <td>
1467                           <code>0</code>
1468                        </td>
1469                        <td>
1470                           <code>0</code>
1471                        </td>
1472                        <td>
1473                           <code>0</code>
1474                        </td>
1475                        <td>
1476                           <code>0</code>
1477                        </td>
1478                        <td>
1479                           <code>0</code>
1480                        </td>
1481                        <td>
1482                           <code>1</code>
1483                        </td>
1484                        <td>
1485                           <code>0</code>
1486                        </td>
1487                        <td>
1488                           <code>1</code>
1489                        </td>
1490                        <td>
1491                           <code>0</code>
1492                        </td>
1493                        <td>
1494                           <code>1</code>
1495                        </td>
1496                        <td>
1497                           <code>0</code>
1498                        </td>
1499                        <td>
1500                           <code>1</code>
1501                        </td>
1502                        <td>
1503                           <code>0</code>
1504                        </td>
1505                        <td>
1506                           <code>1</code>
1507                        </td>
1508                        <td>
1509                           <code>0</code>
1510                        </td>
1511                        <td>
1512                           <code>0</code>
1513                        </td>
1514                        <td>
1515                           <code>1</code>
1516                        </td>
1517                        <td>
1518                           <code>0</code>
1519                        </td>
1520                        <td>
1521                           <code>1</code>
1522                        </td>
1523                        <td>
1524                           <code>0</code>
1525                        </td>
1526                        <td>
1527                           <code>1</code>
1528                        </td>
1529                        <td>
1530                           <code>0</code>
1531                        </td>
1532                        <td>
1533                           <code>0</code>
1534                        </td>
1535                        <td>
1536                           <code>1</code>
1537                        </td>
1538                        <td>
1539                           <code>0</code>
1540                        </td>
1541                        <td>
1542                           <code>1</code>
1543                        </td>
1544                     </tr>
1545                     <tr valign="top">
1546                        <td>u8Error</td>
1547                        <td>
1548                           <code>0</code>
1549                        </td>
1550                        <td>
1551                           <code>0</code>
1552                        </td>
1553                        <td>
1554                           <code>0</code>
1555                        </td>
1556                        <td>
1557                           <code>0</code>
1558                        </td>
1559                        <td>
1560                           <code>0</code>
1561                        </td>
1562                        <td>
1563                           <code>0</code>
1564                        </td>
1565                        <td>
1566                           <code>0</code>
1567                        </td>
1568                        <td>
1569                           <code>0</code>
1570                        </td>
1571                        <td>
1572                           <code>0</code>
1573                        </td>
1574                        <td>
1575                           <code>0</code>
1576                        </td>
1577                        <td>
1578                           <code>0</code>
1579                        </td>
1580                        <td>
1581                           <code>0</code>
1582                        </td>
1583                        <td>
1584                           <code>0</code>
1585                        </td>
1586                        <td>
1587                           <code>0</code>
1588                        </td>
1589                        <td>
1590                           <code>0</code>
1591                        </td>
1592                        <td>
1593                           <code>0</code>
1594                        </td>
1595                        <td>
1596                           <code>0</code>
1597                        </td>
1598                        <td>
1599                           <code>0</code>
1600                        </td>
1601                        <td>
1602                           <code>0</code>
1603                        </td>
1604                        <td>
1605                           <code>0</code>
1606                        </td>
1607                        <td>
1608                           <code>0</code>
1609                        </td>
1610                        <td>
1611                           <code>0</code>
1612                        </td>
1613                        <td>
1614                           <code>0</code>
1615                        </td>
1616                        <td>
1617                           <code>0</code>
1618                        </td>
1619                        <td>
1620                           <code>0</code>
1621                        </td>
1622                        <td>
1623                           <code>0</code>
1624                        </td>
1625                        <td>
1626                           <code>0</code>
1627                        </td>
1628                        <td>
1629                           <code>0</code>
1630                        </td>
1631                        <td>
1632                           <code>0</code>
1633                        </td>
1634                        <td>
1635                           <code>0</code>
1636                        </td>
1637                        <td>
1638                           <code>0</code>
1639                        </td>
1640                        <td>
1641                           <code>0</code>
1642                        </td>
1643                        <td>
1644                           <code>0</code>
1645                        </td>
1646                        <td>
1647                           <code>0</code>
1648                        </td>
1649                        <td>
1650                           <code>0</code>
1651                        </td>
1652                        <td>
1653                           <code>0</code>
1654                        </td>
1655                        <td>
1656                           <code>0</code>
1657                        </td>
1658                        <td>
1659                           <code>0</code>
1660                        </td>
1661                     </tr>
1662                  </tbody>
1663               </table>
1664
1665            </para>
1666
1667            <section>
1668               <title>UTF-8 Validation Streams</title>
1669               <para> Proper formation of UTF-8 byte sequences requires that the correct number of
1670                  suffix bytes always follow a UTF-8 prefix byte and that certain illegal
1671                  combinations are ruled out. For example, sequences beginning with the prefix bytes
1672                  0xF5 through 0xFF are illegal as they would represent code point values above
1673                  0x10FFFF. In addition, there are constraints on the first suffix byte following
1674                  certain special prefixes, namely that a suffix following the prefix 0xE0 must fall
1675                  in the range 0xA0–0xBF, a suffix following the prefix 0xED must fall in the range
1676                  0x80–0x9F, a suffix following the prefix 0xF0 must fall in the range 0x90–0xBF
1677                  and a suffix following the prefix 0xF4 must fall in the range 0x80–0x8F. The task
1678                  of ensuring that each of these constraints holds is known as UTF-8 validation. The
1679                  bit streams xE0, xED, xF0, xF4, xA0_xBF, x80_x9F, x90_xBF, and x80_x8F
1680                  are constructed to flag UTF-8 validation errors. The result of UTF-8 validation is
1681                  a UTF-8 error flag bit stream constructed by ORing together a series of UTF-8
1682                  validation tests. </para>
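               <para> As a minimal sketch of how such tests might be combined, the following Python
                  fragment models each bit stream as an arbitrary-precision integer in which bit i
                  corresponds to byte i of the input. The helper names byte_stream and
                  special_prefix_errors are hypothetical. The fragment covers only the prefix and
                  first-suffix constraints listed above; it does not check suffix counts and it does
                  not use SIMD registers. </para>
               <programlisting><![CDATA[
```python
def byte_stream(data, pred):
    """Return an integer whose bit i is 1 when pred(data[i]) holds."""
    bits = 0
    for i, b in enumerate(data):
        if pred(b):
            bits |= 1 << i
    return bits


def special_prefix_errors(data):
    """OR together the special-prefix tests; bit i flags an error at byte i."""
    xE0 = byte_stream(data, lambda b: b == 0xE0)
    xED = byte_stream(data, lambda b: b == 0xED)
    xF0 = byte_stream(data, lambda b: b == 0xF0)
    xF4 = byte_stream(data, lambda b: b == 0xF4)
    xA0_xBF = byte_stream(data, lambda b: 0xA0 <= b <= 0xBF)
    x80_x9F = byte_stream(data, lambda b: 0x80 <= b <= 0x9F)
    x90_xBF = byte_stream(data, lambda b: 0x90 <= b <= 0xBF)
    x80_x8F = byte_stream(data, lambda b: 0x80 <= b <= 0x8F)
    xF5_xFF = byte_stream(data, lambda b: b >= 0xF5)  # assumed name

    # In this integer model, "<< 1" moves each prefix mark forward to the
    # position of its first suffix byte, where the permitted range is tested.
    error = xF5_xFF
    error |= (xE0 << 1) & ~xA0_xBF
    error |= (xED << 1) & ~x80_x9F
    error |= (xF0 << 1) & ~x90_xBF
    error |= (xF4 << 1) & ~x80_x8F
    return error
```
]]></programlisting>
               <para> In this model, a set bit at position len(data) would indicate a special prefix
                  at the very end of the input with no suffix byte following it. </para>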
1683            </section>
1684         </section>
1685
1686         <section>
1687            <title>UTF-8 Surrogate Character Streams</title>
1688            <para> The Unicode code points 0xFFFF and 0xFFFE, which are not legal XML characters,
1689               correspond to the UTF-8 encodings 0xEF 0xBF 0xBF and 0xEF 0xBF 0xBE respectively. As
1690               such, bit streams xEF, xBF, and xBE are constructed to flag these illegal characters
1691               in XML as part of the XML character validation process. </para>
1692         </section>
1693
1694         <section>
1695            <title>UTF-8 to UTF-16 Transcoding</title>
1696            <para>UTF-8 is often preferred for storage and data exchange. It is also suitable for
1697               processing, but it is significantly more complex to process than UTF-16 [<xref
1698                  linkend="Unicode"/>]. As such, XML documents are typically encoded in UTF-8 for
1699               serialization and transport, and subsequently transcoded to UTF-16 for processing
1700               with programming languages such as Java and C#. Following the parallel bit stream
1701               methods developed for the u8u16 transcoder, a high-performance standalone UTF-8 to
1702               UTF-16 transcoder [<xref linkend="u8u16"/>], transcoding to UTF-16 may be achieved by
1703               computing a series of 16 bit streams, one stream for each of the individual bits of a
1704               UTF-16 code unit. </para>
1705            <para>The bit streams for UTF-16 are conveniently divided into groups: the eight streams
1706               u16Hi0, u16Hi1, ..., u16Hi7 for the high byte of each UTF-16 code unit and the eight
1707               streams u16Lo0, u16Lo1, ..., u16Lo7 for the low byte. Upon conversion of the parallel
1708               bit stream data back to byte streams, eight sequential byte streams U16Hi0, U16Hi1, ...,
1709               U16Hi7 are used for the high byte of each UTF-16 code unit, while U16Lo0, U16Lo1, ...,
1710               U16Lo7 are used for the corresponding low byte. Interleaving these streams then
1711               produces the full UTF-16 doublebyte stream.</para>
1712         </section>
1713
1714         <section>
1715            <title>UTF-8 Indexed UTF-16 Streams</title>
1716            <para>UTF-16 bit streams are initially defined in UTF-8 indexed form, that is, with sets
1717               of bits in one-to-one correspondence with UTF-8 bytes. However, only one set of
1718               UTF-16 bits is required for encoding two or three-byte UTF-8 sequences and only two
1719               sets are required for surrogate pairs corresponding to four-byte UTF-8 sequences. The
1720               u8LastByte (u8UniByte, u8Scope22, u8Scope33, and u8Scope44) and u8Scope42 streams
1721               mark the positions at which the correct UTF-16 bits are computed. The bit sets at
1722               other positions must be deleted to compress the streams to UTF-16 indexed form.
1723            </para>
1724         </section>
1725      </section>
1726
1727      <section>
1728         <title>Control Character Streams</title>
1729         <para>The control character bit streams mark ASCII control characters in the range
1730            x00-x1F. Additional control character bit streams mark the tab, carriage return, line
1731            feed, and space characters. An additional bit stream to mark carriage return line feed
1732            combinations is also constructed. Control character bit streams support the operations
1733            of XML character validation and XML end-of-line handling.</para>
1734
1735         <section>
1736            <title>XML Character Validation</title>
1737            <para>Legal characters in XML are the tab, carriage return, and line feed characters
1738               together with all Unicode characters from 0x20 upward, excluding the surrogate blocks,
1739               0xFFFE and 0xFFFF [<xref linkend="XML10"/>]. The x00_x1F bit stream is constructed and
1740               used in combination with additional control character bit streams to flag illegal
1741               control characters in XML. Bit stream XML character validation results in the
1742               production of a bit stream error mask. </para>
1743         </section>
1744
1745         <section>
1746            <title>XML 1.0 End-of-line Handling</title>
1747            <para>In XML 1.0 the two-character sequence CR LF (carriage return, line feed) together
1748               with any CR character not followed by a LF character must be converted to a single LF
1749               character [<xref linkend="XML10"/>].</para>
1750            <para>By defining carriage return, line feed, and carriage return line feed bit streams,
1751               denoted CR, LF and CRLF respectively, end-of-line normalization processing can be
1752               performed in parallel, using only a small number of logical and shift operations.</para>
1753            <para/>
1754            <para>The following example demonstrates the generation of the CRLF deletion mask, with
1755               C and L standing for the CR and LF characters in the input data. The positions of all
1756               CR characters followed by LF characters are marked for deletion. Isolated carriage
1757               returns are then replaced with LF characters, satisfying the XML 1.0 end-of-line handling requirements.</para>
1758            <para>
1759               <table>
1760                  <caption>
1761                     <para>XML 1.0 End-of-line Handling</para>
1762                  </caption>
1763                  <colgroup>
1764                     <col align="left" valign="top"/>
1765                  </colgroup>
1766                  <tbody>
1767                     <tr valign="top">
1768                        <td>Input Data</td>
1769                        <td>
1770                           <code>first line C second line CL third line L one more C nothing
1771                           left</code>
1772                        </td>
1773                     </tr>
1774                     <tr valign="top">
1775                        <td>CR</td>
1776                        <td>
1777                           <code>-----------1-------------1------------------------1-------------</code>
1778                        </td>
1779                     </tr>
1780                     <tr valign="top">
1781                        <td>LF</td>
1782                        <td>
1783                           <code>--------------------------1------------1------------------------</code>
1784                        </td>
1785                     </tr>
1786                     <tr valign="top">
1787                        <td>Delmask</td>
1788                        <td>
1789                           <code>-------------------------1--------------------------------------</code>
1790                        </td>
1791                     </tr>
1792                  </tbody>
1793               </table>
1794
1795            </para>
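            <para>The table above can be reproduced with a short Python sketch in which each bit
               stream is modelled as an integer whose bit i corresponds to input position i. The
               helper names char_stream and show are hypothetical, and the letters C and L stand in
               for CR and LF exactly as in the table. This is an illustration of the logic only, not
               of the SIMD implementation.</para>
            <programlisting><![CDATA[
```python
def char_stream(text, ch):
    """Return an integer whose bit i is 1 when text[i] == ch."""
    bits = 0
    for i, c in enumerate(text):
        if c == ch:
            bits |= 1 << i
    return bits


def show(bits, length):
    """Render a bit stream in the table's notation, position 0 first."""
    return "".join("1" if (bits >> i) & 1 else "-" for i in range(length))


text = "first line C second line CL third line L one more C nothing left"
CR = char_stream(text, "C")
LF = char_stream(text, "L")

# A CR is deleted when the following position holds an LF. Shifting LF
# right by one aligns each LF with the position immediately before it.
delmask = CR & (LF >> 1)

# Every remaining CR is isolated and is rewritten as LF.
isolated_cr = CR & ~delmask
normalized = "".join(
    "L" if (isolated_cr >> i) & 1 else c
    for i, c in enumerate(text)
    if not (delmask >> i) & 1
)

print(show(CR, len(text)))
print(show(LF, len(text)))
print(show(delmask, len(text)))
print(normalized)
```
]]></programlisting>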
1796         </section>
1797
1798      </section>
1799
1800      <!-- Comment Processing Instruction and CDATA Section Streams ??? -->
1801
1802      <section>
1803         <title>Comment, Processing Instruction, CDATA Section Streams</title>
1804         <para>Comments, processing instructions and CDATA sections represent sections of an XML
1805            document which may contain markup that is not interpreted by the XML processor. As such,
1806            the union of comment, processing instruction and CDATA section extents defines the regions
1807            of non-interpretable markup in an XML document. The stream formed by this union is termed
1808            the ignorable markup stream. The purpose of the ignorable markup stream is to mark
1809            the positions of all non-interpreted XML markup for deletion.</para>
1810         <para>The following table provides an example of marking comment, processing instruction and CDATA section extents. <table>
1811               <caption>
1812                  <para>Comment, Processing Instruction and CDATA Streams</para>
1813               </caption>
1814               <colgroup>
1815                  <col align="left" valign="top"/>
1816               </colgroup>
1817               <tbody>
1818                  <tr valign="top">
1819                     <td>Input Data</td>
1820                     <td>
1821                        <code>&lt;!-- do a&amp;b --&gt; &lt;?php f(a&amp;b)
1822                           ?&gt; &lt;!-- show x&lt;&lt;1
1823                           --&gt;&lt;![CDATA[abcdedf x&lt;&lt;1 ]]&gt;</code>
1824                     </td>
1825                  </tr>
1826                  <tr valign="top">
1827                     <td>Comment</td>
1828                     <td>
1829                        <code>111111111111111-----------------111111111111111111-------------------------</code>
1830                     </td>
1831                  </tr>
1832                  <tr valign="top">
1833                     <td>CDATA</td>
1834                     <td>
1835                        <code>--------------------------------------------------1111111111111111111111111</code>
1836                     </td>
1837                  </tr>
1838                  <tr valign="top">
1839                     <td>PI</td>
1840                     <td>
1841                        <code>----------------111111111111111--------------------------------------------</code>
1842                     </td>
1843                  </tr>
1844               </tbody>
1845            </table>
1846         </para>
1847         <para> With the removal of all non-interpretable markup, several phases of parallel bit
1848            stream based SIMD operations may follow, each operating in parallel on up to 128 byte
1849            positions on current commodity processors and assured of XML markup relevancy. For
1850            example, with the removal of comments, processing instructions and CDATA sections, XML
1851            names may be identified and length sorted for efficient symbol table construction. </para>
1852         <para> As an aside, comments and CDATA sections must first be validated to ensure that
1853            comments do not contain "--" sequences and that CDATA sections do not contain
1854            "]]&gt;" sequences prior to ignorable markup stream generation.</para>
1855      </section>
1856
1857
1858      <section>
1859         <title>Predefined Entity Deletion Streams</title>
1860         <para>Predefined entity references (<![CDATA[&lt;,&gt;,&amp;,&apos;,&quot;]]>)
1861            and numeric character references (&amp;#nnnn;, &amp;#xhhhh;) must each be replaced by
1862            a single character [<xref linkend="XML10"/>]. Using a strategy analogous to that used
1863            for comments, processing instructions and CDATA sections, marking the union of all
1864            reference byte position extents in bit space, with the exception of the final bit
1865            position of each reference, defines the deletion mask stream for predefined
1866         entities.</para>
1867      </section>
1868
1869      <section>
1870         <title>Parallel Parsing with Bit Stream Addition Streams</title>
1871         <para>Whereas sequential bit scans over lexical item streams form the basis of XML parsing
1872            in the current Parabix parser, a new method of parallel parsing has been developed and
1873            prototyped using the concept of bitstream addition. Fundamental to this method is the
1874            concept of a <emphasis>cursor</emphasis> stream, a bit stream marking the positions of
1875            multiple parallel parses currently in process. </para>
1876         <para>The results of parsing using the bit stream addition technique are produced using a
1877            series of <emphasis>call-out</emphasis> bit streams. These streams mark the beginning
1878            and end of each start tag, end tag and empty tag. Within tags, additional streams exist
1879            to mark start and end positions for tag names, attribute names and attribute values. An
1880            error flag stream marks the positions of any syntactic errors encountered during
1881            parsing.</para>
1882         <para>
1883            <table>
1884               <caption>
1885                  <para>Parallel Parsing Call Out Streams</para>
1886               </caption>
1887               <colgroup>
1888                  <col align="left" valign="top"/>
1889               </colgroup>
1890               <tbody>
1891                  <tr valign="top">
1892                     <td>Input Data</td>
1893                     <td>
1894                        <code>&lt;first
1895                           att1=&quot;val1&quot;&gt;&lt;second/&gt;&lt;third
1896                           wrong=value&gt;some
1897                        text&lt;/third&gt;&lt;/first/&gt;</code>
1898                     </td>
1899                  </tr>
1900                  <tr valign="top">
1901                     <td>ElemNamePositions</td>
1902                     <td>
1903                        <code>-1------------------1--------1-------------------------------------------</code>
1904                     </td>
1905                  </tr>
1906                  <tr valign="top">
1907                     <td>ElemNameFollows</td>
1908                     <td>
1909                        <code>------1-------------------1-------1--------------------------------------</code>
1910                     </td>
1911                  </tr>
1912                  <tr valign="top">
1913                     <td>STagEnds</td>
1914                     <td>
1915                        <code>------------------1------------------------------------------------------</code>
1916                     </td>
1917                  </tr>
1918                  <tr valign="top">
1919                     <td>EmptyTagEnds</td>
1920                     <td>
1921                        <code>---------------------------1---------------------------------------------</code>
1922                     </td>
1923                  </tr>
1924                  <tr valign="top">
1925                     <td>ParseError</td>
1926                     <td>
1927                        <code>-----------------------------------------1-----------------------------1-</code>
1928                     </td>
1929                  </tr>
1930                  <tr valign="top">
1931                     <td>AttNameStarts</td>
1932                     <td>
1933                        <code>-------1---------------------------1-------------------------------------</code>
1934                     </td>
1935                  </tr>
1936                  <tr valign="top">
1937                     <td>AttNameFollows</td>
1938                     <td>
1939                        <code>-----------1----------------------------1--------------------------------</code>
1940                     </td>
1941                  </tr>
1942                  <tr valign="top">
1943                     <td>AttValStarts</td>
1944                     <td>
1945                        <code>------------1----------------------------1-------------------------------</code>
1946                     </td>
1947                  </tr>
1948                  <tr valign="top">
1949                     <td>AttValEnds</td>
1950                     <td>
1951                        <code>-----------------1-------------------------------------------------------</code>
1952                     </td>
1953                  </tr>
1954                  <tr valign="top">
1955                     <td>EndTagSeconds</td>
1956                     <td>
1957                        <code>---------------------------------------------------------1-------1-------</code>
1958                     </td>
1959                  </tr>
1960                  <tr valign="top">
1961                     <td>EndTagEnds</td>
1962                     <td>
1963                        <code>---------------------------------------------------------------1-------1-</code>
1964                     </td>
1965                  </tr>
1966               </tbody>
1967            </table>
1968         </para>
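         <para>To illustrate how addition can advance many cursors at once, the following Python
            sketch derives three of the call-out streams in the table above. It assumes one
            commonly described formulation, in which a cursor stream is added to a character class
            stream and the class bits are then masked off, so that the carry moves each cursor to the
            first position after its run. That formula, the helper names char_class, scan_thru and
            positions, and the use of alphanumerics as a stand-in for XML name characters are all
            assumptions of this sketch rather than details given in the text.</para>
         <programlisting><![CDATA[
```python
def char_class(text, pred):
    """Return an integer whose bit i is 1 when pred(text[i]) holds."""
    bits = 0
    for i, c in enumerate(text):
        if pred(c):
            bits |= 1 << i
    return bits


def scan_thru(cursors, class_stream):
    """Advance each cursor past the run of class characters it sits on.

    Adding a cursor bit to a run of 1 bits ripples a carry to the first
    position after the run; masking with ~class_stream removes the runs
    that no cursor touched.
    """
    return (cursors + class_stream) & ~class_stream


def positions(bits):
    """List the positions of the 1 bits, lowest position first."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


text = '<first att1="val1"><second/><third wrong=value>some text</third></first/>'
lt = char_class(text, lambda c: c == "<")
slash = char_class(text, lambda c: c == "/")
name_char = char_class(text, str.isalnum)  # simplified name character class

# One cursor per tag, placed just after each "<".
after_lt = lt << 1
elem_name_positions = after_lt & ~slash   # start tags and empty tags
end_tag_seconds = after_lt & slash        # the "/" of each end tag

# A single addition moves all three element-name cursors in parallel.
elem_name_follows = scan_thru(elem_name_positions, name_char)

print(positions(elem_name_positions))
print(positions(elem_name_follows))
print(positions(end_tag_seconds))
```
]]></programlisting>
         <para>Because bit 0 is taken as the first input position, carries propagate toward later
            positions, which is what lets one addition stand in for several sequential scans.</para>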
1969
1970      </section>
1971
1972   </section>
1973   <section>
1974      <title>SIMD Beyond Bitstreams: Names and Numbers</title>
1975
1976      <para>Whereas the fundamental innovation of our work is the use of SIMD technology in
1977         implementing parallel bit streams for XML, there are also important ways in which more
1978         traditional byte-oriented SIMD operations can be useful in accelerating other aspects of
1979         XML processing.</para>
1980
1981      <section>
1982         <title>Name Lookup</title>
1983         <para>Efficient symbol table mechanisms for looking up element and attribute names are
1984            important for almost all XML processing applications. It is also an important technique
1985            merely for assessing well-formedness of an XML document; rather than validating the
1986            character-by-character composition of each occurrence of an XML name as it is
1987            encountered, it is more efficient to validate all but the first occurrence by first
1988            determining whether the name already exists in a table of prevalidated names.</para>
1989
1990         <para>The first symbol table mechanism deployed in the Parabix parser simply used the
1991            hashmaps of the C++ standard template library, without deploying any SIMD technology.
1992            However, with the overhead of character validation, transcoding and parsing dramatically
1993            reduced by parallel bit stream technology, we found that symbol lookups then accounted
1994            for about half of the remaining execution time in a statistics gathering application
1995               [<xref linkend="CASCON08"/>]. Thus, symbol table processing was identified as a major
1996            target for further performance improvement. </para>
1997         <para> Our first effort to improve symbol table performance was to employ the splash tables
1998            with cuckoo hashing as described by Ross [<xref linkend="Ross06"/>], using SIMD
1999            technology for parallel bucket processing. Although this technique did turn out to have
2000            the advantage of virtually constant-time performance even for very large vocabularies,
2001            it was not particularly helpful for the relatively small vocabularies typically found in
2002            XML document processing. </para>
2003         <para> However, a second approach has been found to be quite useful, taking advantage of
2004            parallel bit streams for cheap determination of symbol length. In essence, the length of
2005            a name can be determined very cheaply using a single bit scan operation. This then makes
2006            it possible to use length-sorted symbol table processing, as follows. First, the
2007            occurrences of all names are stored in arrays indexed by length. Then the length-sorted
2008            arrays may each be inserted into the symbol table in turn. The advantage of this is that
2009            a separate loop may be written for each length. Length sorting makes for very efficient
2010            name processing. For example, hash value computations and name comparisons can be made by
2011            loading multibyte values and performing appropriate shifting and masking operations,
2012            without the need for a byte-at-a-time loop. In initial experiments, this length-sorting
2013            approach was found to reduce symbol lookup cost by a factor of two. </para>
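         <para>The length-sorting idea can be sketched as follows. This is an illustrative Python
            model rather than the Parabix implementation: unbounded Python integers stand in for
            bit streams, with bit i corresponding to input position i, and the function names
            bit_scan, bucket_by_length and intern_names are hypothetical. The sketch assumes that
            every name start marker is followed by a matching end marker.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
from collections import defaultdict


def bit_scan(stream, pos):
    """Return the position of the first 1 bit at or after pos, or -1 if none."""
    rest = stream >> pos
    if rest == 0:
        return -1
    return pos + (rest & -rest).bit_length() - 1


def bucket_by_length(starts, ends):
    """Group name start positions by name length, using bit scans only.

    starts has a 1 bit at the first character of each name; ends has a
    1 bit at the position immediately following each name (an assumption
    of this sketch).
    """
    buckets = defaultdict(list)
    pos = bit_scan(starts, 0)
    while pos != -1:
        end = bit_scan(ends, pos + 1)
        buckets[end - pos].append(pos)
        pos = bit_scan(starts, end)
    return dict(buckets)


def intern_names(text, buckets):
    """Insert names one length class at a time.

    Within a length class every key is a fixed-width integer, so a real
    implementation could load, hash and compare names as multibyte values.
    """
    table = {}
    for length in sorted(buckets):
        for pos in buckets[length]:
            key = (length, int.from_bytes(text[pos:pos + length], "little"))
            table.setdefault(key, len(table))
    return table
```
]]></programlisting>
         <para>For the input &lt;a&gt;&lt;bc&gt;&lt;a&gt;&lt;de&gt;, with start markers at
            positions 1, 4, 8 and 11 and end markers at positions 2, 6, 9 and 13, the sketch
            places two occurrences in the length-1 bucket and two in the length-2 bucket, and
            interns three distinct names.</para>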
2014         <para> Current research includes the application of SIMD technology to further enhance the
2015            performance of length-sorted lookup. We have identified a promising technique for
2016            parallel processing of multiple name occurrences using a parallel trie lookup technique.
2017            Given an array of occurrences of names of a particular length, the first one, two or
2018            four bytes of each name are gathered and stored in a linear array. SIMD techniques are
2019            then used to compare these prefixes with the possible prefixes for the current position
2020            within the trie. In general, a very small number of possibilities exist for each trie
2021            node, allowing for fast linear search through all possibilities. Typically, the
2022            parallelism is expected to exceed the number of possibilities to search through at each
2023            node. With length-sorting to separate the top-level trie into many small subtries, we
2024            expect only a single step of symbol lookup to be needed in most practical instances. </para>
2025
2026         <para>The gather step of this algorithm is actually a common technique in SIMD processing.
2027            Instruction set support for gather operations is a likely future direction for SIMD
2028            technology.</para>
2029      </section>
2030
2031      <section>
2032         <title>Numeric Processing</title>
2033         <para> Many XML applications involve numeric data fields as attribute values or element
2034            content. Although most current XML APIs uniformly return information to applications in
2035            the form of character strings, it is reasonable to consider direct API support for
2036            numeric conversions within a high-performance XML engine. With string to numeric
2037            conversion such a common need, why leave it to application programmers? </para>
2038         <para> High-performance string to numeric conversion using SIMD operations can also
2039            considerably outperform the byte-at-a-time loops that most application programmers or
2040            libraries might employ. A first step is reduction of ASCII bytes to corresponding
2041            decimal nybbles using a SIMD packing operation. Then an inductive doubling algorithm
2042            using SIMD operations may be employed. First, 16 sets of adjacent nybble values in the
2043            range 0-9 can be combined in just a few SIMD operations to 16 byte values in the range
2044            0-99. Then 8 sets of byte values may similarly be combined with further SIMD processing
2045            to produce doublebyte values in the range 0-9999. Further combination of doublebyte
2046            values into 32-bit integers and so on can also be performed using SIMD operations. </para>
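         <para>The inductive doubling steps can be illustrated with a scaled-down model. The
            following Python sketch is an assumption-laden illustration, not the SIMD code itself:
            it works on eight ASCII digits held in a single 64-bit word, keeps one digit per byte
            instead of packing nybbles, and uses ordinary integer arithmetic with masks in place
            of SIMD field operations. The function name parse8 is hypothetical.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def parse8(digits):
    """Convert exactly eight ASCII decimal digits to an integer by inductive doubling."""
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError("expected exactly eight ASCII digits")
    # Little-endian load: the first (most significant) digit lands in the low byte.
    word = int.from_bytes(digits, "little")
    # Step 0: reduce each ASCII byte 0x30..0x39 to its decimal value 0..9.
    word &= 0x0F0F0F0F0F0F0F0F
    # Step 1: combine adjacent bytes into four 16-bit fields in the range 0..99.
    word = (word & 0x00FF00FF00FF00FF) * 10 + ((word >> 8) & 0x00FF00FF00FF00FF)
    # Step 2: combine adjacent 16-bit fields into two 32-bit fields in the range 0..9999.
    word = (word & 0x0000FFFF0000FFFF) * 100 + ((word >> 16) & 0x0000FFFF0000FFFF)
    # Step 3: combine the two 32-bit fields into the final value.
    return (word & 0xFFFFFFFF) * 10000 + (word >> 32)
```
]]></programlisting>
         <para>Each step doubles the field width and squares the radix (10, 100, 10000), so eight
            digits are converted in three combining steps rather than eight loop
            iterations.</para>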
2047         <para> Using appropriate gather operations to bring numeric strings into appropriate array
2048            structures, an XML engine could offer high-performance numeric conversion services to
2049            XML application programmers. We expect this to be an important direction for our future
2050            work, particularly in support of APIs that focus on direct conversion of XML data into
2051            business objects. </para>
2052
2053      </section>
2054   </section>
2055
2056   <section>
2057      <title>APIs and Parallel Bit Streams</title>
2058
2059      <section>
2060         <title>The ILAX Streaming API</title>
2061         <para>The In-Line API for XML (ILAX) is the base API provided with the Parabix parser. It
2062            is intended for low-level extensions compiled right into the engine, with minimum
2063            possible overhead. It is similar to streaming event-based APIs such as SAX, but
2064            implemented by inline substitution rather than using callbacks. In essence, an extension
2065            programmer provides method bodies for event-processing methods declared internal to the
2066            Parabix parsing engine, compiling the event processing code directly with the core code
2067            of the engine. </para>
2068         <para> Although ILAX can be used directly for application programming, its primary use is
2069            for implementing engine extensions that support higher-level APIs. For example, C or
2070            C++ streaming APIs based on the Expat [<xref linkend="Expat"/>] or general SAX models
2071            can be implemented quite directly. C/C++ DOM or other tree-based APIs can also be
2072            implemented fairly directly. However, delivering
2073            Parabix performance to Java-based XML applications is challenging due to the
2074            considerable overhead of crossing the Java Native Interface (JNI) boundary. This issue
2075            is addressed with the Array Set Model (ASM) concept discussed in the following section. </para>
2076         <para> With the recent development of parallel parsing using bitstream addition, it is
2077            likely that the underlying ILAX interface of Parabix will change. In essence, ILAX
2078            suffers the drawback of all event-based interfaces: they are fundamentally sequential in
2079            nature. As research continues, we expect efficient parallel methods building on parallel
2080            bit stream foundations to move up the stack of XML processing requirements. Artificially
2081            imposing sequential processing is thus expected to constrain further advances in XML
2082            performance. </para>
2083      </section>
2084
2085      <section>
2086         <title>Efficient XML in Java Using Array Set Models</title>
2087         <para> In our GML-to-SVG case study, we identified the lack of high-performance XML
2088            processing solutions for Java to be of particular interest. Java byte code does not
2089            provide access to the SIMD capabilities of the underlying machine architecture. Java
2090            just-in-time compilers might be capable of using some SIMD facilities, but there is no
2091            real prospect of conventional compiler technology translating byte-at-a-time algorithms
2092            into parallel bit stream code. So the primary vehicle for delivering high-performance
2093            XML processing is to call native parallel bit stream code written in C through JNI
2094            capabilities. </para>
2095         <para>However, each JNI call is expensive, so it is desirable to minimize the number of
2096            calls and get as much work done during each call as possible. This militates against
2097            direct implementation of streaming APIs in Java through one-to-one mappings to an
2098            underlying streaming API in C. Instead, we have concentrated on gathering information on
2099            the C side into data structures that can then be passed to the Java side. However, using
2100            either C pointer-based structures or C++ objects is problematic because these are
2101            difficult to interpret on the Java side and are not amenable to Java's automatic storage
2102            management system. Similarly, Java objects cannot be conveniently created on the C side.
2103            However, it is possible to transfer arrays of simple data values (bytes or integers)
2104            between C and Java, so that makes a reasonable focus for bulk data communication between
2105            C and Java. </para>
2106         <para><emphasis>Array Set Models</emphasis> are array-based representations of information
2107            representing an XML document in accord with XML InfoSet [<xref linkend="InfoSet"/>] or
2108            other XML data models relevant to particular APIs. As well as providing a mechanism for
2109            efficient bulk data communication across the JNI boundary, ASMs potentially have a
2110            number of other benefits in high-performance XML processing. <itemizedlist>
2111               <listitem>
2112                  <para>Prefetching. Commodity processors commonly support hardware and/or software
2113                     prefetching to ensure that data is available in a processor cache when it is
2114                     needed. In general, prefetching is most effective in conjunction with the
2115                     continuous sequential memory access patterns associated with array
2116                  processing.</para>
2117               </listitem>
2118               <listitem>
2119                  <para>DMA. Some processing environments provide Direct Memory Access (DMA)
2120                     controllers for block data movement in parallel with computation. For example,
2121                     the Cell Broadband Engine uses DMA controllers to move the data to and from the
2122                     local stores of the synergistic processing units. Arrays of contiguous data
2123                     elements are well suited to bulk data movement using DMA.</para>
2124               </listitem>
2125               <listitem>
2126                  <para>SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern
2127                     processor instruction sets allow simultaneous application of particular
2128                     instructions to sets of elements from parallel arrays. For effective use of
2129                     SIMD capabilities, an SoA (Structure of Arrays) model is preferable to an AoS
2130                     (Array of Structures) model. </para>
2131               </listitem>
2132               <listitem>
2133                  <para>Multicore processors. Array-oriented processing can enable the effective
2134                     distribution of work to the individual cores of a multicore system in two
2135                     distinct ways. First, provided that sequential dependencies can be minimized or
2136                     eliminated, large arrays can be divided into separate segments to be processed
2137                     in parallel on each core. Second, pipeline parallelism can be used to implement
2138                     efficient multipass processing with each pass consisting of a processing kernel
2139                     with array-based input and array-based output. </para>
2140               </listitem>
2141               <listitem>
2142                  <para>Streaming buffers for large XML documents. In the event that an XML document
2143                     is larger than can be reasonably represented entirely within processor memory,
2144                     a buffer-based streaming model can be applied to work through a document using
2145                     sliding windows over arrays of elements stored in document order. </para>
2146               </listitem>
2147
2148            </itemizedlist>
2149         </para>
2150
2151         <section>
2152            <title>Saxon-B TinyTree Example</title>
2153            <para>As a first example of the ASM concept, current work includes a proof-of-concept to
2154               deliver a high-performance replacement for building the TinyTree data structure used
2155               in Saxon-B 6.5.5, an open-source XSLT 2.0 processor written in Java [<xref
2156                  linkend="Saxon"/>]. Although XSLT stylesheets may be cached for performance, the
2157               caching of source XML documents is typically not possible. A new TinyTree object to
2158               represent the XML source document is thus commonly constructed with each new query so
2159               that the overall performance of simple queries on large source XML documents is
2160               highly dependent on TinyTree build time. Indeed, in a study of Saxon-SA, the
2161               commercial version of Saxon, query time was shown to be dominated by TinyTree build
2162               time [<xref linkend="Kay08"/>]. Similar performance results are demonstrable for the
2163               Saxon-B XSLT processor as well. </para>
2164            <para> The Saxon-B processor studied is a pure Java solution, converting a SAX (Simple
2165               API for XML) event stream into the TinyTree Java object using the efficient Aelfred
2166               XML parser [<xref linkend="AElfred"/>]. The TinyTree structure is itself an
2167               array-based structure well suited to the ASM concept. It consists of six
2168               parallel arrays of integers indexed on node number and containing one entry for each
2169               node in the source document, with the exception of attribute and namespace nodes
2170                  [<xref linkend="Saxon"/>]. Four of the arrays respectively provide node kind, name
2171               code, depth, and next sibling information for each node, while the two others are
2172               overloaded for different purposes based on node kind value. For example, in the
2173               context of a text node, one of the overloaded arrays holds the text buffer offset
2174               value whereas the other holds the text buffer length value. Attributes and namespaces
2175               are represented using similar parallel arrays of values. The stored TinyTree values
2176               are primarily primitive Java types; however, object types such as Java Strings and
2177               Java StringBuffers are also used to hold attribute values and comment values
2178               respectively. </para>
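            <para>The parallel-array layout described above can be sketched in a few lines. The
               following Python model is a simplified illustration only: it does not reproduce
               Saxon's actual code, node-kind constants or sibling conventions, and it ignores
               attributes and namespaces. The event format, the constants ELEMENT and TEXT, and
               the function name build_arrays are all hypothetical.</para>
            <programlisting xml:space="preserve"><![CDATA[
```python
ELEMENT, TEXT = 1, 3  # hypothetical node-kind codes


def build_arrays(events):
    """Build six parallel arrays, indexed by node number, from a simple event list.

    Events are ("start", name), ("text", string) or ("end",).
    alpha and beta are the two overloaded arrays: for text nodes they hold
    the text buffer offset and length; for elements they are unused here.
    """
    kind, name_code, depth, next_sib, alpha, beta = [], [], [], [], [], []
    names, chars = {}, []
    prev = [-1]  # prev[-1] is the previous sibling at the current depth, or -1
    for event in events:
        if event[0] == "end":
            prev.pop()
            continue
        node = len(kind)
        if prev[-1] != -1:
            next_sib[prev[-1]] = node  # link the previous sibling to this node
        prev[-1] = node
        depth.append(len(prev) - 1)
        next_sib.append(-1)
        if event[0] == "start":
            kind.append(ELEMENT)
            name_code.append(names.setdefault(event[1], len(names)))
            alpha.append(-1)
            beta.append(-1)
            prev.append(-1)  # open a new sibling level for the children
        else:
            kind.append(TEXT)
            name_code.append(-1)
            alpha.append(len(chars))
            beta.append(len(event[1]))
            chars.extend(event[1])
    return {"kind": kind, "name_code": name_code, "depth": depth,
            "next": next_sib, "alpha": alpha, "beta": beta,
            "text": "".join(chars)}
```
]]></programlisting>
            <para>Because every array holds only integers, a structure of this shape can be filled
               on the native side and handed across the JNI boundary as a small number of bulk
               array transfers.</para>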
2179            <para> In addition to the TinyTree object, Saxon-B maintains a NamePool object which
2180               represents a collection of XML name triplets. Each triplet is composed of a Namespace
2181               URI, a Namespace prefix and a local name and encoded as an integer value known as a
2182               namecode. Namecodes permit efficient name search and look-up using integer
2183               comparison. Namecodes may also be subsequently decoded to recover namespace and local
2184               name information. </para>
2185            <para> Using the Parabix ILAX interface, a high-performance reimplementation of TinyTree
2186               and NamePool data structures was built to compare with the Saxon-B implementation. In
2187               fact, two functionally equivalent versions of the ASM Java class were constructed. An
2188               initial version was based on a set of primitive Java arrays created and allocated
2189               in the Java heap space via JNI New&lt;PrimitiveType&gt;Array method
2190               calls. In this version, the JVM garbage collector is aware of all memory
2191               allocated in the native code. However, in this approach, large array copy operations
2192               limited overall performance to approximately a 2X gain over the Saxon-B build time. </para>
2193            <para>To further address the performance penalty imposed by copying large array values,
2194               a second version of the ASM Java object was constructed based on natively backed
2195               Direct Memory Byte Buffers [<xref linkend="JNI"/>]. In this version the JVM garbage
2196               collector is unaware of any native memory resources backing the Direct Memory Byte
2197               Buffers. Large JNI-based copy operations are avoided; however, system memory must be
2198               explicitly deallocated via a Java native method call. Using this approach, our
2199               preliminary results show an approximate total 2.5X gain over Saxon-B build time.
2200            </para>
2201         </section>
2202      </section>
2203   </section>
2204
2205
2206   <section>
2207      <title>Compiler Technology</title>
2208
2209      <para> An important focus of our recent work is on the development of compiler technology to
2210         automatically generate the low-level SIMD code necessary to implement bit stream processing
2211         given suitable high-level specifications. This has several potential benefits. First, it
2212      <para>Beyond these reasons, compiler technology also offers the opportunity for retargeting
2213         register-at-a-time SIMD operations. Second, compilation technology can automatically employ
2214         a variety of performance improvement techniques that are difficult to apply manually. These
2215         include algorithms for instruction scheduling and register allocation as well as
2216         optimization techniques for common subexpression elimination and register
2217         rematerialization among others. Third, compiler technology makes it easier to make changes
2218         to the low-level code for reasons of perfective or adaptive maintenance.</para>
2219
2220      <para>Beyond these reasons, compiler technology also offers the opportunity for retargetting
2221         the generation of code to accommodate different processor architectures and API
2222         requirements. Strategies for efficient parallel bit stream code can vary considerably
2223         depending on processor resources such as the number of registers available, the particular
2224         instruction set architecture supported, the size of L1 and L2 data caches, the number of
2225         available cores and so on. Separate implementation of custom code for each processor
2226         architecture would thus be likely to be prohibitively expensive, prone to errors and
2227         inconsistencies and difficult to maintain. Using compilation technology, however, the idea
2228         would be to implement a variety of processor-specific back-ends all using a common front
2229         end based on parallel bit streams. </para>
2230
2231      <section>
2232         <title>Character Class Compiler</title>
2233
2234         <para>The first compiler component that we have implemented is a character class compiler,
2235            capable of generating all the bit stream logic necessary to produce a set of lexical
2236            item streams each corresponding to some particular set of characters to be recognized.
2237            By taking advantage of common patterns between characters within classes, and special
2238            optimization logic for recognizing character-class ranges, our existing compiler is able
2239            to generate well-optimized code for complex sets of character classes involving numbers
2240            of special characters as well as characters within specific sets of ranges. </para>
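         <para>The kind of logic such a compiler emits can be illustrated by hand. The following
            Python sketch is not output of the character class compiler; it is a manually written
            example under stated assumptions: bit streams are modeled as unbounded Python
            integers, basis stream k holds bit k of every input byte with k = 0 as the least
            significant bit, and the function names transpose and classes are hypothetical. It
            computes streams for the digit range 0-9 and for the single character &lt;, sharing
            the high-nybble test between the two classes.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
def transpose(data):
    """Return eight basis streams; bit i of basis[k] equals bit k of data[i]."""
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                basis[k] |= 1 << i
    return basis


def classes(basis, length):
    """Compute character class streams with bitwise logic only."""
    b = basis
    mask = (1 << length) - 1  # confine results to the input length
    # Shared subexpression: high nybble equals 0011 (0x30..0x3F).
    hi_is_3 = ~b[7] & ~b[6] & b[5] & b[4]
    # Range test for the low nybble: 0..7, or 8..9.
    lo_le_9 = ~b[3] | (~b[2] & ~b[1])
    digit = hi_is_3 & lo_le_9 & mask
    # The single character 0x3C: low nybble equals 1100.
    lt = hi_is_3 & b[3] & b[2] & ~b[1] & ~b[0] & mask
    return {"digit": digit, "lt": lt}
```
]]></programlisting>
         <para>Every operation applies to all input positions at once, so with 128-bit registers
            each bitwise operation classifies 128 characters.</para>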
2241
2242      </section>
2243      <section>
2244         <title>Regular Expression Compilation</title>
2245
2246         <para>Based on the character class compiler, we are currently investigating the
2247            construction of a regular expression compiler that can implement bit-stream based
2248            parallel regular-expression matching similar to that described previously for parallel
2249            parsing by bitstream addition. This compiler works with the assumption that bitstream
2250            regular-expression definitions are deterministic; no backtracking is permitted with the
2251            parallel bit stream representation. In XML applications, this compiler is primarily
2252            intended to enforce regular-expression constraints on string datatype specifications
2253            found in XML schema. </para>
2254
2255      </section>
2256
2257      <section>
2258         <title>Unbounded Bit Stream Compilation</title>
2259
2260         <para>The Catalog of XML Bit Streams presented earlier consists of a set of abstract,
2261            unbounded bit streams, each in one-to-one correspondence with input bytes of a text
2262            file. Determining how these bit streams are implemented using fixed-width SIMD
2263            registers, and possibly processed in fixed-length buffers that represent some multiple
2264            of the register width is a source of considerable programming complexity. The general
2265            goal of our compilation strategy in this case is to allow operations to be programmed in
2266            terms of unbounded bit streams and then automatically reduced to efficient low-level
2267            code with the application of a systematic code generation strategy for handling block
2268            and buffer boundary crossing. This is work currently in progress. </para>
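         <para>The boundary-crossing problem can be made concrete with a small model. In the
            following Python sketch, unbounded integers play the role of abstract bit streams and
            a list of 128-bit values plays the role of register-sized blocks. It is an
            illustration under assumptions, not the code generation strategy itself: the block
            width of 128, the helper names split and join, and the scan operation written as
            (markers + run) AND NOT run are choices made for this example.</para>
         <programlisting xml:space="preserve"><![CDATA[
```python
BLOCK = 128                # assumed register width in bits
MASK = (1 << BLOCK) - 1


def split(stream, nblocks):
    """Cut an unbounded stream into nblocks fixed-width blocks, lowest first."""
    return [(stream >> (BLOCK * i)) & MASK for i in range(nblocks)]


def join(blocks):
    """Reassemble fixed-width blocks into one unbounded stream."""
    return sum(block << (BLOCK * i) for i, block in enumerate(blocks))


def scan_thru_unbounded(markers, run):
    """Advance each marker through the run of 1 bits it sits on."""
    return (markers + run) & ~run


def scan_thru_blocks(marker_blocks, run_blocks):
    """Blockwise version: the carry out of one block feeds the next block."""
    out, carry = [], 0
    for m, r in zip(marker_blocks, run_blocks):
        total = m + r + carry
        carry = total >> BLOCK          # carry across the block boundary
        out.append(total & MASK & ~r)
    return out, carry
```
]]></programlisting>
         <para>The unbounded version is a single expression, whereas the blockwise version must
            thread an explicit carry between blocks. Generating that bookkeeping automatically is
            the purpose of the compilation strategy described above.</para>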
2269
2270      </section>
2271   </section>
2272
2273   <section>
2274      <title>Conclusion</title>
2275      <para>Parallel bit stream technology offers the opportunity to dramatically speed up the core
2276         XML processing components used to implement virtually any XML API. Character validation and
2277         transcoding, whitespace processing, and parsing up to and including the full validation of tag
2278         syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the
2279         positions of all element names, attribute names and attribute values can also be produced,
2280         followed by fast bit scan operations to generate position and length values. Beyond bit
2281         streams, byte-oriented SIMD processing of names and numerals can also accelerate
2282         performance beyond sequential byte-at-a-time methods. </para>
2283      <para>Advances in processor architecture are likely to further amplify the performance of
2284         parallel bit stream technology over traditional byte-at-a-time processing over the next
2285         decade. Improvements to SIMD register width, register complement and operation format can
2286         all result in further gains. New SIMD instruction set features such as inductive doubling
2287         support, parallel extract and deposit instructions, bit interleaving and scatter/gather
2288         capabilities should also result in significant speed-ups. Leveraging the intraregister
2289         parallelism of parallel bit stream technology within SIMD registers to take advantage of intrachip
2290         parallelism on multicore processors should accelerate processing further. </para>
2291      <para>Technology transfer using a patent-based open-source business model is a further goal of
2292         our work with a view to widespread deployment of parallel bit stream technology in XML
2293         processing stacks implementing a variety of APIs. The feasibility of substantial
2294         performance improvement in replacement of technology implementing existing APIs has been
2295         demonstrated even in complex software architectures involving delivery of performance
2296         benefits across the JNI boundary. We are seeking to accelerate these deployment efforts
2297         both by developing compiler technology to reliably apply these methods to a variety of
2298         architectures and by identifying interested collaborators using open-source or
2299         commercial models. </para>
2300   </section>
2301
2302   <section>
2303      <title>Acknowledgments</title>
2304      <para>This work is supported in part by research grants and scholarships from the Natural
2305         Sciences and Engineering Research Council of Canada, the Mathematics of Information
2306         Technology and Complex Systems Network and the British Columbia Innovation Council. </para>
2307      <para>We thank our colleague Dan Lin (Linda) for her work in high-performance symbol table
2308         processing. </para>
2309   </section>
2310
2311   <bibliography>
2312      <title>Bibliography</title>
2313      <bibliomixed xml:id="XMLChip09" xreflabel="Leventhal and Lemoine 2009">Leventhal, Michael and
2314         Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on
2315         Processing XML Efficiently 2009, Montréal.</bibliomixed>
2316      <bibliomixed xml:id="Datapower09" xreflabel="Salz, Achilles and Maze 2009">Salz, Richard,
2317         Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM
2318         DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML
2319         Efficiently 2009, Montréal.</bibliomixed>
2320      <bibliomixed xml:id="PPoPP08" xreflabel="Cameron 2007">Cameron, Robert D. 2007. A Case Study
2321         in SIMD Text Processing with Parallel Bit Streams: UTF-8 to UTF-16 Transcoding. Proceedings
2322         of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt
2323         Lake City, Utah. On the Web at <link>http://research.ihost.com/ppopp08/</link>.</bibliomixed>
2324      <bibliomixed xml:id="CASCON08" xreflabel="Cameron, Herdy and Lin 2008">Cameron, Robert D.,
2325         Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream
2326         Technology. Proceedings of CASCON 2008,
2327         Toronto.</bibliomixed>
2328      <bibliomixed xml:id="SVGOpen08" xreflabel="Herdy, Burggraf and Cameron 2008">Herdy, Kenneth
2329         S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG
2330         Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems.
2331         Proceedings of SVG Open 6th International Conference on Scalable Vector Graphics,
2332         Nuremberg. On the Web at
2333            <link>http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/</link>.</bibliomixed>
2334      <bibliomixed xml:id="Ross06" xreflabel="Ross 2006">Ross, Kenneth A. 2006. Efficient hash
2335         probes on modern processors. Proceedings of ICDE 2006, Atlanta. On the Web at
2336            <link>www.cs.columbia.edu/~kar/pubsk/icde2007.pdf</link>.</bibliomixed>
2337      <bibliomixed xml:id="ASPLOS09" xreflabel="Cameron and Lin 2009">Cameron, Robert D. and Dan
2338         Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The
2339         Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.</bibliomixed>
2340      <bibliomixed xml:id="Wu08" xreflabel="Wu et al. 2008">Wu, Yu, Qi Zhang, Zhiqiang Yu and
2341         Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation.
2342         Proceedings of Balisage 2008, Montréal. On the Web at
2343            <link>http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html</link>.</bibliomixed>
2344      <bibliomixed xml:id="u8u16" xreflabel="Cameron 2008">u8u16 - A High-Speed UTF-8 to UTF-16
2345         Transcoder Using Parallel Bit Streams Technical Report 2007-18. 2007. School of Computing
2346         Science Simon Fraser University, June 21 2007.</bibliomixed>
2347      <bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth
2348         Edition) W3C Recommendation 26 November 2008. On the Web at
2349            <link>http://www.w3.org/TR/REC-xml/</link>.</bibliomixed>
2350      <bibliomixed xml:id="Unicode" xreflabel="Unicode">The Unicode Consortium. 2009. On the Web at
2351            <link>http://unicode.org/</link>.</bibliomixed>
2352      <bibliomixed xml:id="Pex06" xreflabel="Hilewitz and Lee 2006"> Hilewitz, Y. and Ruby B. Lee.
2353         2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit
2354         Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific
2355         Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.</bibliomixed>
2356      <bibliomixed xml:id="InfoSet" xreflabel="XML Infoset">XML Information Set (Second Edition) W3C
2357         Recommendation 4 February 2004. On the Web at
2358         <link>http://www.w3.org/TR/xml-infoset/</link>.</bibliomixed>
2359      <bibliomixed xml:id="Saxon" xreflabel="Saxon">SAXON The XSLT and XQuery Processor. On the Web
2360         at <link>http://saxon.sourceforge.net/</link>.</bibliomixed>
2361      <bibliomixed xml:id="Kay08" xreflabel="Kay 2008"> Kay, Michael Y. 2008. Ten Reasons Why Saxon
2362         XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.</bibliomixed>
2363      <bibliomixed xml:id="AElfred" xreflabel="Ælfred"> The Ælfred XML Parser. On the Web at
2364            <link>http://saxon.sourceforge.net/aelfred.html</link>.</bibliomixed>
2365      <bibliomixed xml:id="JNI" xreflabel="Hitchens 2002">Hitchens, Ron. Java NIO. O'Reilly, 2002.</bibliomixed>
2366      <bibliomixed xml:id="Expat" xreflabel="Expat">The Expat XML Parser.
2367            <link>http://expat.sourceforge.net/</link>.</bibliomixed>
2368   </bibliography>
2369
2370</article>