Changeset 3042 for docs


Ignore:
Timestamp:
Apr 19, 2013, 1:04:40 AM (6 years ago)
Author:
ksherdy
Message:

Added indentation.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/Balisage13/Bal2013came0601/Bal2013came0601.xml

    r3041 r3042  
    33<!DOCTYPE article SYSTEM "balisage-1-3.dtd">
    44<article xmlns="http://docbook.org/ns/docbook" version="5.0-subset Balisage-1.3"
    5   xml:id="HR-23632987-8973">
    6    <title></title>
     5   xml:id="HR-23632987-8973">
     6   <title/>
    77   <info>
    8 <!--
     8      <!--
    99      <confgroup>
    1010         <conftitle>International Symposium on Processing XML Efficiently: Overcoming Limits on
     
    1414-->
    1515      <abstract>
    16          <para>Prior research on the acceleration of XML processing
    17 using SIMD and multi-core parallelism has lead to
    18 a number of interesting research prototypes.  This work
    19 investigates the extent to which the techniques underlying
    20 these prototypes could result in systematic performance
    21 benefits when fully integrated into a commercial XML parser.
    22 The widely used Xerces-C++ parser of the Apache Software
    23 Foundation was chosen as the foundation for the study.
    24 A systematic restructuring of the parser was undertaken,
    25 while maintaining the existing API for application programmers.
    26 Using SIMD techniques alone, an increase in parsing speed
    27 of at least 50% was observed in a range of applications.
    28 When coupled with pipeline parallelism on dual core processors,
    29 improvements of 2x and beyond were realized.
    30 </para>
     16         <para>Prior research on the acceleration of XML processing using SIMD and multi-core
     17            parallelism has lead to a number of interesting research prototypes. This work
     18            investigates the extent to which the techniques underlying these prototypes could result
     19            in systematic performance benefits when fully integrated into a commercial XML parser.
     20            The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the
     21            foundation for the study. A systematic restructuring of the parser was undertaken, while
     22            maintaining the existing API for application programmers. Using SIMD techniques alone,
     23            an increase in parsing speed of at least 50% was observed in a range of applications.
     24            When coupled with pipeline parallelism on dual core processors, improvements of 2x and
     25            beyond were realized. </para>
    3126      </abstract>
    32                         <author>
     27      <author>
    3328         <personname>
    3429            <firstname>Nigel</firstname>
     
    3631         </personname>
    3732         <personblurb>
    38              <para>Nigel Medforth is a M.Sc. student at Simon Fraser University and the lead developer of icXML.
    39              He earned a Bachelor of Technology in Information Technology at Kwantlen Polytechnic University in 2009
    40              and was awarded the Dean’s Medal for Outstanding Achievement.</para>
    41              <para>Nigel is currently researching ways to leverage both the Parabix framework and stream-processing models
    42              to further accelerate XML parsing within icXML.</para>
     33            <para>Nigel Medforth is a M.Sc. student at Simon Fraser University and the lead
     34               developer of icXML. He earned a Bachelor of Technology in Information Technology at
     35               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
     36               Outstanding Achievement.</para>
     37            <para>Nigel is currently researching ways to leverage both the Parabix framework and
     38               stream-processing models to further accelerate XML parsing within icXML.</para>
    4339         </personblurb>
    4440         <affiliation>
    4541            <jobtitle>Developer</jobtitle>
    46              <orgname>International Characters Inc.</orgname>
     42            <orgname>International Characters Inc.</orgname>
    4743         </affiliation>
    4844         <affiliation>
    4945            <jobtitle>Graduate Student, School of Computing Science</jobtitle>
    50              <orgname>Simon Fraser University </orgname>
     46            <orgname>Simon Fraser University </orgname>
    5147         </affiliation>
    5248         <email>nmedfort@sfu.ca</email>
     
    5854         </personname>
    5955         <personblurb>
    60             <para></para>
     56            <para/>
    6157         </personblurb>
    6258         <affiliation>
    63             <jobtitle></jobtitle>
    64             <orgname></orgname>
     59            <jobtitle/>
     60            <orgname/>
    6561         </affiliation>
    66          <email></email>
     62         <email/>
    6763      </author>
    6864      <author>
     
    7571               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
    7672               of Science in Computing Science with a Certificate in Spatial Information Systems at
    77                Simon Fraser University in 2005.
    78                                                 </para>
     73               Simon Fraser University in 2005. </para>
    7974            <para> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
    8075               University with industrial scholarship support from the Natural Sciences and
     
    8277               Complex Systems NCE, and the BC Innovation Council. His research focus is an analysis
    8378               of the principal techniques that may be used to improve XML processing performance in
    84                the context of the Geography Markup Language (GML).
    85                                                 </para>
     79               the context of the Geography Markup Language (GML). </para>
    8680         </personblurb>
    8781         <affiliation>
     
    9185         <email>ksherdy@sfu.ca</email>
    9286      </author>
    93                  <author>
    94           <personname>
    95              <firstname>Rob</firstname>
    96              <surname>Cameron</surname>
    97           </personname>
    98           <personblurb>
    99              <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of
    100              Applied Sciences at Simon Fraser
    101                 University.   His research interests include programming language
    102                 and software system technology, with a specific focus on high performance
    103                 text processing using SIMD and multicore parallelism.  He is the developer
    104                 of the REX XML shallow parser as well as the parallel bit stream (Parabix)
    105                 framework for SIMD text processing. 
    106               </para>
    107           </personblurb>
    108           <affiliation>
    109              <jobtitle>Professor of Computing Science</jobtitle>
    110              <orgname>Simon Fraser University</orgname>
     87      <author>
     88         <personname>
     89            <firstname>Rob</firstname>
     90            <surname>Cameron</surname>
     91         </personname>
     92         <personblurb>
     93            <para>Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
     94               Sciences at Simon Fraser University. His research interests include programming
     95               language and software system technology, with a specific focus on high performance
     96               text processing using SIMD and multicore parallelism. He is the developer of the REX
     97               XML shallow parser as well as the parallel bit stream (Parabix) framework for SIMD
     98               text processing. </para>
     99         </personblurb>
     100         <affiliation>
     101            <jobtitle>Professor of Computing Science</jobtitle>
     102            <orgname>Simon Fraser University</orgname>
    111103         </affiliation>
    112           <affiliation>
    113              <jobtitle>Chief Technology Officer</jobtitle>
    114              <orgname>International Characters, Inc.</orgname>
     104         <affiliation>
     105            <jobtitle>Chief Technology Officer</jobtitle>
     106            <orgname>International Characters, Inc.</orgname>
    115107         </affiliation>
    116108         <email>cameron@cs.sfu.ca</email>
    117        </author>
     109      </author>
    118110      <author>
    119111         <personname>
     
    122114         </personname>
    123115         <personblurb>
    124             <para></para>
     116            <para/>
    125117         </personblurb>
    126118         <affiliation>
    127             <jobtitle></jobtitle>
    128             <orgname></orgname>
     119            <jobtitle/>
     120            <orgname/>
    129121         </affiliation>
    130          <email></email>
     122         <email/>
    131123      </author>
    132 <!--
     124      <!--
    133125      <legalnotice>
    134126         <para>Copyright &#x000A9; 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
     
    144136   <section>
    145137      <title>Introduction</title>
    146       <para></para>
    147       <para></para>
    148       <para></para>
    149       <para></para>
     138      <para/>
     139      <para/>
     140      <para/>
     141      <para/>
    150142   </section>
    151143
     
    154146      <section>
    155147         <title>Xerces C++ Structure</title>
    156 <para>
    157 The Xerces C++ parser
    158 <!-- is a widely-used standards-conformant -->
    159 <!-- XML parser produced as open-source software -->
    160 <!-- by the Apache Software Foundation. -->
    161 <!-- It -->
    162 features comprehensive support for a variety of character encodings
    163 both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
    164 multiple XML vocabularies through the XML namespace
    165 mechanism, as well as complete implementations
    166 of structure and data validation through multiple grammars
    167 declared using either legacy DTDs (document type
    168 definitions) or modern XML Schema facilities.
    169 Xerces also supports several APIs for accessing
    170 parser services, including event-based parsing
    171 using either pull parsing or SAX/SAX2 push-style
    172 parsing as well as a DOM tree-based parsing interface.
    173 </para>
    174 <para>
    175 <!--What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?-->
    176 <!--Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars-->
    177 <!--Should we show a val-grind summary of a few files in a linechart form?-->
    178 Xerces, like all traditional parsers, processes XML documents sequentially a byte-at-a-time from the
    179 first to the last byte of input data. Each byte passes through several processing layers and is
    180 classified and eventually validated within the context of the document state.
    181 This introduces implicit dependencies between the various tasks within the application that make it
    182 difficult to optimize for performance.
    183 As a complex software system, no one feature dominates the overall parsing performance.
    184 Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
    185 Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
    186 parallelization in isolation would only produce a minute improvement in performance.
    187 Unfortunately, early investigation into these functions found
    188 that incorporating speculation-free thread-level parallelization was impossible
    189 and they were already performing well in their given tasks;
    190 thus only trivial enhancements were attainable.
    191 In order to obtain a systematic acceleration of Xerces,
    192 it should be expected that a comprehensive restructuring
    193 is required, involving all aspects of the parser.
    194 </para>
    195 <para>
    196 <!-- In order to obtain systematic acceleration of the Xerces parser,-->
    197 <!-- it should be expected that a comprehensive restructuring-->
    198 <!-- is required, involving all aspects of the parser.-->
    199 <!-- FIGURE
     148         <para> The Xerces C++ parser <!-- is a widely-used standards-conformant -->
     149            <!-- XML parser produced as open-source software -->
     150            <!-- by the Apache Software Foundation. -->
     151            <!-- It --> features comprehensive support for a variety of character encodings both
     152            commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple
     153            XML vocabularies through the XML namespace mechanism, as well as complete
     154            implementations of structure and data validation through multiple grammars declared
     155            using either legacy DTDs (document type definitions) or modern XML Schema facilities.
     156            Xerces also supports several APIs for accessing parser services, including event-based
     157            parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM
     158            tree-based parsing interface. </para>
     159         <para>
     160            <!--What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?-->
     161            <!--Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars-->
     162            <!--Should we show a val-grind summary of a few files in a linechart form?--> Xerces,
     163            like all traditional parsers, processes XML documents sequentially a byte-at-a-time from
     164            the first to the last byte of input data. Each byte passes through several processing
     165            layers and is classified and eventually validated within the context of the document
     166            state. This introduces implicit dependencies between the various tasks within the
     167            application that make it difficult to optimize for performance. As a complex software
     168            system, no one feature dominates the overall parsing performance. Figure
     169            \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a
     170            typical run. Even if it were possible, Amdahl's Law dictates that tackling any one of
     171            these functions for parallelization in isolation would only produce a minute improvement
     172            in performance. Unfortunately, early investigation into these functions found that
     173            incorporating speculation-free thread-level parallelization was impossible and they were
     174            already performing well in their given tasks; thus only trivial enhancements were
     175            attainable. In order to obtain a systematic acceleration of Xerces, it should be
     176            expected that a comprehensive restructuring is required, involving all aspects of the
     177            parser. </para>
     178         <para>
     179            <!-- In order to obtain systematic acceleration of the Xerces parser,-->
     180            <!-- it should be expected that a comprehensive restructuring-->
     181            <!-- is required, involving all aspects of the parser.-->
     182            <!-- FIGURE
    200183\begin{figure}[h]
    201184\begin{tabular}{r|l}
     
    217200\end{figure}
    218201-->
    219 </para>
     202         </para>
    220203      </section>
    221204      <section>
    222205         <title>The Parabix Framework</title>
    223 <para>
    224 The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
    225 (and other forms of text processing.) The key idea is to exploit the availability of wide
    226 SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks
    227 of input data by using one register bit per single input byte.
    228 To facilitate this, the input data is first transposed into a set of basis bit streams.
    229 In <!--FIGURE REF Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
     206         <para> The Parabix (parallel bit stream) framework is a transformative approach to XML
     207            parsing (and other forms of text processing.) The key idea is to exploit the
     208            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
     209            data from long blocks of input data by using one register bit per single input byte. To
     210            facilitate this, the input data is first transposed into a set of basis bit streams. In <!--FIGURE REF Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
    230211is represented as 8 basis bit streams, $\tt b<subscript>{0 \ldots 7}$.
    231212-->
    232 <!-- The bits used to construct $\tt <subscript>7</subscript>$ have been highlighted in this example. -->
    233 Boolean-logic operations\footnote{&#8743;, \&#8744; and &#172; denote the boolean AND, OR and NOT operators.}
    234 are used to classify the input bits into a set of <emphasis role="ital">character-class bit streams</emphasis>, which identify key
    235 characters (or groups of characters) with a <code>1</code>.
    236 For example, one of the fundamental characters in XML is a left-angle bracket.
    237 A character is an <code>&apos;&lt;&apos; if and only if &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743;
    238 b<subscript>3</subscript>) &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>) &#8743; &#172; (b<subscript>6</subscript> &#8744; b<subscript>7</subscript>) = 1</code>.
    239 Similarly, a character is numeric,
    240 <code>[0-9] if and only if &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>) &#8743; &#172;(b<subscript>4</subscript> &#8743; (b<subscript>5</subscript> &#8744; b<subscript>6</subscript>))</code>.
    241 An important observation here is that ranges of characters may
    242 require fewer operations than individual characters and
    243 <!-- the classification cost could be amortized over many character classes.-->
    244 multiple classes can share the classification cost.
    245 </para>
    246 <para>
    247 <!-- FIGURE
     213            <!-- The bits used to construct $\tt <subscript>7</subscript>$ have been highlighted in this example. -->
     214            Boolean-logic operations\footnote{&#8743;, \&#8744; and &#172; denote the
     215            boolean AND, OR and NOT operators.} are used to classify the input bits into a set of
     216               <emphasis role="ital">character-class bit streams</emphasis>, which identify key
     217            characters (or groups of characters) with a <code>1</code>. For example, one of the
     218            fundamental characters in XML is a left-angle bracket. A character is an
     219               <code>&apos;&lt;&apos; if and only if
     220               &#172;(b<subscript>0</subscript> &#8744; b<subscript>1</subscript>)
     221               &#8743; (b<subscript>2</subscript> &#8743; b<subscript>3</subscript>)
     222               &#8743; (b<subscript>4</subscript> &#8743; b<subscript>5</subscript>)
     223               &#8743; &#172; (b<subscript>6</subscript> &#8744;
     224               b<subscript>7</subscript>) = 1</code>. Similarly, a character is numeric, <code>[0-9]
     225               if and only if &#172;(b<subscript>0</subscript> &#8744;
     226               b<subscript>1</subscript>) &#8743; (b<subscript>2</subscript> &#8743;
     227                  b<subscript>3</subscript>) &#8743; &#172;(b<subscript>4</subscript>
     228               &#8743; (b<subscript>5</subscript> &#8744;
     229            b<subscript>6</subscript>))</code>. An important observation here is that ranges of
     230            characters may require fewer operations than individual characters and
     231            <!-- the classification cost could be amortized over many character classes.--> multiple
     232            classes can share the classification cost. </para>
     233         <para>
     234            <!-- FIGURE
    248235\begin{figure}[h]
    249236\begin{center}
     
    267254\end{figure}
    268255-->
    269 </para>
    270 <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
    271 <!-- bit streams can be transformed into lexical bit streams, where the presense of -->
    272 <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
    273 <!-- process, intra-element well-formedness validation is performed on each block -->
    274 <!-- of text. -->
    275 <para>
    276 Consider, for example, the XML source data stream shown in the first line of <!-- FIGURE REF Figure \ref{fig:parabix1} -->.
    277 The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style
    278 parsing, with each bit of each stream in one-to-one correspondence to the source character code units
    279 of the input stream.
    280 For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores.
    281 The first bit stream shown is that for the opening
    282 angle brackets that represent tag openers in XML.
    283 The second and third streams show a partition of the
    284 tag openers into start tag marks and end tag marks
    285 depending on the character immediately following the
    286 opener (i.e., <code>&quot;/&quot;</code>) or not.  The remaining three
    287 lines show streams that can be computed in subsequent
    288 parsing (using the technique
    289 of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
    290 attribute names and attribute values of tags. 
    291 </para>
    292 <para>
    293 Two intuitions may help explain how the Parabix approach can lead
    294 to improved XML parsing performance. The first is that
    295 the use of the full register width offers a considerable
    296 information advantage over sequential byte-at-a-time
    297 parsing.  That is, sequential processing of bytes
    298 uses just 8 bits of each register, greatly limiting the
    299 processor resources that are effectively being used at any one time.
    300 The second is that byte-at-a-time loop scanning loops are actually
    301 often just computing a single bit of information per iteration:
    302 is the scan complete yet?
    303 Rather than computing these individual decision-bits, an approach that computes
    304 many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
    305 should provide substantial benefit.
    306 </para>
    307 <para>
    308 Previous studies have shown that the Parabix approach improves many aspects of XML processing,
    309 including transcoding \cite{Cameron2008}, character classification and validation,
    310 tag parsing and well-formedness checking. 
    311 The first Parabix parser used processor bit scan instructions to considerably accelerate
    312 sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
    313 Recent work has incorporated a method of parallel
    314 scanning using bitstream addition \cite{cameron-EuroPar2011}, as
    315 well as combining SIMD methods with 4-stage pipeline parallelism to further improve
    316 throughput \cite{HPCA2012}.
    317 Although these research prototypes handled the full syntax of schema-less XML documents,
    318 they lacked the functionality required by full XML parsers.
    319 </para>
    320 <para>
    321 Commercial XML processors support transcoding of multiple character sets and can parse and
    322 validate against multiple document vocabularies.
    323 Additionally, they provide API facilities beyond those found in research prototypes,
    324 including the widely used SAX, SAX2 and DOM interfaces.
    325 </para>
     256         </para>
     257         <!-- Using a mixture of boolean-logic and arithmetic operations, character-class -->
     258         <!-- bit streams can be transformed into lexical bit streams, where the presense of -->
     259         <!-- a 1 bit identifies a key position in the input data. As an artifact of this -->
     260         <!-- process, intra-element well-formedness validation is performed on each block -->
     261         <!-- of text. -->
     262         <para> Consider, for example, the XML source data stream shown in the first line of
     263            <!-- FIGURE REF Figure \ref{fig:parabix1} -->. The remaining lines of this figure show
     264            several parallel bit streams that are computed in Parabix-style parsing, with each bit
     265            of each stream in one-to-one correspondence to the source character code units of the
     266            input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are
     267            represented as underscores. The first bit stream shown is that for the opening angle
     268            brackets that represent tag openers in XML. The second and third streams show a
     269            partition of the tag openers into start tag marks and end tag marks depending on the
     270            character immediately following the opener (i.e., <code>&quot;/&quot;</code>) or
     271            not. The remaining three lines show streams that can be computed in subsequent parsing
     272            (using the technique of bitstream addition \cite{cameron-EuroPar2011}), namely streams
     273            marking the element names, attribute names and attribute values of tags. </para>
     274         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
     275            parsing performance. The first is that the use of the full register width offers a
     276            considerable information advantage over sequential byte-at-a-time parsing. That is,
     277            sequential processing of bytes uses just 8 bits of each register, greatly limiting the
     278            processor resources that are effectively being used at any one time. The second is that
     279            byte-at-a-time loop scanning loops are actually often just computing a single bit of
     280            information per iteration: is the scan complete yet? Rather than computing these
     281            individual decision-bits, an approach that computes many of them in parallel (e.g., 128
     282            bytes at a time using 128-bit registers) should provide substantial benefit. </para>
     283         <para> Previous studies have shown that the Parabix approach improves many aspects of XML
     284            processing, including transcoding \cite{Cameron2008}, character classification and
     285            validation, tag parsing and well-formedness checking. The first Parabix parser used
     286            processor bit scan instructions to considerably accelerate sequential scanning loops for
     287            individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method
     288            of parallel scanning using bitstream addition \cite{cameron-EuroPar2011}, as well as
     289            combining SIMD methods with 4-stage pipeline parallelism to further improve throughput
     290            \cite{HPCA2012}. Although these research prototypes handled the full syntax of
     291            schema-less XML documents, they lacked the functionality required by full XML parsers. </para>
     292         <para> Commercial XML processors support transcoding of multiple character sets and can
     293            parse and validate against multiple document vocabularies. Additionally, they provide
     294            API facilities beyond those found in research prototypes, including the widely used SAX,
     295            SAX2 and DOM interfaces. </para>
    326296      </section>
    327297      <section>
    328298         <title>Sequential vs. Parallel Paradigm</title>
    329 <para>
    330 Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents sequentially.
    331 Each character is examined to distinguish between the
    332 XML-specific markup, such as a left angle bracket <code>&lt;</code>, and the
    333 content held within the document. 
    334 As the parser progresses through the document, it alternates between markup scanning,
    335 validation and content processing modes.
    336 </para>
    337 <para>
    338 In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
    339   Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
    340   current state and the events associated with transitions of states.}.
    341 Each state transition indicates the processing context of subsequent characters.
    342 Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
    343 </para>
    344 <para>
    345 Parabix-style XML parsers utilize a concept of layered processing.
    346 A block of source text is transformed into a set of lexical bitstreams,
    347 which undergo a series of operations that can be grouped into logical layers,
    348 e.g., transposition, character classification, and lexical analysis.
    349 Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
    350 To meet the API requirements of the document-ordered Xerces output,
    351 the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
    352 </para>
    353       </section>                       
    354      </section>                 
    355                 <section>
    356                         <title>Architecture</title>             
    357                         <section>
    358                        <title>Overview</title>
    359 <!--\def \CSG{Content Stream Generator}-->
    360 <para>
    361 icXML is more than an optimized version of Xerces. Many components were grouped, restructured and
    362 rearchitected with pipeline parallelism in mind.
    363 In this section, we highlight the core differences between the two systems.
    364 As shown in Figure \ref{fig:xerces-arch}, Xerces
    365 is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
    366 The <emphasis role="ital">Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
    367 the majority of the character set encoding validation is performed as a byproduct of this process.
    368 The <emphasis role="ital">Reader</emphasis> is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
    369 It tracks the current line/column position,
    370 <!--(which is reported in the unlikely event that the input contains an error), -->
    371 performs line-break normalization and validates context-specific character set issues,
    372 such as tokenization of qualified-names.
    373 The <emphasis role="ital">Scanner</emphasis> pulls data through the reader and constructs the intermediate representation (IR)
    374 of the document; it deals with all issues related to entity expansion, validates
    375 the XML well-formedness constraints and any character set encoding issues that cannot
    376 be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    377 and normalization of character references, etc.)
    378 The <emphasis role="ital">Namespace Binder</emphasis> is a core piece of the element stack.
    379 It handles namespace scoping issues between different XML vocabularies.
    380 This allows the scanner to properly select the correct schema grammar structures.
    381 The <emphasis role="ital">Validator</emphasis> takes the IR produced by the Scanner (and
    382 potentially annotated by the Namespace Binder) and assesses whether the final output matches
    383 the user-defined DTD and schema grammar(s) before passing it to the end-user.
    384 </para>
    385 <para>
    386 <!-- FIGURE
     299         <para> Xerces&#8212;like all traditional XML parsers&#8212;processes XML documents
     300            sequentially. Each character is examined to distinguish between the XML-specific markup,
     301            such as a left angle bracket <code>&lt;</code>, and the content held within the
     302            document. As the parser progresses through the document, it alternates between markup
     303            scanning, validation and content processing modes. </para>
     304         <para> In other words, Xerces belongs to an equivalent class applications termed FSM
     305            applications\footnote{ Herein FSM applications are considered software systems whose
     306            behaviour is defined by the inputs, current state and the events associated with
     307            transitions of states.}. Each state transition indicates the processing context of
     308            subsequent characters. Unfortunately, textual data tends to be unpredictable and any
     309            character could induce a state transition. </para>
     310         <para> Parabix-style XML parsers utilize a concept of layered processing. A block of source
     311            text is transformed into a set of lexical bitstreams, which undergo a series of
     312            operations that can be grouped into logical layers, e.g., transposition, character
     313            classification, and lexical analysis. Each layer is pipeline parallel and require
     314            neither speculation nor pre-parsing stages\cite{HPCA2012}. To meet the API requirements
     315            of the document-ordered Xerces output, the results of the Parabix processing layers must
     316            be interleaved to produce the equivalent behaviour. </para>
     317      </section>
     318   </section>
     319   <section>
     320      <title>Architecture</title>
     321      <section>
     322         <title>Overview</title>
     323         <!--\def \CSG{Content Stream Generator}-->
     324         <para> icXML is more than an optimized version of Xerces. Many components were grouped,
     325            restructured and rearchitected with pipeline parallelism in mind. In this section, we
     326            highlight the core differences between the two systems. As shown in Figure
     327            \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the transcoder, reader,
     328            scanner, namespace binder, and validator. The <emphasis role="ital"
     329            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
     330            the majority of the character set encoding validation is performed as a byproduct of
     331            this process. The <emphasis role="ital">Reader</emphasis> is responsible for the
     332            streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current
     333            line/column position,
     334            <!--(which is reported in the unlikely event that the input contains an error), -->
     335            performs line-break normalization and validates context-specific character set issues,
     336            such as tokenization of qualified-names. The <emphasis role="ital">Scanner</emphasis>
     337            pulls data through the reader and constructs the intermediate representation (IR) of the
     338            document; it deals with all issues related to entity expansion, validates the XML
     339            well-formedness constraints and any character set encoding issues that cannot be
     340            completely handled by the reader or transcoder (e.g., surrogate characters, validation
     341            and normalization of character references, etc.) The <emphasis role="ital">Namespace
     342               Binder</emphasis> is a core piece of the element stack. It handles namespace scoping
     343            issues between different XML vocabularies. This allows the scanner to properly select
     344            the correct schema grammar structures. The <emphasis role="ital">Validator</emphasis>
     345            takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder)
     346            and assesses whether the final output matches the user-defined DTD and schema grammar(s)
     347            before passing it to the end-user. </para>
     348         <para>
     349            <!-- FIGURE
    387350\begin{figure}[h]
    388351\begin{center}
     
    393356\end{figure}
    394357-->
    395 </para>
    396 <para>
    397 In icXML functions are grouped into logical components.
    398 As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the Parabix Subsystem and (2) the Markup Processor.
    399 All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bitstreams.
    400 The <emphasis role="ital">Character Set Adapter</emphasis>, discussed in Section \ref{arch:character-set-adapter},
    401 mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
    402 set of lexical bitstreams, similar to those shown in Figure \ref{fig:parabix1}.
    403 These lexical bitstreams are later transformed into UTF-16 in the Content Stream Generator,
    404 after additional processing is performed.
    405 The first precursor to producing UTF-16 is the <emphasis role="ital">Parallel Markup Parser</emphasis> phase.
    406 It takes the lexical streams and produces a set of marker bitstreams in which a 1-bit identifies
    407 significant positions within the input data. One bitstream for each of the critical piece of information is created, such as
    408 the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
    409 Intra-element well-formedness validation is performed as an artifact of this process.
    410 Like Xerces, icXML must provide the Line and Column position of each error.
    411 The <emphasis role="ital">Line-Column Tracker</emphasis> uses the lexical information to keep track of the document position(s) through the use of an
    412 optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
    413 From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
    414 </para>
    415 <para>
    416 A typical XML file contains few unique element and attribute names&#8212;but each of them will occur frequently.
    417 icXML stores these as distinct data structures, called symbols, each with their own global identifier (GID).
    418 Using the symbol marker streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol Resolver</emphasis> scans through
    419 the raw data to produce a sequence of GIDs, called the <emphasis role="ital">symbol stream</emphasis>.
    420 </para>
    421 <para>
    422 The final components of the Parabix Subsystem are the <emphasis role="ital">Content Preparation Unit</emphasis> and <emphasis role="ital">Content Stream Generator</emphasis>.
    423 The former takes the (transposed) basis bitstreams and selectively filters them, according to the
    424 information provided by the Parallel Markup Parser, and the latter transforms the
    425 filtered streams into the tagged UTF-16 <emphasis role="ital">content stream</emphasis>, discussed in Section \ref{section:arch:contentstream}.
    426 </para>
    427 <para>
    428 Combined, the symbol and content stream form icXML's compressed IR of the XML document.
    429 The <emphasis role="ital">Markup Processor</emphasis>~parses the IR to validate and produce the sequential output for the end user.
    430 The <emphasis role="ital">Final WF checker</emphasis> performs inter-element well-formedness validation that would be too costly
    431 to perform in bit space, such as ensuring every start tag has a matching end tag.
    432 Xerces's namespace binding functionality is replaced by the <emphasis role="ital">Namespace Processor</emphasis>. Unlike Xerces,
    433 it is a discrete phase that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI stream</emphasis>, which are
    434 associated with each symbol occurrence.
    435 This is discussed in Section \ref{section:arch:namespacehandling}.
    436 Finally, the <emphasis role="ital">Validation</emphasis> layer implements the Xerces's validator.
    437 However, preprocessing associated with each symbol greatly reduces the work of this stage.
    438 </para>
    439 <para>
    440 <!-- FIGURE
     358         </para>
     359         <para> In icXML functions are grouped into logical components. As shown in Figure
     360            \ref{fig:icxml-arch}, two major categories exist: (1) the Parabix Subsystem and (2) the
     361            Markup Processor. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which
     362            represents data as a set of parallel bitstreams. The <emphasis role="ital">Character Set
     363               Adapter</emphasis>, discussed in Section \ref{arch:character-set-adapter}, mirrors
     364            Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of
     365            lexical bitstreams, similar to those shown in Figure \ref{fig:parabix1}. These lexical
     366            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
     367            additional processing is performed. The first precursor to producing UTF-16 is the
     368               <emphasis role="ital">Parallel Markup Parser</emphasis> phase. It takes the lexical
     369            streams and produces a set of marker bitstreams in which a 1-bit identifies significant
     370            positions within the input data. One bitstream for each of the critical piece of
     371            information is created, such as the beginning and ending of start tags, end tags,
     372            element names, attribute names, attribute values and content. Intra-element
     373            well-formedness validation is performed as an artifact of this process. Like Xerces,
     374            icXML must provide the Line and Column position of each error. The <emphasis role="ital"
     375               >Line-Column Tracker</emphasis> uses the lexical information to keep track of the
     376            document position(s) through the use of an optimized population count algorithm,
     377            described in Section \ref{section:arch:errorhandling}. From here, two data-independent
     378            branches exist: the Symbol Resolver and Content Preparation Unit. </para>
     379         <para> A typical XML file contains few unique element and attribute names&#8212;but
     380            each of them will occur frequently. icXML stores these as distinct data structures,
     381            called symbols, each with their own global identifier (GID). Using the symbol marker
     382            streams produced by the Parallel Markup Parser, the <emphasis role="ital">Symbol
     383               Resolver</emphasis> scans through the raw data to produce a sequence of GIDs, called
     384            the <emphasis role="ital">symbol stream</emphasis>. </para>
     385         <para> The final components of the Parabix Subsystem are the <emphasis role="ital">Content
     386               Preparation Unit</emphasis> and <emphasis role="ital">Content Stream
     387            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
     388            filters them, according to the information provided by the Parallel Markup Parser, and
     389            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital"
     390               >content stream</emphasis>, discussed in Section \ref{section:arch:contentstream}. </para>
     391         <para> Combined, the symbol and content stream form icXML's compressed IR of the XML
     392            document. The <emphasis role="ital">Markup Processor</emphasis>~parses the IR to
     393            validate and produce the sequential output for the end user. The <emphasis role="ital"
     394               >Final WF checker</emphasis> performs inter-element well-formedness validation that
     395            would be too costly to perform in bit space, such as ensuring every start tag has a
     396            matching end tag. Xerces's namespace binding functionality is replaced by the <emphasis
     397               role="ital">Namespace Processor</emphasis>. Unlike Xerces, it is a discrete phase
     398            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
     399               stream</emphasis>, which are associated with each symbol occurrence. This is
     400            discussed in Section \ref{section:arch:namespacehandling}. Finally, the <emphasis
     401               role="ital">Validation</emphasis> layer implements the Xerces's validator. However,
     402            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
     403         <para>
     404            <!-- FIGURE
    441405\begin{figure}[h]
    442406\begin{center}
     
    447411\end{figure}
    448412-->
    449 </para>
    450             </section>
    451                         <section>
    452                        <title>Character Set Adapters</title>
    453 <para>
    454 In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and
    455 provide the end-consumer with a single encoding format.
    456 In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
    457 because of the need to decode and classify each byte of input, mapping variable-length UTF-8
    458 byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   
    459 In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
    460 transcoding imposes at least a cost of buffer copying.
    461 </para>
    462 <para>
    463 In icXML, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
    464 Given a specified input encoding, a CSA is responsible for checking that
    465 input code units represent valid characters, mapping the characters of the encoding into
    466 the appropriate bitstreams for XML parsing actions (i.e., producing the lexical item
    467 streams), as well as supporting ultimate transcoding requirements.   All of this work
    468 is performed using the parallel bitstream representation of the source input.
    469 </para>
    470 <para>
    471 An important observation is that many character sets are an
    472 extension to the legacy 7-bit ASCII character set.  This includes the
    473 various ISO Latin character sets, UTF-8, UTF-16 and many others.
    474 Furthermore, all significant characters for parsing XML are confined to the
    475 ASCII repertoire.   Thus, a single common set of lexical item calculations
    476 serves to compute lexical item streams for all such ASCII-based character sets.
    477 </para>
    478 <para>
    479 A second observation is that&#8212;regardless of which character set is used&#8212;quite
    480 often all of the characters in a particular block of input will be within the ASCII range.
    481 This is a very simple test to perform using the bitstream representation, simply confirming that the
    482 bit 0 stream is zero for the entire block.   For blocks satisfying this test,
    483 all logic dealing with non-ASCII characters can simply be skipped.
    484 Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
    485 UTF-16 form are each set to zero in this case.
    486 </para>
    487 <para>
    488 A third observation is that repeated transcoding of the names of XML
    489 elements, attributes and so on can be avoided by using a look-up mechanism.
    490 That is, the first occurrence of each symbol is stored in a look-up
    491 table mapping the input encoding to a numeric symbol ID.   Transcoding
    492 of the symbol is applied at this time.  Subsequent look-up operations
    493 can avoid transcoding by simply retrieving the stored representation.
    494 As symbol look up is required to apply various XML validation rules,
    495 there is achieves the effect of transcoding each occurrence without
    496 additional cost.
    497 </para>
    498 <para>
    499 The cost of individual character transcoding is avoided whenever a block of input is
    500 confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
    501 Furthermore, when transcoding is required, the parallel bitstream representation
    502 supports efficient transcoding operations.   
    503 In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bitstreams
    504 can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
    505 and all but the final bytes of multi-byte sequences can be marked for deletion as
    506 discussed in the following subsection.
    507 In other cases, transcoding within a block only need be applied for non-ASCII
    508 bytes, which are conveniently identified by iterating through the bit 0 stream
    509 using bit scan operations.
    510 </para>
    511             </section>
    512                         <section>
    513                        <title>Combined Parallel Filtering</title>
    514 <para>
    515 As just mentioned, UTF-8 to UTF-16 transcoding involves marking
    516 all but the last bytes of multi-byte UTF-8 sequences as
    517 positions for deletion.   For example, the two
    518 Chinese characters <code>&#x4F60;&#x597D;</code>
    519 are represented as two three-byte UTF-8 sequences <code>E4 BD A0</code>
    520 and <code>E5 A5 BD</code> while the UTF-16 representation must be
    521 compressed down to the two code units <code>4F60</code> and <code>597D</code>.
    522 In the bit parallel representation, this corresponds to a reduction
    523 from six bit positions representing UTF-8 code units (bytes)
    524 down to just two bit positions representing UTF-16 code units
    525 (double bytes).   This compression may be achieved by
    526 arranging to calculate the correct UTF-16 bits at the
    527 final position of each sequence and creating a deletion
    528 mask to mark the first two bytes of each 3-byte sequence
    529 for deletion.   In this case, the portion of the mask
    530 corresponding to these input bytes is the bit sequence
    531 <code>110110</code>.  Using this approach, transcoding may then be
    532 completed by applying parallel deletion and inverse transposition of the
    533 UTF-16 bitstreams\cite{Cameron2008}.
    534 </para>
    535 <para>
    536 <!-- FIGURE
     413         </para>
     414      </section>
     415      <section>
     416         <title>Character Set Adapters</title>
     417         <para> In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of
     418            Xerces itself and provide the end-consumer with a single encoding format. In the
     419            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
     420            because of the need to decode and classify each byte of input, mapping variable-length
     421            UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In
     422            other cases, transcoding may involve table look-up operations for each byte of input. In
     423            any case, transcoding imposes at least a cost of buffer copying. </para>
     424         <para> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
     425            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
     426            that input code units represent valid characters, mapping the characters of the encoding
     427            into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical
     428            item streams), as well as supporting ultimate transcoding requirements. All of this work
     429            is performed using the parallel bitstream representation of the source input. </para>
     430         <para> An important observation is that many character sets are an extension to the legacy
     431            7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8,
     432            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
     433            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
     434            serves to compute lexical item streams for all such ASCII-based character sets. </para>
     435         <para> A second observation is that&#8212;regardless of which character set is
     436            used&#8212;quite often all of the characters in a particular block of input will be
     437            within the ASCII range. This is a very simple test to perform using the bitstream
     438            representation, simply confirming that the bit 0 stream is zero for the entire block.
     439            For blocks satisfying this test, all logic dealing with non-ASCII characters can simply
     440            be skipped. Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
     441            UTF-16 form are each set to zero in this case. </para>
     442         <para> A third observation is that repeated transcoding of the names of XML elements,
     443            attributes and so on can be avoided by using a look-up mechanism. That is, the first
     444            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
     445            numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up
     446            operations can avoid transcoding by simply retrieving the stored representation. As
     447            symbol look up is required to apply various XML validation rules, there is achieves the
     448            effect of transcoding each occurrence without additional cost. </para>
     449         <para> The cost of individual character transcoding is avoided whenever a block of input is
     450            confined to the ASCII subset and for all but the first occurrence of any XML element or
     451            attribute name. Furthermore, when transcoding is required, the parallel bitstream
     452            representation supports efficient transcoding operations. In the important case of UTF-8
     453            to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit
     454            parallel fashion based on UTF-8 streams \cite{Cameron2008}, and all but the final bytes
     455            of multi-byte sequences can be marked for deletion as discussed in the following
     456            subsection. In other cases, transcoding within a block only need be applied for
     457            non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream
     458            using bit scan operations. </para>
     459      </section>
     460      <section>
     461         <title>Combined Parallel Filtering</title>
     462         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
     463            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
     464            Chinese characters <code>&#x4F60;&#x597D;</code> are represented as two
     465            three-byte UTF-8 sequences <code>E4 BD A0</code> and <code>E5 A5 BD</code> while the
     466            UTF-16 representation must be compressed down to the two code units <code>4F60</code>
     467            and <code>597D</code>. In the bit parallel representation, this corresponds to a
     468            reduction from six bit positions representing UTF-8 code units (bytes) down to just two
     469            bit positions representing UTF-16 code units (double bytes). This compression may be
     470            achieved by arranging to calculate the correct UTF-16 bits at the final position of each
     471            sequence and creating a deletion mask to mark the first two bytes of each 3-byte
     472            sequence for deletion. In this case, the portion of the mask corresponding to these
     473            input bytes is the bit sequence <code>110110</code>. Using this approach, transcoding
     474            may then be completed by applying parallel deletion and inverse transposition of the
     475            UTF-16 bitstreams\cite{Cameron2008}. </para>
     476         <para>
     477            <!-- FIGURE
    537478\begin{figure*}[tbh]
    538479\begin{center}
     
    540481Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
    541482-->
    542 <!-- Tag Openers & \verb`1____________1____________________________1____________1__________`\\-->
    543 <!-- Start Tag Marks & \verb`_1____________1___________________________________________________`\\-->
    544 <!-- End Tag Marks & \verb`___________________________________________1____________1_________`\\-->
    545 <!-- Empty Tag Marks & \verb`__________________________________________________________________`\\-->
    546 <!-- Element Names & \verb`_11111111_____1111111_____________________________________________`\\-->
    547 <!-- Attribute Names & \verb`______________________11_______11_________________________________`\\-->
    548 <!-- Attribute Values & \verb`__________________________111________111__________________________`\\-->
    549 <!-- FIGURE
     483            <!-- Tag Openers & \verb`1____________1____________________________1____________1__________`\\-->
     484            <!-- Start Tag Marks & \verb`_1____________1___________________________________________________`\\-->
     485            <!-- End Tag Marks & \verb`___________________________________________1____________1_________`\\-->
     486            <!-- Empty Tag Marks & \verb`__________________________________________________________________`\\-->
     487            <!-- Element Names & \verb`_11111111_____1111111_____________________________________________`\\-->
     488            <!-- Attribute Names & \verb`______________________11_______11_________________________________`\\-->
     489            <!-- Attribute Values & \verb`__________________________111________111__________________________`\\-->
     490            <!-- FIGURE
    550491String Ends & \verb`1____________1_______________1__________1_1____________1__________`\\
    551492Markup Identifiers & \verb`_________1______________1_________1______1_1____________1_________`\\
     
    558499\end{figure*}
    559500-->
    560 </para>
    561 <para>
    562 Rather than immediately paying the
    563 costs of deletion and transposition just for transcoding,
    564 however, icXML defers these steps so that the deletion
    565 masks for several stages of processing may be combined.
    566 In particular, this includes core XML requirements
    567 to normalize line breaks and to replace character
    568 reference and entity references by their corresponding
    569 text.   In the case of line break normalization,
    570 all forms of line breaks, including bare carriage
    571 returns (CR), line feeds (LF) and CR-LF combinations
    572 must be normalized to a single LF character in
    573 each case.   In icXML, this is achieved by
    574 first marking CR positions, performing two
    575 bit parallel operations to transform the marked
    576 CRs into LFs, and then marking for deletion any
    577 LF that is found immediately after the marked CR
    578 as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
    579 <!-- FIGURE
     501         </para>
     502         <para> Rather than immediately paying the costs of deletion and transposition just for
     503            transcoding, however, icXML defers these steps so that the deletion masks for several
     504            stages of processing may be combined. In particular, this includes core XML requirements
     505            to normalize line breaks and to replace character reference and entity references by
     506            their corresponding text. In the case of line break normalization, all forms of line
     507            breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations
     508            must be normalized to a single LF character in each case. In icXML, this is achieved by
     509            first marking CR positions, performing two bit parallel operations to transform the
     510            marked CRs into LFs, and then marking for deletion any LF that is found immediately
     511            after the marked CR as shown by the Pablo source code in Figure
     512            \ref{fig:LBnormalization}.
     513            <!-- FIGURE
    580514\begin{figure}
    581515\begin{verbatim}
     
    595529\end{figure}
    596530-->
    597 </para>
    598 <para>
    599 In essence, the deletion masks for transcoding and
    600 for line break normalization each represent a bitwise
    601 filter; these filters can be combined using bitwise-or
    602 so that the parallel deletion algorithm need only be
    603 applied once.
    604 </para>
    605 <para>
    606 A further application of combined filtering
    607 is the processing of XML character and entity
    608 references.   Consider, for example, the references <code>&amp;</code> or <code>&#x3C;</code>.
    609 which must be replaced in XML processing with 
    610 the single <code>&amp;</code> and <code>&lt;</code> characters, respectively.
    611 The approach in icXML is to mark all but the first character
    612 positions of each reference for deletion, leaving a
    613 single character position unmodified.  Thus, for the
    614 references <code>&amp;</code> or <code>&#x3C;</code> the
    615 masks <code>01111</code> and <code>011111</code> are formed and
    616 combined into the overall deletion mask.   After the
    617 deletion and inverse transposition operations are finally
    618 applied, a post-processing step inserts the proper character
    619 at these positions.   One note about this process is
    620 that it is speculative; references are assumed to generally
    621 be replaced by a single UTF-16 code unit.   In the case,
    622 that this is not true, it is addressed in post-processing.
    623 </para>
    624 <para>
    625 The final step of combined filtering occurs during
    626 the process of reducing markup data to tag bytes
    627 preceding each significant XML transition as described
    628 in section~\ref{section:arch:contentstream}.  Overall, icXML
    629 avoids separate buffer copying operations for each of the
    630 these filtering steps, paying the cost of parallel
    631 deletion and inverse transposition only once. 
    632 Currently, icXML employs the parallel-prefix compress algorithm
    633 of Steele~\cite{HackersDelight}  Performance
    634 is independent of the number of positions deleted.
    635 Future versions of icXML are expected to
    636 take advantage of the parallel extract operation~\cite{HilewitzLee2006}
    637 that Intel is now providing in its Haswell architecture.
    638 </para>
    639             </section>
    640                         <section>
    641                        <title>Content Stream</title>
    642 <para>
    643 A relatively-unique concept for icXML is the use of a filtered content stream.
    644 Rather that parsing an XML document in its original format, the input is transformed
    645 into one that is easier for the parser to iterate through and produce the sequential
    646 output.
    647 In <!-- FIGURE REF Figure~\ref{fig:parabix2} -->, the source data
    648 <!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
    649 is transformed into
    650 <!-- CODE -->
    651 <!--``<emphasis role="ital">0</emphasis>\verb`>fee`<emphasis role="ital">0</emphasis>\verb`=fie`<emphasis role="ital">0</emphasis>\verb`=foe`<emphasis role="ital">0</emphasis>\verb`>`<emphasis role="ital">0</emphasis>\verb`/fum`<emphasis role="ital">0</emphasis>\verb`/`''-->
    652 through the parallel filtering algorithm, described in section \ref{sec:parfilter}.
    653 </para>
    654 <para>
    655 Combined with the symbol stream, the parser traverses the content stream to effectively
    656 reconstructs the input document in its output form.
    657 The initial <emphasis role="ital">0</emphasis> indicates an empty content string. The following <code>&gt;</code>
    658 indicates that a start tag without any attributes is the first element in this text and
    659 the first unused symbol, <code>document</code>, is the element name.
    660 Succeeding that is the content string <code>fee</code>, which is null-terminated in accordance
    661 with the Xerces API specification. Unlike Xerces, no memory-copy operations
    662 are required to produce these strings, which as Figure~\ref{fig:xerces-profile} shows
    663 accounts for 6.83% of Xerces's execution time.
    664 Additionally, it is cheap to locate the terminal character of each string:
    665 using the String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
    666 null character in the content stream in parallel, which in turn means the parser can
    667 directly jump to the end of every string without scanning for it.
    668 </para>
    669 <para>
    670 Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the existence of an attribute.
    671 Because all of the intra-element was performed in the Parabix Subsystem, this must be a legal attribute.
    672 Since attributes can only occur within start tags and must be accompanied by a textual value,
    673 the next symbol in the symbol stream must be the element name of a start tag,
    674 and the following one must be the name of the attribute and the string that follows the <code>=</code> must be its value.
    675 However, the subsequent <code>=</code> is not treated as an independent attribute because the parser has yet to
    676 read a <code>&gt;</code>, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and
    677 it (along with the string value) is added to the element.
    678 Eventually the parser reaches a <code>/</code>, which marks the existence of an end tag. Every end tag requires an
    679 element name, which means they require a symbol. Inter-element validation whenever an empty tag is detected to
    680 ensure that the appropriate scope-nesting rules have been applied.
    681 </para>
    682             </section>
    683                         <section>
    684                        <title>Namespace Handling</title>
    685 <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
    686 <para>
    687 In XML, namespaces prevents naming conflicts when multiple vocabularies are used together.
    688 It is especially important when a vocabulary application-dependant meaning, such as when
    689 XML or SVG documents are embedded within XHTML files.
    690 Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify
    691 specific names or resources.
    692 On line 1 of Figure \ref{fig:namespace1}, the <code>xmlns</code> attribute instructs the XML
    693 processor to bind the prefix <code>p</code> to the URI &apos;<code>pub.net</code>&apos; and the default (empty)
    694 prefix to <code>book.org</code>. Thus to the XML processor, the <code>title</code> on line 2 and
    695 <code>price</code> on line 4 both read as <code>&quot;book.org&quot;:title</code> and <code>&quot;book.org&quot;:price</code>
    696 respectively, whereas on line 3 and 5, <code>p:name</code> and <code>price</code> are seen as
    697 <code>&quot;pub.net&quot;:name</code> and <code>&quot;pub.net&quot;:price</code>. Even though the actual element name
    698 <code>price</code>, due to namespace scoping rules they are viewed as two uniquely-named items
    699 because the current vocabulary is determined by the namespace(s) that are in-scope.
    700 </para>
    701 <para>
    702 <!-- FIGURE
     531         </para>
     532         <para> In essence, the deletion masks for transcoding and for line break normalization each
     533            represent a bitwise filter; these filters can be combined using bitwise-or so that the
     534            parallel deletion algorithm need only be applied once. </para>
     535         <para> A further application of combined filtering is the processing of XML character and
     536            entity references. Consider, for example, the references <code>&amp;</code> or
     537               <code>&#x3C;</code>. which must be replaced in XML processing with the single
     538               <code>&amp;</code> and <code>&lt;</code> characters, respectively. The
     539            approach in icXML is to mark all but the first character positions of each reference for
     540            deletion, leaving a single character position unmodified. Thus, for the references
     541               <code>&amp;</code> or <code>&#x3C;</code> the masks <code>01111</code> and
     542               <code>011111</code> are formed and combined into the overall deletion mask. After the
     543            deletion and inverse transposition operations are finally applied, a post-processing
     544            step inserts the proper character at these positions. One note about this process is
     545            that it is speculative; references are assumed to generally be replaced by a single
     546            UTF-16 code unit. In the case, that this is not true, it is addressed in
     547            post-processing. </para>
     548         <para> The final step of combined filtering occurs during the process of reducing markup
     549            data to tag bytes preceding each significant XML transition as described in
     550            section~\ref{section:arch:contentstream}. Overall, icXML avoids separate buffer copying
     551            operations for each of the these filtering steps, paying the cost of parallel deletion
     552            and inverse transposition only once. Currently, icXML employs the parallel-prefix
     553            compress algorithm of Steele~\cite{HackersDelight} Performance is independent of the
     554            number of positions deleted. Future versions of icXML are expected to take advantage of
     555            the parallel extract operation~\cite{HilewitzLee2006} that Intel is now providing in its
     556            Haswell architecture. </para>
     557      </section>
     558      <section>
     559         <title>Content Stream</title>
     560         <para> A relatively-unique concept for icXML is the use of a filtered content stream.
     561            Rather that parsing an XML document in its original format, the input is transformed
     562            into one that is easier for the parser to iterate through and produce the sequential
     563            output. In <!-- FIGURE REF Figure~\ref{fig:parabix2} -->, the source data
     564            <!-- \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>| -->
     565            is transformed into <!-- CODE -->
     566            <!--``<emphasis role="ital">0</emphasis>\verb`>fee`<emphasis role="ital">0</emphasis>\verb`=fie`<emphasis role="ital">0</emphasis>\verb`=foe`<emphasis role="ital">0</emphasis>\verb`>`<emphasis role="ital">0</emphasis>\verb`/fum`<emphasis role="ital">0</emphasis>\verb`/`''-->
     567            through the parallel filtering algorithm, described in section \ref{sec:parfilter}. </para>
     568         <para> Combined with the symbol stream, the parser traverses the content stream to
     569            effectively reconstructs the input document in its output form. The initial <emphasis
     570               role="ital">0</emphasis> indicates an empty content string. The following
     571               <code>&gt;</code> indicates that a start tag without any attributes is the first
     572            element in this text and the first unused symbol, <code>document</code>, is the element
     573            name. Succeeding that is the content string <code>fee</code>, which is null-terminated
     574            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
     575            operations are required to produce these strings, which as
     576            Figure~\ref{fig:xerces-profile} shows accounts for 6.83% of Xerces's execution time.
     577            Additionally, it is cheap to locate the terminal character of each string: using the
     578            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
     579            null character in the content stream in parallel, which in turn means the parser can
     580            directly jump to the end of every string without scanning for it. </para>
     581         <para> Following <code>&apos;fee&apos;</code> is a <code>=</code>, which marks the
     582            existence of an attribute. Because all of the intra-element was performed in the Parabix
     583            Subsystem, this must be a legal attribute. Since attributes can only occur within start
     584            tags and must be accompanied by a textual value, the next symbol in the symbol stream
     585            must be the element name of a start tag, and the following one must be the name of the
     586            attribute and the string that follows the <code>=</code> must be its value. However, the
     587            subsequent <code>=</code> is not treated as an independent attribute because the parser
     588            has yet to read a <code>&gt;</code>, which marks the end of a start tag. Thus only
     589            one symbol is taken from the symbol stream and it (along with the string value) is added
     590            to the element. Eventually the parser reaches a <code>/</code>, which marks the
     591            existence of an end tag. Every end tag requires an element name, which means they
     592            require a symbol. Inter-element validation whenever an empty tag is detected to ensure
     593            that the appropriate scope-nesting rules have been applied. </para>
     594      </section>
     595      <section>
     596         <title>Namespace Handling</title>
     597         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
     598         <para> In XML, namespaces prevents naming conflicts when multiple vocabularies are used
     599            together. It is especially important when a vocabulary application-dependant meaning,
     600            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
     601            to uniform resource identifiers (URIs), which are strings used to identify specific
     602            names or resources. On line 1 of Figure \ref{fig:namespace1}, the <code>xmlns</code>
     603            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
     604               &apos;<code>pub.net</code>&apos; and the default (empty) prefix to
     605               <code>book.org</code>. Thus to the XML processor, the <code>title</code> on line 2
     606            and <code>price</code> on line 4 both read as
     607            <code>&quot;book.org&quot;:title</code> and
     608               <code>&quot;book.org&quot;:price</code> respectively, whereas on line 3 and
     609            5, <code>p:name</code> and <code>price</code> are seen as
     610               <code>&quot;pub.net&quot;:name</code> and
     611               <code>&quot;pub.net&quot;:price</code>. Even though the actual element name
     612               <code>price</code>, due to namespace scoping rules they are viewed as two
     613            uniquely-named items because the current vocabulary is determined by the namespace(s)
     614            that are in-scope. </para>
     615         <para>
     616            <!-- FIGURE
    703617\begin{figure}[h]
    704618\begin{tabular}{l|l}
     
    714628\end{figure}
    715629-->
    716 </para>
    717 <para>
    718 In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID.
    719 These persist for the lifetime of the application through the use of a global URI pool.
    720 Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
    721 in the document. Because a namespace declaration affects the entire element, it must be processed prior to
    722 grammar validation. This is a costly process considering that a typical namespaced XML document only comes
    723 in one of two forms:
    724 (1) those that declare a set of namespaces upfront and never change them, and
    725 (2) those that repeatedly modify the namespaces in predictable patterns.
    726 </para>
    727 <para>
    728 For that reason, icXML contains an independent namespace stack and utilizes bit vectors to cheaply perform
    729 <!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
    730 <!-- performance advantage figure?? average cycles/byte cost? -->
    731 When a prefix is declared (e.g., <code>xmlns:p=&quot;pub.net&quot;</code>), a namespace binding is created that maps
    732 the prefix (which are assigned Prefix IDs in the symbol resolution process) to the URI.
    733 Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
    734 NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
    735 prefix binding set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and <code>11</code> respectively.
    736 To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
    737 maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
    738 NSID can be found using a bit-scan intrinsic.
    739 A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
    740 </para>
    741 <para>
    742 <!-- FIGURE
     630         </para>
     631         <para> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
     632            persist for the lifetime of the application through the use of a global URI pool. Xerces
     633            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
     634            (end tag) occurs in the document. Because a namespace declaration affects the entire
     635            element, it must be processed prior to grammar validation. This is a costly process
     636            considering that a typical namespaced XML document only comes in one of two forms: (1)
     637            those that declare a set of namespaces upfront and never change them, and (2) those that
     638            repeatedly modify the namespaces in predictable patterns. </para>
     639         <para> For that reason, icXML contains an independent namespace stack and utilizes bit
     640            vectors to cheaply perform <!-- speculation and scope resolution options with a single XOR operation &#8212; even if many alterations are performed. -->
     641            <!-- performance advantage figure?? average cycles/byte cost? --> When a prefix is
     642            declared (e.g., <code>xmlns:p=&quot;pub.net&quot;</code>), a namespace binding
     643            is created that maps the prefix (which are assigned Prefix IDs in the symbol resolution
     644            process) to the URI. Each unique namespace binding has a unique namespace id (NSID) and
     645            every prefix contains a bit vector marking every NSID that has ever been associated with
     646            it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding
     647            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
     648            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
     649            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
     650            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
     651            found using a bit-scan intrinsic. A namespace binding table, similar to Table
     652            \ref{tbl:namespace1}, provides the actual URI ID. </para>
     653         <para>
     654            <!-- FIGURE
    743655\begin{table}[h]
    744656\begin{center}
     
    754666\end{table}
    755667-->
    756 </para>
    757 <para>
    758 <!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
    759 <!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
    760 <!-- NSid = bitscan(VisiblePrefixBinding); -->
    761 <!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
    762 </para>
    763 <para>
    764 To ensure that scoping rules are adhered to,
    765 whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored
    766 within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
    767 currently visible namespaces is XORed with the vector at the top of the stack.
    768 This allows any number of changes to be performed at each scope-level with a constant time.
    769 <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
    770 </para>
    771             </section>
    772                         <section>
    773                        <title>Error Handling</title>
    774 <para>
    775 <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
    776 Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
    777 As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
    778 if not, the current state determines the type and severity of the error.
    779 icXML emits errors in the similar manner&#8212;but how it discovers them is substantially different.
    780 Recall that in Figure \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and Markup Processor,
    781 each with its own system for detecting and producing error messages.
    782 </para>
    783 <para>
    784 Within the Parabix Subsystem, all computations are performed in parallel, a block at a time.
    785 Errors are derived as artifacts of bitstream calculations, with a 1-bit marking the byte-position of an error within a block,
    786 and the type of error is determined by the equation that discovered it.
    787 The difficulty of error processing in this section is that in Xerces the line and column number must be given
    788 with every error production. Two major issues exist because of this:
    789 (1) line position adheres to XML white-normalization rules; as such, some sequences of characters, e.g., a carriage return
    790 followed by a line feed, are counted as a single new line character.
    791 (2) column position is counted in characters, not bytes or code units;
    792 thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
    793 Note that typical XML documents are error-free but the calculation of the
    794 line/column position is a constant overhead in Xerces. <!-- that must be maintained in the case that one occurs. -->
    795 To reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence of the error and
    796 performs the minimal amount of book-keeping necessary to facilitate it.
    797 icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
    798 within the Line Column Tracker (LCT).
    799 One of the CSA's major responsibilities is transcoding an input text. <!-- from some encoding format to near-output-ready UTF-16. -->
    800 During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
    801 and validated.
    802 A <emphasis role="ital">line-feed bitstream</emphasis>, which marks the positions of the normalized new lines characters, is a natural derivative of
    803 this process.
    804 Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
    805 <!-- The optimization delays the counting process .... -->
    806 Column position is more difficult to calculate.
    807 It is possible to scan backwards through the bitstream of new line characters to determine the distance (in code-units)
    808 between the position between which an error was detected and the last line feed. However, this distance may exceed
    809 than the actual character position for the reasons discussed in (2).
    810 To handle this, the CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together many relevant bitstreams,
    811 such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
    812 normalization process.
    813 When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
    814 column number.
    815 </para>
    816 <para>
    817 The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces.
    818 However, reporting the correct line/column is a much more difficult problem.
    819 The Markup Processor parses the content stream, which is a series of tagged UTF-16 strings.
    820 Each string is normalized in accordance with the XML specification.
    821 All symbol data and unnecessary whitespace is eliminated from the stream;
    822 thus its impossible to derive the current location using only the content stream.
    823 To calculate the location, the Markup Processor borrows three additional pieces of information from the Parabix Subsystem:
    824 the line-feed, skip mask, and a <emphasis role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the (code-unit) position of every
    825 datum that was suppressed from the source during the production of the content stream.
    826 Armed with these, it is possible to calculate the actual line/column using
    827 the same system as the Parabix Subsystem until the sum of the negated deletion mask stream is equal to the current position.
    828 </para>
    829             </section>
    830                 </section>
     668         </para>
     669         <para>
     670            <!-- PrefixBindings = PrefixBindingTable[prefixID]; -->
     671            <!-- VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; -->
     672            <!-- NSid = bitscan(VisiblePrefixBinding); -->
     673            <!-- URIid = NameSpaceBindingTable[NSid].URIid; -->
     674         </para>
     675         <para> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
     676            any modification to the currently visible namespaces is calculated and stored within a
     677            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
     678            is found, the currently visible namespaces is XORed with the vector at the top of the
     679            stack. This allows any number of changes to be performed at each scope-level with a
     680            constant time.
     681            <!-- Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.-->
     682         </para>
     683      </section>
     684      <section>
     685         <title>Error Handling</title>
     686         <para>
     687            <!-- XML errors are rare but they do happen, especially with untrustworthy data sources.-->
     688            Xerces outputs error messages in two ways: through the programmer API and as thrown
     689            objects for fatal errors. As Xerces parses a file, it uses context-dependant logic to
     690            assess whether the next character is legal; if not, the current state determines the
     691            type and severity of the error. icXML emits errors in the similar manner&#8212;but
     692            how it discovers them is substantially different. Recall that in Figure
     693            \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and
     694            Markup Processor, each with its own system for detecting and producing error messages. </para>
     695         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
     696            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
     697            the byte-position of an error within a block, and the type of error is determined by the
     698            equation that discovered it. The difficulty of error processing in this section is that
     699            in Xerces the line and column number must be given with every error production. Two
     700            major issues exist because of this: (1) line position adheres to XML white-normalization
     701            rules; as such, some sequences of characters, e.g., a carriage return followed by a line
     702            feed, are counted as a single new line character. (2) column position is counted in
     703            characters, not bytes or code units; thus multi-code-unit code-points and surrogate
     704            character pairs are all counted as a single column position. Note that typical XML
     705            documents are error-free but the calculation of the line/column position is a constant
     706            overhead in Xerces. <!-- that must be maintained in the case that one occurs. --> To
     707            reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence
     708            of the error and performs the minimal amount of book-keeping necessary to facilitate it.
     709            icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates
     710            the information within the Line Column Tracker (LCT). One of the CSA's major
     711            responsibilities is transcoding an input text.
     712            <!-- from some encoding format to near-output-ready UTF-16. --> During this process,
     713            white-space normalization rules are applied and multi-code-unit and surrogate characters
     714            are detected and validated. A <emphasis role="ital">line-feed bitstream</emphasis>,
     715            which marks the positions of the normalized new lines characters, is a natural
     716            derivative of this process. Using an optimized population count algorithm, the line
     717            count can be summarized cheaply for each valid block of text.
     718            <!-- The optimization delays the counting process .... --> Column position is more
     719            difficult to calculate. It is possible to scan backwards through the bitstream of new
     720            line characters to determine the distance (in code-units) between the position between
     721            which an error was detected and the last line feed. However, this distance may exceed
     722            than the actual character position for the reasons discussed in (2). To handle this, the
     723            CSA generates a <emphasis role="ital">skip mask</emphasis> bitstream by ORing together
     724            many relevant bitstreams, such as all trailing multi-code-unit and surrogate characters,
     725            and any characters that were removed during the normalization process. When an error is
     726            detected, the sum of those skipped positions is subtracted from the distance to
     727            determine the actual column number. </para>
     728         <para> The Markup Processor is a state-driven machine. As such, error detection within it
     729            is very similar to Xerces. However, reporting the correct line/column is a much more
     730            difficult problem. The Markup Processor parses the content stream, which is a series of
     731            tagged UTF-16 strings. Each string is normalized in accordance with the XML
     732            specification. All symbol data and unnecessary whitespace is eliminated from the stream;
     733            thus its impossible to derive the current location using only the content stream. To
     734            calculate the location, the Markup Processor borrows three additional pieces of
     735            information from the Parabix Subsystem: the line-feed, skip mask, and a <emphasis
     736               role="ital">deletion mask stream</emphasis>, which is a bitstream denoting the
     737            (code-unit) position of every datum that was suppressed from the source during the
     738            production of the content stream. Armed with these, it is possible to calculate the
     739            actual line/column using the same system as the Parabix Subsystem until the sum of the
     740            negated deletion mask stream is equal to the current position. </para>
     741      </section>
     742   </section>
    831743
    832                 <section>
    833                         <title>Multithreading with Pipeline Parallelism</title>         
    834 <para>
    835 As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application.
    836 These are &quot;embarrassingly sequential.&quot;\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
    837 However, icXML is designed to organize processing into logical layers.   
    838 In particular, layers within the Parabix Subsystem are designed to operate
    839 over significant segments of input data before passing their outputs on for
    840 subsequent processing.  This fits well into the general model of pipeline
    841 parallelism, in which each thread is in charge of a single module or group
    842 of modules.
    843 </para>
    844 <para>
    845 The most straightforward division of work in icXML is to separate
    846 the Parabix Subsystem and the Markup Processor into distinct logical layers into two separate stages.
    847 The resultant application, <emphasis role="ital">icXML-p</emphasis>, is a course-grained software-pipeline application.
    848 In this case, the Parabix Subsystem thread <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a time and produces the
    849 content, symbol and URI streams, then stores them in a pre-allocated shared data structure <code>S</code>.
    850 The Markup Processor thread <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness and grammar-based validation,
    851 and the provides parsed XML data to the application through the Xerces API. 
    852 The shared data structure is implemented using a ring buffer,
    853 where every entry contains an independent set of data streams.
    854 In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
    855 A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
    856 In Figure \ref{threads_timeline1} the processing time of <code>T<subscript>1</subscript></code> is longer than <code>T<subscript>2</subscript></code>;
    857 thus <code>T<subscript>2</subscript></code> always waits for <code>T<subscript>1</subscript></code> to write to the shared memory.
    858 Figure \ref{threads_timeline2} illustrates the scenario in which <code>T<subscript>1</subscript></code> is faster
    859 and must wait for <code>T<subscript>2</subscript></code> to finish reading the shared data before it can reuse the memory space.
    860 </para>
    861 <para>
    862 <!-- FIGURE
     744   <section>
     745      <title>Multithreading with Pipeline Parallelism</title>
     746      <para> As discussed in section \ref{background:xerces}, Xerces can be considered a FSM
     747         application. These are &quot;embarrassingly
     748         sequential.&quot;\cite{Asanovic:EECS-2006-183} and notoriously difficult to
     749         parallelize. However, icXML is designed to organize processing into logical layers. In
     750         particular, layers within the Parabix Subsystem are designed to operate over significant
     751         segments of input data before passing their outputs on for subsequent processing. This fits
     752         well into the general model of pipeline parallelism, in which each thread is in charge of a
     753         single module or group of modules. </para>
     754      <para> The most straightforward division of work in icXML is to separate the Parabix Subsystem
     755         and the Markup Processor into distinct logical layers into two separate stages. The
     756         resultant application, <emphasis role="ital">icXML-p</emphasis>, is a course-grained
     757         software-pipeline application. In this case, the Parabix Subsystem thread
     758               <code>T<subscript>1</subscript></code> reads 16k of XML input <code>I</code> at a
     759         time and produces the content, symbol and URI streams, then stores them in a pre-allocated
     760         shared data structure <code>S</code>. The Markup Processor thread
     761            <code>T<subscript>2</subscript></code> consumes <code>S</code>, performs well-formedness
     762         and grammar-based validation, and the provides parsed XML data to the application through
     763         the Xerces API. The shared data structure is implemented using a ring buffer, where every
     764         entry contains an independent set of data streams. In the examples of Figure
     765         \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A
     766         lock-free mechanism is applied to ensure that each entry can only be read or written by one
     767         thread at the same time. In Figure \ref{threads_timeline1} the processing time of
     768               <code>T<subscript>1</subscript></code> is longer than
     769         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
     770         waits for <code>T<subscript>1</subscript></code> to write to the shared memory. Figure
     771         \ref{threads_timeline2} illustrates the scenario in which
     772         <code>T<subscript>1</subscript></code> is faster and must wait for
     773            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
     774         reuse the memory space. </para>
     775      <para>
     776         <!-- FIGURE
    863777\begin{figure}
    864778\subfigure[]{
     
    875789\end{figure}
    876790-->
    877 </para>
    878 <para>
    879 Overall, our design is intended to benefit a range of applications.
    880 Conceptually, we consider two design points.
    881 The first, the parsing performed by the Parabix Subsystem dominates at 67% of the overall cost,
    882 with the cost of application processing (including the driver logic within the Markup Processor) at 33%.   
    883 The second is almost the opposite scenario, the cost of application processing dominates at 60%,
    884 while the cost of XML parsing represents an overhead of 40%.
    885 </para>
    886 <para>
    887 Our design is predicated on a goal of using the Parabix
    888 framework to achieve a 50% to 100% improvement in the parsing engine itself.   
    889 In a best case scenario,
    890 a 100% improvement of the Parabix Subsystem for the design point in which
    891 XML parsing dominates at 67% of the total application cost.
    892 In this case, the single-threaded icXML should achieve a 1.5x speedup over Xerces
    893 so that the total application cost reduces to 67% of the original. 
    894 However, in icXML-p, our ideal scenario gives us two well-balanced threads
    895 each performing about 33% of the original work.   
    896 In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.
    897 </para>
    898 <para>
    899 At the other extreme of our design range, we consider an application
    900 in which core parsing cost is 40%.  Assuming the 2x speedup of
    901 the Parabix Subsystem over the corresponding Xerces core, single-threaded
    902 icXML delivers a 25% speedup.  However, the most significant
    903 aspect of our two-stage multi-threaded design then becomes the
    904 ability to hide the entire latency of parsing within the serial time
    905 required by the application.  In this case, we achieve
    906 an overall speedup in processing time by 1.67x.
    907 </para>
    908 <para>
    909 Although the structure of the Parabix Subsystem allows division of the work into
    910 several pipeline stages and has been demonstrated to be effective
    911 for four pipeline stages in a research prototype \cite{HPCA2012},
    912 our analysis here suggests that the further pipelining of work within
    913 the Parabix Subsystem is not worthwhile if the cost of application logic is little as
    914 33% of the end-to-end cost using Xerces.  To achieve benefits of
    915 further parallelization with multi-core technology, there would
    916 need to be reductions in the cost of application logic that
    917 could match reductions in core parsing cost.
    918 </para>
    919                 </section>
     791      </para>
     792      <para> Overall, our design is intended to benefit a range of applications. Conceptually, we
     793         consider two design points. The first, the parsing performed by the Parabix Subsystem
     794         dominates at 67% of the overall cost, with the cost of application processing (including
     795         the driver logic within the Markup Processor) at 33%. The second is almost the opposite
     796         scenario, the cost of application processing dominates at 60%, while the cost of XML
     797         parsing represents an overhead of 40%. </para>
     798      <para> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
     799         100% improvement in the parsing engine itself. In a best case scenario, a 100% improvement
     800         of the Parabix Subsystem for the design point in which XML parsing dominates at 67% of the
     801         total application cost. In this case, the single-threaded icXML should achieve a 1.5x
     802         speedup over Xerces so that the total application cost reduces to 67% of the original.
     803         However, in icXML-p, our ideal scenario gives us two well-balanced threads each performing
     804         about 33% of the original work. In this case, Amdahl's law predicts that we could expect up
     805         to a 3x speedup at best. </para>
     806      <para> At the other extreme of our design range, we consider an application in which core
     807         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
     808         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
     809         significant aspect of our two-stage multi-threaded design then becomes the ability to hide
     810         the entire latency of parsing within the serial time required by the application. In this
     811         case, we achieve an overall speedup in processing time by 1.67x. </para>
     812      <para> Although the structure of the Parabix Subsystem allows division of the work into
     813         several pipeline stages and has been demonstrated to be effective for four pipeline stages
     814         in a research prototype \cite{HPCA2012}, our analysis here suggests that the further
     815         pipelining of work within the Parabix Subsystem is not worthwhile if the cost of
     816         application logic is little as 33% of the end-to-end cost using Xerces. To achieve benefits
     817         of further parallelization with multi-core technology, there would need to be reductions in
     818         the cost of application logic that could match reductions in core parsing cost. </para>
     819   </section>
    920820
    921                 <section>
    922                         <title>Performance</title>             
    923 <para>
    924 We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications:
    925 the Xerces C++ SAXCount sample application,
    926 and a real world GML to SVG transformation application.
    927 We investigated XML parser performance using an Intel Core i7 quad-core
    928 (Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core),
    929 32+32 kB (per core) L1 cache,
    930 256 kB (per core) L2 cache,
    931 8 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
    932 </para>
    933 <para>
    934 We analyzed the execution profiles of each XML parser
    935 using the performance counters found in the processor.
    936 We chose several key hardware events that provide insight into the profile of each
    937 application and indicate if the processor is doing useful work. 
    938 The set of events included in our study are:
    939 processor cycles, branch instructions, branch mispredictions,
    940 and cache misses. The Performance Application Programming Interface
    941 (PAPI) Version 5.5.0 \cite{papi} toolkit
    942 was installed on the test system to facilitate the
    943 collection of hardware performance monitoring
    944 statistics. In addition, we used the Linux perf \cite{perf} utility
    945 to collect per core hardware events.
    946 </para>
    947                         <section>
    948                        <title>Xerces C++ SAXCount</title>
    949 <para>
    950 Xerces comes with sample applications that demonstrate salient features of the parser.
    951 SAXCount is the simplest such application:
    952 it counts the elements, attributes and characters of a given XML file using the (event based) SAX API
    953 and prints out the totals.
    954 </para>
    955 <para>
    956 <!-- TABLE
     821   <section>
     822      <title>Performance</title>
     823      <para> We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications: the
     824         Xerces C++ SAXCount sample application, and a real world GML to SVG transformation
     825         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
     826         Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core)
     827         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
     828         12.04 (Linux). </para>
     829      <para> We analyzed the execution profiles of each XML parser using the performance counters
     830         found in the processor. We chose several key hardware events that provide insight into the
     831         profile of each application and indicate if the processor is doing useful work. The set of
     832         events included in our study are: processor cycles, branch instructions, branch
     833         mispredictions, and cache misses. The Performance Application Programming Interface (PAPI)
     834         Version 5.5.0 \cite{papi} toolkit was installed on the test system to facilitate the
     835         collection of hardware performance monitoring statistics. In addition, we used the Linux
     836         perf \cite{perf} utility to collect per core hardware events. </para>
     837      <section>
     838         <title>Xerces C++ SAXCount</title>
     839         <para> Xerces comes with sample applications that demonstrate salient features of the
     840            parser. SAXCount is the simplest such application: it counts the elements, attributes
     841            and characters of a given XML file using the (event based) SAX API and prints out the
     842            totals. </para>
     843         <para>
     844            <!-- TABLE
    957845\begin{table}
    958846\begin{center}
     
    973861\end{table}
    974862-->
    975 </para>
    976 <para>
    977 Table \ref{XMLDocChars} shows the document characteristics of the XML input
    978 files selected for the Xerces C++ SAXCount benchmark. The jaw.xml
    979 represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequence
    980 required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented
    981 XML documents and consist entirely of single byte encoded ASCII characters.
    982 </para>
    983 <para>
    984 A key predictor of the overall parsing performance of an XML file is markup density\footnote{
    985   Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}.
    986 This metric has substantial influence on the performance of traditional recursive descent XML parsers
    987 because it directly corresponds to the number of state transitions that occur when parsing a document.
    988 We use a mixture of document-oriented and
    989 data-oriented XML files to analyze performance over a spectrum
    990 of markup densities.
    991 </para>
    992 <para>
    993 Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML in terms of
    994 CPU cycles per byte for the SAXCount application.
    995 The speedup for icXML over Xerces is 1.3x to 1.8x.
    996 With two threads on the multicore machine, icXML-p can achieve speedup up to 2.7x.
    997 Xerces is substantially slowed by dense markup
    998 but icXML is less affected through a reduction in branches and the use of parallel-processing techniques.
    999 icXML-p performs better as markup-density increases because the work performed by each stage is
    1000 well balanced in this application.
    1001 </para>
    1002 <para>
    1003 <!-- FIGURE
     863         </para>
     864         <para> Table \ref{XMLDocChars} shows the document characteristics of the XML input files
     865            selected for the Xerces C++ SAXCount benchmark. The jaw.xml represents document-oriented
     866            XML inputs and contains the three-byte and four-byte UTF-8 sequence required for the
     867            UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML
     868            documents and consist entirely of single byte encoded ASCII characters. </para>
     869         <para> A key predictor of the overall parsing performance of an XML file is markup
     870            density\footnote{ Markup Density: the ratio of markup bytes used to define the structure
     871            of the document vs. its file size.}. This metric has substantial influence on the
     872            performance of traditional recursive descent XML parsers because it directly corresponds
     873            to the number of state transitions that occur when parsing a document. We use a mixture
     874            of document-oriented and data-oriented XML files to analyze performance over a spectrum
     875            of markup densities. </para>
     876         <para> Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML
     877            in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over
     878            Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p can achieve
     879            speedup up to 2.7x. Xerces is substantially slowed by dense markup but icXML is less
     880            affected through a reduction in branches and the use of parallel-processing techniques.
     881            icXML-p performs better as markup-density increases because the work performed by each
     882            stage is well balanced in this application. </para>
     883         <para>
     884            <!-- FIGURE
    1004885\begin{figure}
    1005886\includegraphics[width=0.5\textwidth]{plots/perf_SAX.pdf}
     
    1008889\end{figure}
    1009890-->
    1010 </para>
    1011             </section>
    1012                         <section>
    1013                        <title>GML2SVG</title>
    1014                        <para></para>
    1015             </section>
    1016                 </section>
     891         </para>
     892      </section>
     893      <section>
     894         <title>GML2SVG</title>
     895         <para/>
     896      </section>
     897   </section>
    1017898
    1018                 <section>
    1019                         <title>Conclusion and Future Work</title>               
    1020 <para>
    1021 This paper is the first case study documenting the significant
    1022 performance benefits that may be realized through the integration
    1023 of parallel bitstream technology into existing widely-used software libraries.
    1024 In the case of the Xerces-C++ XML parser, the
    1025 combined integration of SIMD and multicore parallelism was
    1026 shown capable of dramatic producing dramatic increases in
    1027 throughput and reductions in branch mispredictions and cache misses.
    1028 The modified parser, going under the name icXML is designed
    1029 to provide the full functionality of the original Xerces library
    1030 with complete compatibility of APIs.  Although substantial
    1031 re-engineering was required to realize the
    1032 performance potential of parallel technologies, this
    1033 is an important case study demonstrating the general
    1034 feasibility of these techniques.
    1035 </para>
    1036 <para>
    1037 The further development of icXML to move beyond 2-stage
    1038 pipeline parallelism is ongoing, with realistic prospects for
    1039 four reasonably balanced stages within the library.  For
    1040 applications such as GML2SVG which are dominated by time
    1041 spent on XML parsing, such a multistage pipelined parsing
    1042 library should offer substantial benefits. 
    1043 </para>
    1044 <para>
    1045 The example of XML parsing may be considered prototypical
    1046 of finite-state machines applications which have sometimes
    1047 been considered &quot;embarassingly sequential&quot; and so
    1048 difficult to parallelize that &quot;nothing works.&quot;  So the
    1049 case study presented here should be considered an important
    1050 data point in making the case that parallelization can
    1051 indeed be helpful across a broad array of application types.
    1052 </para>
    1053 <para>
    1054 To overcome the software engineering challenges in applying
    1055 parallel bitstream technology to existing software systems,
    1056 it is clear that better library and tool support is needed.
    1057 The techniques used in the implementation of icXML and
    1058 documented in this paper could well be generalized for
    1059 applications in other contexts and automated through
    1060 the creation of compiler technology specifically supporting
    1061 parallel bitstream programming.
    1062 </para>
    1063                 </section>
     899   <section>
     900      <title>Conclusion and Future Work</title>
     901      <para> This paper is the first case study documenting the significant performance benefits
     902         that may be realized through the integration of parallel bitstream technology into existing
     903         widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined
     904         integration of SIMD and multicore parallelism was shown capable of dramatic producing
     905         dramatic increases in throughput and reductions in branch mispredictions and cache misses.
     906         The modified parser, going under the name icXML is designed to provide the full
     907         functionality of the original Xerces library with complete compatibility of APIs. Although
     908         substantial re-engineering was required to realize the performance potential of parallel
     909         technologies, this is an important case study demonstrating the general feasibility of
     910         these techniques. </para>
     911      <para> The further development of icXML to move beyond 2-stage pipeline parallelism is
     912         ongoing, with realistic prospects for four reasonably balanced stages within the library.
     913         For applications such as GML2SVG which are dominated by time spent on XML parsing, such a
     914         multistage pipelined parsing library should offer substantial benefits. </para>
     915      <para> The example of XML parsing may be considered prototypical of finite-state machines
     916         applications which have sometimes been considered &quot;embarassingly
     917         sequential&quot; and so difficult to parallelize that &quot;nothing
     918         works.&quot; So the case study presented here should be considered an important data
     919         point in making the case that parallelization can indeed be helpful across a broad array of
     920         application types. </para>
     921      <para> To overcome the software engineering challenges in applying parallel bitstream
     922         technology to existing software systems, it is clear that better library and tool support
     923         is needed. The techniques used in the implementation of icXML and documented in this paper
     924         could well be generalized for applications in other contexts and automated through the
     925         creation of compiler technology specifically supporting parallel bitstream programming.
     926      </para>
     927   </section>
    1064928
    1065 <!--     
     929   <!-- 
    1066930   <section>
    1067931      <title>Acknowledgments</title>
Note: See TracChangeset for help on using the changeset viewer.