Timestamp:
Apr 19, 2013, 4:07:55 PM
Author:
cameron
Message:

LB normalization figure

File:
1 edited

  • docs/Balisage13/Bal2013came0601/Bal2013came0601.xml

    r3052 r3053  
    127127      <!--
    128128      <legalnotice>
    129          <para>Copyright &#x000A9; 2009 Robert D. Cameron, Kenneth S. Herdy and Ehsan Amiri.
     129         <para>Copyright &#x000A9; 2013 Nigel Medforth, Dan Lin, Kenneth S. Herdy, Robert D. Cameron  and Arrvindh Shriraman.
    130130            This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative
    131131            Works 2.5 Canada License.</para>
     
    149149      <section>
    150150         <title>Xerces C++ Structure</title>
    151          <para> The Xerces C++ parser <!-- is a widely-used standards-conformant -->
    152             <!-- XML parser produced as open-source software -->
    153             <!-- by the Apache Software Foundation. -->
    154             <!-- It --> features comprehensive support for a variety of character encodings both
     151         <para> The Xerces C++ parser is a widely-used standards-conformant
     152            XML parser produced as open-source software
     153             by the Apache Software Foundation.
     154            It features comprehensive support for a variety of character encodings both
    155155            commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple
    156156            XML vocabularies through the XML namespace mechanism, as well as complete
     
    161161            tree-based parsing interface. </para>
    162162         <para>
    163             <!--What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?-->
    164             <!--Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars-->
    165             <!--Should we show a val-grind summary of a few files in a linechart form?--> Xerces,
     163            Xerces,
    166164            like all traditional parsers, processes XML documents sequentially a byte-at-a-time from
    167165            the first to the last byte of input data. Each byte passes through several processing
     
    208206            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
    209207            data from long blocks of input data by using one register bit per single input byte. To
    210             facilitate this, the input data is first transposed into a set of basis bit streams. In <!--FIGURE REF Figure~\ref{fig:BitStreamsExample}, the ASCII string ``{\ttfamily b7\verb|<|A}''
    211 is represented as 8 basis bit streams, $\tt b<subscript>{0 \ldots 7}$.
     208            facilitate this, the input data is first transposed into a set of basis bit streams.
     209              For example Table II shows  the ASCII bytes for the string "<code>b7&lt;A</code>" with
     210                the corresponding  8 basis bit streams, b<subscript>0</subscript> through  b<subscript>7</subscript> shown in Table III.
    212211-->
    213212            <!-- The bits used to construct $\tt <subscript>7</subscript>$ have been highlighted in this example. -->
     
    279278         <!-- process, intra-element well-formedness validation is performed on each block -->
    280279         <!-- of text. -->
    281          <para> Consider, for example, the XML source data stream shown in the first line of Table II.
     280         <para> Consider, for example, the XML source data stream shown in the first line of Table IV.
    282281The remaining lines of this figure show
    283282            several parallel bit streams that are computed in Parabix-style parsing, with each bit
     
    287286            brackets that represent tag openers in XML. The second and third streams show a
    288287            partition of the tag openers into start tag marks and end tag marks depending on the
    289             character immediately following the opener (i.e., <code>&quot;/&quot;</code>) or
     288            character immediately following the opener (i.e., &quot;<code>/</code>&quot;) or
    290289            not. The remaining three lines show streams that can be computed in subsequent parsing
    291290            (using the technique of bitstream addition \cite{cameron-EuroPar2011}), namely streams
    292291            marking the element names, attribute names and attribute values of tags. </para>
     292            <table>
     293                  <caption>
     294                     <para>XML Source Data and Derived Parallel Bit Streams</para>
     295                  </caption>
     296                  <colgroup>
     297                     <col align="centre" valign="top"/>
     298                     <col align="left" valign="top"/>
     299                  </colgroup>
     300                  <tbody>
     301          <tr><td> Source Data </td><td> <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
     302          <tr><td> Tag Openers </td><td> <code>1____________1____________________________1____________1__________</code></td></tr>
     303           <tr><td> Start Tag Marks </td><td> <code>_1____________1___________________________________________________</code></td></tr>
     304           <tr><td> End Tag Marks </td><td> <code>___________________________________________1____________1_________</code></td></tr>
     305           <tr><td> Empty Tag Marks </td><td> <code>__________________________________________________________________</code></td></tr>
     306           <tr><td> Element Names </td><td> <code>_11111111_____1111111_____________________________________________</code></td></tr>
     307           <tr><td> Attribute Names </td><td> <code>______________________11_______11_________________________________</code></td></tr>
     308           <tr><td> Attribute Values </td><td> <code>__________________________111________111__________________________</code></td></tr>
     309                  </tbody>
     310               </table>         
     311
    293312         <para> Two intuitions may help explain how the Parabix approach can lead to improved XML
    294313            parsing performance. The first is that the use of the full register width offers a
     
    491510            may then be completed by applying parallel deletion and inverse transposition of the
    492511            UTF-16 bitstreams\cite{Cameron2008}. </para>
    493 <table>
    494                   <caption>
    495                      <para>XML Source Data and Derived Parallel Bit Streams</para>
    496                   </caption>
    497                   <colgroup>
    498                      <col align="centre" valign="top"/>
    499                      <col align="left" valign="top"/>
    500                   </colgroup>
    501                   <tbody>
    502           <tr><td> Source Data </td><td> <code> <![CDATA[<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>]]> </code></td></tr>
    503           <tr><td> Tag Openers </td><td> <code>1____________1____________________________1____________1__________</code></td></tr>
    504            <tr><td> Start Tag Marks </td><td> <code>_1____________1___________________________________________________</code></td></tr>
    505            <tr><td> End Tag Marks </td><td> <code>___________________________________________1____________1_________</code></td></tr>
    506            <tr><td> Empty Tag Marks </td><td> <code>__________________________________________________________________</code></td></tr>
    507            <tr><td> Element Names </td><td> <code>_11111111_____1111111_____________________________________________</code></td></tr>
    508            <tr><td> Attribute Names </td><td> <code>______________________11_______11_________________________________</code></td></tr>
    509            <tr><td> Attribute Values </td><td> <code>__________________________111________111__________________________</code></td></tr>
    510                   </tbody>
    511                </table>         
    512512         <para> Rather than immediately paying the costs of deletion and transposition just for
    513513            transcoding, however, icXML defers these steps so that the deletion masks for several
     
    521521            after the marked CR as shown by the Pablo source code in Figure
    522522            \ref{fig:LBnormalization}.
    523             <!-- FIGURE
    524 \begin{figure}
    525 \begin{verbatim}
     523              <figure>
     524                <caption>Line Break Normalization Logic</caption>
     525  <programlisting>
    526526# XML 1.0 line-break normalization rules.
    527527if lex.CR:
     
    530530  u16lo.bit_6 ^= lex.CR
    531531  u16lo.bit_7 ^= lex.CR
    532   CRLF = pablo.Advance(lex.CR) & lex.LF
     532  CRLF = pablo.Advance(lex.CR) &amp; lex.LF
    533533  callouts.delmask |= CRLF
    534534# Adjust LF streams for line/column tracker
    535535  lex.LF |= lex.CR
    536536  lex.LF ^= CRLF
    537 \end{verbatim}
    538 \caption{Line Break Normalization Logic}\label{fig:LBnormalization}
    539 \end{figure}
    540 -->
     537</programlisting>
     538</figure>
    541539         </para>
    542540         <para> In essence, the deletion masks for transcoding and for line break normalization each
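The line break normalization logic that this changeset moves into a figure can be illustrated with a rough byte-level sketch in Python. It is not the icXML or Pablo implementation. It assumes arbitrary-precision integers standing in for parallel bit streams (bit i corresponds to input byte i), and the names `char_class_stream`, `advance` and `normalize_line_breaks` are hypothetical. The lines elided from the listing by the diff are not reconstructed; the sketch only relies on the fact that CR (0x0D) and LF (0x0A) differ by the bit pattern 0x07. Unlike icXML, which defers deletion, the sketch applies the deletion mask immediately.

```python
def char_class_stream(data: bytes, byte_value: int) -> int:
    """Return a bit stream with bit i set where data[i] == byte_value."""
    stream = 0
    for i, b in enumerate(data):
        if b == byte_value:
            stream |= 1 << i
    return stream


def advance(stream: int) -> int:
    """Shift every marker one position forward in the input (assumed
    analogue of pablo.Advance for this integer representation)."""
    return stream << 1


def normalize_line_breaks(data: bytes) -> bytes:
    """Apply XML 1.0 line break normalization: CR LF -> LF and CR -> LF."""
    cr = char_class_stream(data, 0x0D)
    lf = char_class_stream(data, 0x0A)

    # An LF that immediately follows a CR is redundant once the CR
    # itself has been rewritten to LF, so it is marked for deletion.
    crlf = advance(cr) & lf
    delmask = crlf

    out = bytearray()
    for i, b in enumerate(data):
        if (delmask >> i) & 1:
            continue  # drop the LF of a CR LF pair
        if (cr >> i) & 1:
            b ^= 0x07  # 0x0D ^ 0x07 == 0x0A, turning CR into LF
        out.append(b)
    return bytes(out)
```

For example, `normalize_line_breaks(b"a\r\nb\rc\nd")` returns `b"a\nb\nc\nd"`: the CR LF pair collapses to a single LF, the lone CR becomes LF, and the existing LF is left alone.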