Changeset 2483


Ignore:
Timestamp:
Oct 18, 2012, 7:01:09 PM (7 years ago)
Author:
nmedfort
Message:

minor edits

Location:
docs/Working/icXML
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icXML/arch-overview.tex

    r2471 r2483  
    33ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
    44rearchitected with pipeline parallelism in mind.
    5 In this section, we highlight the core differences between the two systems and discuss how they
    6 differ design wise.
     5In this section, we highlight the core differences between the two systems.
    76As shown in Figure \ref{fig:xerces-arch}, Xerces
    87is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
  • docs/Working/icXML/background-parabix.tex

    r2471 r2483  
    11\subsection{The Parabix Framework}
    22\label{background:parabix}
    3 
    4 \begin{figure*}[tbhp]
    5 \begin{center}
    6 \begin{tabular}{cr}\\
    7 Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
    8 Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
    9 Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
    10 End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
    11 Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
    12 Att Names & \verb`_______________________11_______11________________________1111_____________`\\
    13 Att Values & \verb`__________________________11111______11111_____________________111_________`
    14 \end{tabular}
    15 \end{center}
    16 \caption{XML Source Data and Derived Parallel Bit Streams}
    17 \label{fig:parabix1}
    18 \end{figure*}
    193
    204The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
     
    226(e.g., 128-bit) SIMD registers in commodity processors to represent data from long blocks
    237of input data by using one register bit per single input byte.
    24 To facilitate this, the input data is first transposed into a set of basis bit streams
    25 and then boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean
    26 AND, OR and NOT operators.} are used to classify the input bits into a set of character-class
    27 (and eventually lexical) bit streams.
    28 For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
    29   b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
    30   \ldots 7}$. The bits used to construct $\tt b_7$ have been
    31 highlighted in this example.
     8To facilitate this, the input data is first transposed into a set of basis bit streams.
     9In Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb|<|A}''
     10is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
     11% The bits used to construct $\tt b_7$ have been highlighted in this example.
     12Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.}
     13are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
     14characters (or groups of characters) with a $1$.
     15For example, one of the fundemental characters in XML is a left-angle bracket.
     16A character is an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
     17b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
     18Similarly, a character is numeric
     19{\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
     20% An important observation here is that a range of characters can sometimes
     21% take fewer operations and require fewer basis bit streams to compute
     22% than individual characters. Finding optimal solutions to all
     23% character-classes is non-trivial and goes beyond the scope of this
     24% paper.
    3225
    33 \begin{figure}[h]
     26\begin{figure}[hp]
    3427\begin{center}
    3528\begin{tabular}{r c c c c }
     
    5346\end{figure}
    5447
    55 Character-class bit streams allow us to perform 128
    56 comparisons in parallel with a single operation by using a series
    57 of boolean-logic operations to merge multiple
    58 basis bit streams into a single character-class stream that marks the
    59 positions of key characters with a $1$. 
    60 For example, one of the fundemental markers in XML is the one that
    61 identifies all left-angle brackets ``\verb`<`''. A character is
    62 an ``\verb`<`'' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
    63 b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
    64 Classes of characters can be found with similar formulas.  For example, a
    65 character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
    66 \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
    67 important observation here is that a range of characters can sometimes
    68 take fewer operations and require fewer basis bit streams to compute
    69 than individual characters. Finding optimal solutions to all
    70 character-classes is non-trivial and goes beyond the scope of this
    71 paper.
    72 
    73 
    7448% Using a mixture of boolean-logic and arithmetic operations, character-class
    7549% bit streams can be transformed into lexical bit streams, where the presense of
     
    7751% process, intra-element well-formedness validation is performed on each block
    7852% of text.
     53
     54\begin{figure*}[tbhp]
     55\begin{center}
     56\begin{tabular}{cr}\\
     57Source Data & \verb`<root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>`\\
     58Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
     59Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
     60End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
     61Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
     62Att Names & \verb`_______________________11_______11________________________1111_____________`\\
     63Att Values & \verb`__________________________11111______11111_____________________111_________`
     64\end{tabular}
     65\end{center}
     66\caption{XML Source Data and Derived Parallel Bit Streams}
     67\label{fig:parabix1}
     68\end{figure*}
    7969
    8070Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
  • docs/Working/icXML/background-xerces.tex

    r2429 r2483  
    11\subsection{Xerces C++ Structure}
     2\label{background:xerces}
    23
    3 The Xerces C++ parser is a widely-used standards-conformant
    4 XML parser produced as open-source software
    5 by the Apache Software Foundation.  It features
    6 comprehensive support for a variety of character encodings
     4The Xerces C++ parser
     5% is a widely-used standards-conformant
     6% XML parser produced as open-source software
     7% by the Apache Software Foundation. 
     8% It
     9features comprehensive support for a variety of character encodings
    710both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
    811multiple XML vocabularies through the XML namespace
     
    1316Xerces also supports several APIs for accessing
    1417parser services, including event-based parsing
    15 using either pull parsing or SAX push-style
     18using either pull parsing or SAX/SAX2 push-style
    1619parsing as well as a DOM tree-based parsing interface.
    1720
  • docs/Working/icXML/icxml-main.tex

    r2478 r2483  
    128128
    129129\section{Background}
     130\label{background}
    130131
    131132\input{background-xerces}
Note: See TracChangeset for help on using the changeset viewer.