Changeset 2866 for docs/Working/icXML


Ignore:
Timestamp:
Jan 30, 2013, 4:12:43 PM (6 years ago)
Author:
nmedfort
Message:

edits

Location:
docs/Working/icXML
Files:
8 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icXML/arch-charactersetadapters.tex

    r2522 r2866  
    4242additional cost.
    4343
    44 In short, the cost of individual character transcoding is avoided whenever
    45 they constitute lexical items, whenever a block of input is confined to the ASCII subset
    46 and for all but the first occurrence of any XML element or attribute name.
     44The cost of individual character transcoding is avoided whenever a block of input is
     45confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
    4746Furthermore, when transcoding is required, the parallel bit stream representation
    48 generally supports efficient transcoding operations.   In the important
     47supports efficient transcoding operations.   In the important
    4948case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
    5049can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
  • docs/Working/icXML/arch-errorhandling.tex

    r2505 r2866  
    5959datum that was surpressed from the source during the production of the content stream.
    6060Armed with these, it is possible to calculate the actual line/column using
    61 the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the cursor position.
     61the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
  • docs/Working/icXML/arch-overview.tex

    r2532 r2866  
    11\subsection{Overview}
    2 \begin{figure}
    3 \begin{center}
    4 \includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
    5 \caption{Xerces Architecture}
    6 \label{fig:xerces-arch}
    7 \end{center}
    8 \end{figure}
     2
     3\def \CSG{Stream Generator}
    94
    105\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
     
    1510The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
    1611the majority of the character set encoding validation is performed as a byproduct of this process.
    17 The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text;
    18 it keeps track of the current line/column of the cursor (which is reported to the end user in
    19 the unlikely event that the input contains an error), performs all line-break normalization
    20 and validates context-specific character set issues, such as tokenization of qualified-names and
    21 ensures each character is legal \wrt{} the XML specification.
     12The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
     13It tracks the current line/column position,
     14%(which is reported in the unlikely event that the input contains an error),
     15performs line-break normalization and validates context-specific character set issues,
     16such as tokenization of qualified-names.
    2217The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
    2318of the document; it deals with all issues related to entity expansion, validates
     
    2520be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    2621and normalization of character references, etc.)
    27 The {\it Namespace Binder}, which is a core piece of the element stack, is primarily tasked
    28 with handling namespace scoping issues between different XML vocabularies and faciliates
    29 the scanner with the construction and utilization of Schema grammar structures.
     22The {\it Namespace Binder} is a core piece of the element stack.
     23It handles namespace scoping issues between different XML vocabularies.
     24This allows the scanner to properly select the correct schema grammar structures.
    3025The {\it Validator} takes the IR produced by the Scanner (and
    3126potentially annotated by the Namespace Binder) and assesses whether the final output matches
    32 the user-defined DTD and Schema grammar(s) before passing it to the end-user.
     27the user-defined DTD and schema grammar(s) before passing it to the end-user.
    3328
    34 \begin{figure}
    35 \includegraphics[width=0.47\textwidth]{plots/icxml.pdf}
    36 \caption{\icXML{} Architecture}
    37 \label{fig:icxml-arch}
     29\begin{figure}[h]
     30\begin{center}
     31\includegraphics[height=0.45\textheight,keepaspectratio]{plots/xerces.pdf}
     32\caption{Xerces Architecture}
     33\label{fig:xerces-arch}
     34\end{center}
    3835\end{figure}
    3936
     
    4441mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
    4542set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
    46 These lexical bit streams are later transformed into UTF-16 in the Content Stream Generator,
     43These lexical bit streams are later transformed into UTF-16 in the \CSG{},
    4744after additional processing is performed.
    4845The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
     
    5249Intra-element well-formedness validation is performed as an artifact of this process.
    5350Like Xerces, \icXML{} must provide the Line and Column position of each error.
    54 The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
     51The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
    5552optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
    5653From here, two data-independent branches exist: the Symbol Pesolver and Content Preperation Unit.
    5754
    58 \icXML{} represents elements and attributes as distinct data structures, called symbols,
    59 each with their own global identifier (GID).
    60 Using the {\bf symbol marker streams} produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
    61 the raw data to produce a stream (series) of GIDs, called the {\it symbol stream}.
    62 A typical XML file will contain relatively few unique element and attribute names---but each of them will occur
    63 frequently. % throughout the document.
    64 % Grammar information can be associated with each symbol and can help reduce the look-up cost of the later Validation process.
     55A typical XML file contains few unique element and attribute names---but each of them will occur frequently.
     56\icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID).
     57Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
     58the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
    6559
    66 The final components of the \PS{} are the {\it Content Preperation Unit} and {\it Content Stream Generator}.
     60The final components of the \PS{} are the {\it Content Preperation Unit} and {\it \CSG{}}.
    6761The former takes the (transposed) basis bit streams and selectively filters them, according to the
    6862information provided by the Parallel Markup Parser, and the latter transforms the
     
    7771associated with each symbol occurrence.
    7872This is discussed in Section \ref{section:arch:namespacehandling}.
    79 Finally, the {\it Validation} layer mimics the Xerces's validator; however
    80 the majority of the grammar look-ups are performed beforehand and stored within the symbol themselves.
     73Finally, the {\it Validation} layer implements the Xerces's validator.
     74However, preprocessing associated with each symbol greatly reduces the work of this stage.
     75
     76\begin{figure}[h]
     77\begin{center}
     78\includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf}
     79\end{center}
     80\caption{\icXML{} Architecture}
     81\label{fig:icxml-arch}
     82\end{figure}
    8183
    8284
  • docs/Working/icXML/background-fundemental-differences.tex

    r2522 r2866  
    66XML-specific markup, such as a left angle bracket ``\verb`<`'', and the
    77content held within the document. 
    8 As the parser moves its cursor through the document, it alternates
    9 between markup scanning, validation, and content processing
    10 operations.  In other words, Xerces is a complex
    11 finite-state machine that use byte comparisons to transition between
    12 data and metadata states. Each state transition indicates the context
    13 for subsequent characters. Unfortunately, textual data tends to
    14 consist of variable-length items sequenced in generally unpredictable
    15 patterns; thus any character could be a state transition until deemed
    16 otherwise.
     8As the parser progresses through the document, it alternates between markup scanning,
     9validation and content processing modes.
     10
     11
     12In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
     13  Herein FSM applications are software systems whose behavior is defined by the inputs,
     14  current state and the events associated with transitions of states.}.
     15Each state transition indicates the processing context of subsequent characters.
     16Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
     17
     18% Unfortunately, textual data tends to consist of variable-length strings sequenced in
     19% unpredictable patterns.
     20% Each character must be examined in sequence because any character could be a state transition until deemed otherwise.
     21
     22
     23
    1724
    1825% Parallel: blocks/segments/buffers through layers
    19 Parabix-style XML parsers utilize a concept of layers:
    20 as each block of source text is transformed into a set of lexical bit streams,
    21 it undergoes a series of operations that can be grouped together in logical
    22 layers, such as transposition, character classification, and the lexical analysis
    23 phases. Each layer is pipeline parallel, requiring no speculation nor
    24 pre-parsing\cite{HPCA2012}.
    25 In adapting to the requirements of the Xerces sequential parsing API,
    26 however, the resultant parallel
    27 bit streams, taken individually, may out-of-order \wrt{} the source
    28 document.  Hence they must be amalgamated and iterated through to produce
    29 sequential output.
    30 % The end user should not be expected to work with out-of-order data ...
    31 
    32 % a block of input
    33 % text is transformed into a set of lexical bit streams. Operations are then
    34 % performed on these streams to identify key positions in the input data and
    35 % perform intra-element well-formedness validation (as an artifact of the
    36 % identification process.)
    37 
     26Parabix-style XML parsers utilize a concept of layered processing.
     27A block of source text is transformed into a set of lexical bit streams,
     28which undergo a series of operations that can be grouped into logical layers,
     29e.g., transposition, character classification, and lexical analysis.
     30Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
     31% In adapting to the requirements of the Xerces sequential parsing API,
     32% however, the resultant parallel bit streams may out-of-order \wrt{} the source document.
     33% Hence they must be amalgamated and iterated through to produce sequential output.
     34To meet the API requirements of the document-ordered Xerces output,
     35the results of the Parabix processing layers must be interleaved to produce the equivalent behavior.
  • docs/Working/icXML/background-parabix.tex

    r2530 r2866  
    11\subsection{The Parabix Framework}
    22\label{background:parabix}
    3 
    4 \begin{figure*}[tbh]
    5 \begin{center}
    6 \begin{tabular}{cr}\\
    7 Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
    8 Tag Openers & \verb`1____________1____________________________1____________1__________`\\
    9 Start Tag Marks & \verb`_1____________1___________________________________________________`\\
    10 End Tag Marks & \verb`___________________________________________1____________1_________`\\
    11 Empty Tag Marks & \verb`__________________________________________________________________`\\
    12 Element Names & \verb`_11111111_____1111111_____________________________________________`\\
    13 Attribute Names & \verb`______________________11_______11_________________________________`\\
    14 Attribute Values & \verb`__________________________111________111__________________________`\\
    15 % String Ends & \verb`1____________1_______________1__________1_1____________1__________`\\
    16 % Markup Identifiers & \verb`_________1______________1_________1______1_1____________1_________`\\
    17 % Deletion Mask & \verb`_11111111_____1111111111_1____1111_11_______11111111_____111111111`\\
    18 % Undeleted Data & \verb``{\tt\it 0}\verb`________>fee`{\tt\it 0}\verb`__________=_fie`{\tt\it 0}\verb`____=__foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/________fum`{\tt\it 0}\verb`/_________`
    19 \end{tabular}
    20 \end{center}
    21 \caption{XML Source Data and Derived Parallel Bit Streams}
    22 \label{fig:parabix1}
    23 \end{figure*}
    243
    254The Parabix (parallel bit stream) framework is a transformative approach to XML parsing
     
    3413are used to classify the input bits into a set of {\it character-class bit streams}, which identify key
    3514characters (or groups of characters) with a $1$.
    36 For example, one of the fundemental characters in XML is a left-angle bracket.
     15For example, one of the fundamental characters in XML is a left-angle bracket.
    3716A character is an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
    3817b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
     
    9675The second is that byte-at-a-time loop scanning loops are actually
    9776often just computing a single bit of information per iteration:
    98 is the scan complete at this position yet?  Rather than
    99 computing these individual decision-bits, an approach that computes
     77is the scan complete yet?
     78Rather than computing these individual decision-bits, an approach that computes
    10079many of them in parallel (e.g., 128 bytes at a time using 128-bit registers)
    10180should provide substantial benefit.
    10281
    103 Previous studies have shown Parabix approach improves many aspects of XML processing,
     82Previous studies have shown that the Parabix approach improves many aspects of XML processing,
    10483including transcoding \cite{Cameron2008}, character classification and validation,
    10584tag parsing and well-formedness checking. 
     
    11089well as combining SIMD methods with 4-stage pipeline parallelism to further improve
    11190throughput \cite{HPCA2012}.
    112 Although these research prototypes handled the full syntax of
    113 {\bf grammarless} XML documents, they lacked the functionality required by full XML parsers.
    114 Namely, commercial XML processors, such as Xerces,
    115 as support for transcoding of multiple character sets,
    116 the ability to parse and validate against DTD grammars, both internal and external,
    117 facilities for handling different XML vocabularies through namespace
    118 processing, as well validation against XML Schema grammars. 
    119 Additionally, commercial parsers can be expected to provide a number of API
    120 facilities beyond those found in research prototypes, including
    121 full implementations of the widely used SAX, SAX2 and DOM interfaces.
     91Although these research prototypes handled the full syntax of schema-less XML documents,
     92they lacked the functionality required by full XML parsers.
    12293
     94Commercial XML processors support transcoding of multiple character sets and can parse and
     95validate against multiple document vocabulaties.
     96Additionally, they provide API facilities beyond those found in research prototypes,
     97including the widely used SAX, SAX2 and DOM interfaces.
  • docs/Working/icXML/background-xerces.tex

    r2530 r2866  
    1313of structure and data validation through multiple grammars
    1414declared using either legacy DTDs (document type
    15 definitions) or modern XML schema facilities.
     15definitions) or modern XML Schema facilities.
    1616Xerces also supports several APIs for accessing
    1717parser services, including event-based parsing
  • docs/Working/icXML/icxml-main.tex

    r2862 r2866  
    5252\input{abstract.tex}
    5353\section{Introduction}
    54 \section{Introduction}
    5554
    5655Parallelization and acceleration of XML parsing is a widely
     
    9089end-to-end acceleration of XML processing.   
    9190To achieve the best results possible, we undertook
    92 a comprehensive restructuring of the Xerces-C++ parser,
     91a nine-month comprehensive restructuring of the Xerces-C++ parser,
    9392seeking to expose as many critical aspects of XML parsing
    94 as possible for parallelization.   Overall, we have
    95 employed Parabix-style methods in transcoding, tokenization
    96 and tag parsing,  parallel string comparison methods in symbol
    97 resolution, bit parallel methods in namespace processing, as well as staged
    98 processing with pipeline parallelism to take advantage of
     93as possible for parallelization, the result of which we named icXML. 
     94Overall, we employed Parabix-style methods of transcoding, tokenization
     95and tag parsing, parallel string comparison methods in symbol
     96resolution, bit parallel methods in namespace processing,
     97as well as staged processing using pipeline parallelism to take advantage of
    9998multiple cores.   
    10099
     
    113112applying the techniques discussed herein in other application domains.
    114113
     114
    115115\section{Background}
    116116\label{background}
    117117
    118118\input{background-xerces}
     119
     120\begin{figure*}[th]
     121\begin{center}
     122\begin{tabular}{rr}\\
     123Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
     124Tag Openers & \verb`1____________1____________________________1____________1__________`\\
     125Start Tag Marks & \verb`_1____________1___________________________________________________`\\
     126End Tag Marks & \verb`___________________________________________1____________1_________`\\
     127Empty Tag Marks & \verb`__________________________________________________________________`\\
     128Element Names & \verb`_11111111_____1111111_____________________________________________`\\
     129Attribute Names & \verb`______________________11_______11_________________________________`\\
     130Attribute Values & \verb`__________________________111________111__________________________`\\
     131% String Ends & \verb`1____________1_______________1__________1_1____________1__________`\\
     132% Markup Identifiers & \verb`_________1______________1_________1______1_1____________1_________`\\
     133% Deletion Mask & \verb`_11111111_____1111111111_1____1111_11_______11111111_____111111111`\\
     134% Undeleted Data & \verb``{\tt\it 0}\verb`________>fee`{\tt\it 0}\verb`__________=_fie`{\tt\it 0}\verb`____=__foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/________fum`{\tt\it 0}\verb`/_________`
     135\end{tabular}
     136\end{center}
     137\caption{XML Source Data and Derived Parallel Bit Streams}
     138\label{fig:parabix1}
     139\end{figure*}
     140
    119141\input{background-parabix}
     142
    120143\input{background-fundemental-differences.tex}
    121144
  • docs/Working/icXML/parfilter.tex

    r2530 r2866  
    2424\begin{figure*}[tbh]
    2525\begin{center}
    26 \begin{tabular}{cr}\\
     26\begin{tabular}{rr}\\
    2727Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
    2828% Tag Openers & \verb`1____________1____________________________1____________1__________`\\
Note: See TracChangeset for help on using the changeset viewer.