Changeset 2439


Timestamp:
Oct 11, 2012, 5:09:57 PM (7 years ago)
Author:
nmedfort
Message:

work on overview

Location:
docs/Working/icXML
Files:
5 edited

  • docs/Working/icXML/arch-errorhandling.tex

    r2429 r2439  
    11\subsection{Error Handling}
     2\label{section:arch:errorhandling}
    23
    3 Challenges / Line Col Tracker
     4% Challenges / Line Col Tracker
  • docs/Working/icXML/arch-namespace.tex

    r2429 r2439  
    11\subsection{Namespace Handling}
     2\label{section:arch:namespacehandling}
    23
    34% Xerces stack-oriented vs icXML's bit-field oriented approach
  • docs/Working/icXML/arch-overview.tex

    r2429 r2439  
    11\subsection{Overview}
    22
    3 As the previous section alluded, the greatest difference between sequential parsing methods
    4 and the Parabix parsing model is how data is processed.
    5 Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
    6 tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
    7 events in sequential order, with all attribute data associated with the correct tag.
     3To better understand the difficulties in re-architecting Xerces, it is important to know
     4how Xerces and ICXML differ in design. As shown in Figure \ref{fig:xerces-arch}, Xerces
     5comprises five main modules: the reader, transcoder, scanner, namespace binder, and
     6validator.
     7The transcoder converts all input data into UTF-16; all text runs through this module before
     8being processed as XML. The majority of the character set encoding validation is performed
     9as a byproduct of this process.
     10The reader is responsible for the streaming and buffering of all raw and transcoded text;
     11it keeps track of the current line/column of the cursor, performs all line-break normalization
     12and validates context-specific character set issues, such as tokenization and ensuring each
     13character is legal w.r.t. the XML specification at that position.
     14The scanner pulls data through the reader and constructs the intermediate (and near-final)
     15representation of the document; it deals with all issues related to entity expansion and validates
     16the XML well-formedness constraints and any remaining character set encoding issues that cannot
     17be completely handled by the reader or transcoder (e.g., surrogate characters, validation
     18and normalization of character references).
     19The namespace binder is primarily tasked with handling all namespace scoping issues between
     20different XML vocabularies and assists the scanner with the construction and utilization
     21of Schema grammar structures.
     22The validator's job is to take the intermediate representation produced by the scanner (and
     23potentially annotated by the namespace binder) and assess whether the final output would match
     24the user-created Schema or DTD grammar specification(s).
    825
    9  
    10 The Parabix framework, however, does not allow for this (and would be hindered performance wise if
    11 forced to.)
    12 Thus our first question was, ``How can we take full advantage
    13 of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
    14 when given an input text.
     26\begin{figure}
     27\begin{center}
     28\includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
     29\caption{Xerces Architecture}
     30\label{fig:xerces-arch}
     31\end{center}
     32\end{figure}
    1533
    16 By analyzing Xerces internal data structures and its produced output, two major observations were obvious:
    17 (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
    18 internally (within the grammar structures and hash tables) and externally (for the end user).
    19 (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
    20 declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
    21 grammar reference in place of the element or attribute string.
     34ICXML differs substantially from Xerces in many ways: tasks, as shown in Figure \ref{fig:icxml-arch}, were grouped into
     35logical components, ready for pipeline parallelism. Two major categories of functions exist: those in the parabix subsystem and
     36those in the markup processor. All tasks in the parabix subsystem use the Parabix framework and represent data as bit streams.
     37The character set adapter closely mirrors Xerces's transcoder in terms of responsibility; however, it produces a set of lexical
     38bit streams, similar to those shown in Figure \ref{fig:parabix1}, from the raw input instead of UTF-16.
     39The line-column tracker uses the lexical information to keep track of the cursor position(s) through the use of an
     40optimized population count algorithm, which is described in Section \ref{section:arch:errorhandling}.
     41The parallel markup parser utilizes the same lexical stream to mark key positions within the input data, such as the beginning
     42and ending of tags, element and attribute names, and content. Intra-element well-formedness validation is performed as an
     43artifact of this process.
     44From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
     45% The output of both are required by the markup processor.
     46Apart from the use of the Parabix framework, one of the core differences between ICXML and Xerces is the use of symbols.
     47A typical XML document will contain relatively few unique element and attribute names but each of them will occur
     48frequently throughout the document.
     49Each name is represented by a distinct symbol structure and global identifier (GID).
     50Using the information produced by the parallel markup parser, the {\it symbol resolver} uses a bitscan instruction to
     51iterate through a symbol bit stream (64 bits at a time) to generate a set of GIDs.
     52% The size of this set is, at most, the length of the input data $\div$ 2, as every symbol must have a terminal character.
     53One of the main advantages of this is that grammar information can be associated with the symbol itself and help bypass
     54the lookup cost in the validation process.
     55The final component of the parabix subsystem is the {\it content stream generator}. This component has a multitude of
     56responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce
     57output-ready UTF-16 content for the markup processor.
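The line-column tracking idea mentioned in the text above can be illustrated with a small C++ sketch. It is not icXML's optimized algorithm, which the text defers to the error handling section. It assumes a hypothetical `newlines` bit stream that holds a 1 at every line-feed position, and `LineCol` and `locate` are names invented for this example. `__builtin_popcountll` and `__builtin_clzll` are GCC/Clang builtins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: 1-based line and column of a byte offset.
struct LineCol {
    std::size_t line;
    std::size_t column;
};

// Assumed layout: bit (i % 64) of word (i / 64) is 1 iff input byte i is '\n'.
LineCol locate(const std::vector<std::uint64_t>& newlines, std::size_t pos) {
    assert(pos < newlines.size() * 64);
    const std::size_t word = pos / 64;
    const std::size_t bit = pos % 64;

    // Line number: 1 + number of newlines strictly before pos.
    std::size_t line = 1;
    for (std::size_t w = 0; w < word; ++w)
        line += static_cast<std::size_t>(__builtin_popcountll(newlines[w]));
    const std::uint64_t below = (std::uint64_t{1} << bit) - 1;  // bits 0..bit-1
    line += static_cast<std::size_t>(__builtin_popcountll(newlines[word] & below));

    // Column: distance from the most recent newline before pos,
    // or pos + 1 when pos is on the first line.
    std::size_t column = pos + 1;
    std::uint64_t cur = newlines[word] & below;
    std::size_t w = word;
    while (true) {
        if (cur != 0) {
            // Highest set bit of cur is the last newline in this word.
            const std::size_t last =
                w * 64 + (63 - static_cast<std::size_t>(__builtin_clzll(cur)));
            column = pos - last;
            break;
        }
        if (w == 0) break;
        cur = newlines[--w];
    }
    return {line, column};
}
```

For the input `ab\ncd`, the newline stream has bit 2 set, and `locate` places byte offset 4 (the `d`) on line 2, column 2. A production tracker would presumably keep running counts per block instead of rescanning from the start on every query.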
     58
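The symbol resolver's bitscan loop described above can be sketched roughly as follows. Everything beyond the bitscan idea itself is an assumption made for illustration: the `starts` and `ends` bit streams (marking the first character of each name and the position just past it), the `SymbolTable` class, and the use of a hash map to assign GIDs are hypothetical and may not match icXML's actual data structures. `__builtin_ctzll` is a GCC/Clang builtin standing in for the bitscan instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Visit the position of every 1 bit in a bit stream, 64 bits at a time.
template <typename Visit>
void for_each_set_bit(const std::vector<std::uint64_t>& stream, Visit visit) {
    for (std::size_t w = 0; w < stream.size(); ++w) {
        std::uint64_t word = stream[w];
        while (word != 0) {
            // Bitscan: index of the lowest set bit in the word.
            visit(w * 64 + static_cast<std::size_t>(__builtin_ctzll(word)));
            word &= word - 1;  // clear that bit and continue
        }
    }
}

using Gid = std::uint32_t;

// Hypothetical symbol table: each distinct name gets the next free GID.
class SymbolTable {
public:
    Gid intern(const std::string& name) {
        auto it = ids_.emplace(name, static_cast<Gid>(ids_.size())).first;
        return it->second;
    }

private:
    std::unordered_map<std::string, Gid> ids_;
};

// Turn the marked name occurrences in `input` into a sequence of GIDs.
std::vector<Gid> resolve(const std::string& input,
                         const std::vector<std::uint64_t>& starts,
                         const std::vector<std::uint64_t>& ends,
                         SymbolTable& table) {
    std::vector<std::size_t> begin, end;
    for_each_set_bit(starts, [&](std::size_t p) { begin.push_back(p); });
    for_each_set_bit(ends, [&](std::size_t p) { end.push_back(p); });
    assert(begin.size() == end.size());

    std::vector<Gid> gids;
    for (std::size_t i = 0; i < begin.size(); ++i)
        gids.push_back(table.intern(input.substr(begin[i], end[i] - begin[i])));
    return gids;
}
```

Because repeated names map to the same GID, later stages can work with small integers rather than strings, which is the property the text relies on when it attaches grammar information to the symbol itself.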
     59Everything in the markup processor uses a compressed representation of the document, generated by the
     60symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
     61The {\it WF checker} performs all remaining inter-element well-formedness validation that would be too costly
     62to perform in bitspace, such as ensuring every start tag has a matching end tag.
     63The {\it namespace processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
     64this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to
     65be associated with each instance of a symbol.
     66This is discussed in Section \ref{section:arch:namespacehandling}.
     67The final {\it validation} process is responsible for the same tasks as Xerces's validator; however,
     68the majority of the grammar lookup operations are performed beforehand and their results stored within the symbols themselves.
     69
     70\begin{figure}
     71\includegraphics[width=0.50\textwidth]{plots/icxml.pdf}
     72\caption{ICXML Architecture}
     73\label{fig:icxml-arch}
     74\end{figure}
     75
     76% Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
     77% - over-reliance on hash tables when domain knowledge dictated none would be needed
     78% - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
     79% - abundant use of heap allocated memory
     80% - text conversions done in multiple areas
     81% - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk
     82
     83% As the previous section alluded, the greatest difference between sequential parsing methods
     84% and the Parabix parsing model is how data is processed.
     85% Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
     86% tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
     87% events in sequential order, with all attribute data associated with the correct tag.
     88%
     89%
     90
     91% The Parabix framework, however, does not allow for this (and would be hindered performance wise if
     92% forced to.)
     93% Thus our first question was, ``How can we take full advantage
     94% of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
     95% when given an input text.
     96%
     97% By analyzing Xerces internal data structures and its produced output, two major observations were obvious:
     98% (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
     99% internally (within the grammar structures and hash tables) and externally (for the end user).
     100% (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
     101% declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
     102% grammar reference in place of the element or attribute string.
    22103
    23104
  • docs/Working/icXML/background-parabix.tex

    r2429 r2439  
    7070
    7171
    72 Using a mixture of boolean-logic and arithmetic operations, character-class
    73 bit streams can be transformed into lexical bit streams, where the presence of
    74 a 1 bit identifies a key position in the input data. As an artifact of this
    75 process, intra-element well-formedness validation is performed on each block
    76 of text.
     72% Using a mixture of boolean-logic and arithmetic operations, character-class
     73% bit streams can be transformed into lexical bit streams, where the presence of
     74% a 1 bit identifies a key position in the input data. As an artifact of this
     75% process, intra-element well-formedness validation is performed on each block
     76% of text.
    7777
    7878Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
  • docs/Working/icXML/icxml-main.tex

    r2429 r2439  
    131131\input{arch-errorhandling.tex}
    132132
    133 \begin{figure}
    134 \begin{center}
    135 \includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
    136 \label{fig:xerces-arch}
    137 \caption{}
    138 \end{center}
    139 
    140 \end{figure}
    141 \begin{figure}
    142 \includegraphics[width=0.50\textwidth]{plots/icxml.pdf}
    143 \label{fig:icxml-arch}
    144 \caption{}
    145 \end{figure}
    146 
    147133\section{Performance}
    148134