Timestamp:
Oct 19, 2012, 3:01:59 PM (7 years ago)
Author:
nmedfort
Message:

temp checkin

File:
1 edited

  • docs/Working/icXML/arch-overview.tex

    r2483 r2496  
    11\subsection{Overview}
    22
    3 ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
     3\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
    44rearchitected with pipeline parallelism in mind.
    55In this section, we highlight the core differences between the two systems.
    66As shown in Figure \ref{fig:xerces-arch}, Xerces
    77is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
    8 The {\it Transcoder} converts all input data into UTF16; all text run through this module before
    9 being processed as XML. The majority of the character set encoding validation is performed
    10 as a byproduct of this process.
    11 The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text;
     8The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
     9the majority of the character set encoding validation is performed as a byproduct of this process.
     10The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text;
    1211it keeps track of the current line/column of the cursor (which is reported to the end user in
    13 the unlikely event that the input file contains an error), performs all line-break normalization
     12the unlikely event that the input contains an error), performs all line-break normalization
    1413and validates context-specific character set issues, such as tokenization of qualified-names and
    15 ensuring each character is legal w.r.t. the XML specification.
     14ensures each character is legal w.r.t. the XML specification.
    1615The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final)
    1716representation of the document; it deals with all issues related to entity expansion, validates
    18 the XML wellformedness constraints and any character set encoding issues that cannot
     17the XML well-formedness constraints and any character set encoding issues that cannot
    1918be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    2019and normalization of character references, etc.)
    2120The {\it Namespace Binder}, which is a core piece of their element stack, is primarily tasked
    22 with handling all namespace scoping issues between different XML vocabularies and faciliates
     21with handling namespace scoping issues between different XML vocabularies and facilitates
    2322the scanner with the construction and utilization of Schema grammar structures.
    2423The {\it Validator} takes the intermediate representation produced by the Scanner (and
    2524potentially annotated by the Namespace Binder) and assesses whether the final output matches
    26 the user-defined DTD and Schema grammar(s) before passing the data to the end-user.
     25the user-defined DTD and Schema grammar(s) before passing the information to the end-user.
    2726
    2827\begin{figure}
    2928\begin{center}
    3029\includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
     30\caption{Xerces Architecture}
    3131\label{fig:xerces-arch}
    32 \caption{Xerces Architecture}
    3332\end{center}
    3433\end{figure}
    3534
    36 In ICXML functions are grouped into logical components.
     35In \icXML{}, functions are grouped into logical components.
    3736As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
    3837All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
     
    4645the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
    4746Intra-element well-formedness validation is performed as an artifact of this process.
    48 Like Xerces, ICXML must provide the Line and Column position of each error.
     47Like Xerces, \icXML{} must provide the Line and Column position of each error.
    4948The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
    50 optimized population count algorithm. This is described in Section \ref{section:arch:errorhandling}.
    51 
    52 
    53 
     49optimized population count algorithm; this is described in Section \ref{section:arch:errorhandling}.
    5450From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
    5551% The output of both are required by the \MP{}.
    56 Apart from the use of the Parabix framework, one of the core differences between ICXML and Xerces is the use of symbols.
    57 A typical XML document will contain relatively few unique element and attribute names but each of them will occur
    58 frequently throughout the document.
    59 Each name is represented by a distinct symbol structure and global identifier (GID).
     52Apart from the Parabix framework, another core difference between Xerces and \icXML{} is the use of symbols.
     53A typical XML document will contain relatively few unique element and attribute names---but each of them will occur frequently throughout the document.
     54In \icXML{}, names are represented by distinct symbol structures and global identifiers (GIDs).
    6055Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to
    6156iterate through a symbol bit stream (64-bits at a time) to generate a set of GIDs.
     
    6459the lookup cost in the validation process.
    6560The final component of the \PS{} is the {\it Content Stream Generator}. This component has a multitude of
    66 responsibilities, which will be discussed in Section \ref{sec:parfilter}, but the primary function of this is to produce
    67 output-ready UTF-16 content for the \MP{}.
     61responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce
     62near-final UTF-16 content.
    6863
    69 Everything in the \MP{} uses a compressed representation of the document, generated by the
    70 symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
     64The {\it \MP{}} parses a compressed representation of the XML document, generated by the
     65symbol resolver and content stream generator, to validate and produce the final (sequential) output for the end user.
    7166The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly
    7267to perform in bitspace, such as ensuring every start tag has a matching end tag.
    7368The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
    74 this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to
    75 be associated with each instance of a symbol.
     69this is performed as a discrete phase and simply produces a set of URI identifiers (URI IDs), to
     70be associated with each occurrence of a symbol.
    7671This is discussed in Section \ref{section:arch:namespacehandling}.
    7772The final {\it Validation} process is responsible for the same tasks as Xerces's validator, however,
     
    8075\begin{figure}
    8176\includegraphics[width=0.50\textwidth]{plots/icxml.pdf}
     77\caption{\icXML{} Architecture}
    8278\label{fig:icxml-arch}
    83 \caption{ICXML Architecture}
    8479\end{figure}
    8580
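
Reviewer sketch (not part of the changeset): the revised text says the Symbol Resolver uses a bitscan intrinsic to walk a symbol bit stream 64 bits at a time, and that the Line-Column Tracker relies on a population count. The C++ below is a minimal illustration of those two ideas only. The function names `marked_positions` and `line_of`, the `std::vector<std::uint64_t>` stream layout, and the bit ordering (bit 0 of block 0 is the first input position) are all assumptions made for this example, and `__builtin_ctzll` / `__builtin_popcountll` are GCC/Clang builtins standing in for whatever intrinsics icXML actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: return the positions of all set bits in a bit stream
// stored as 64-bit blocks. A set bit is assumed to mark something of
// interest, for example the start of a symbol.
std::vector<std::size_t> marked_positions(const std::vector<std::uint64_t>& stream) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < stream.size(); ++i) {
        std::uint64_t block = stream[i];
        while (block != 0) {
            // Bit scan: index of the lowest set bit in this block.
            std::size_t bit = static_cast<std::size_t>(__builtin_ctzll(block));
            positions.push_back(i * 64 + bit);
            block &= block - 1;  // clear the lowest set bit and continue
        }
    }
    return positions;
}

// Hypothetical helper: 1-based line number of position `pos`, given a bit
// stream in which each set bit marks a line break. Only line breaks strictly
// before `pos` are counted.
std::size_t line_of(const std::vector<std::uint64_t>& newlines, std::size_t pos) {
    std::size_t line = 1;
    const std::size_t full_blocks = pos / 64;
    // Whole blocks before `pos`: one population count per 64 positions.
    for (std::size_t i = 0; i < full_blocks && i < newlines.size(); ++i) {
        line += static_cast<std::size_t>(__builtin_popcountll(newlines[i]));
    }
    // Partial block: mask off the bits at or after `pos`.
    const std::size_t rem = pos % 64;
    if (rem != 0 && full_blocks < newlines.size()) {
        const std::uint64_t mask = (std::uint64_t{1} << rem) - 1;
        line += static_cast<std::size_t>(__builtin_popcountll(newlines[full_blocks] & mask));
    }
    return line;
}
```

In this sketch the cost of `marked_positions` grows with the number of marked bits rather than the number of input bytes, and `line_of` only needs to run when an error has to be reported. Column tracking and the optimizations the text alludes to are left out.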