Timestamp:
Jan 30, 2013, 4:12:43 PM
Author:
nmedfort
Message:

edits

File:
1 edited

  • docs/Working/icXML/arch-overview.tex

    r2532 r2866  
    11\subsection{Overview}
    2 \begin{figure}
    3 \begin{center}
    4 \includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
    5 \caption{Xerces Architecture}
    6 \label{fig:xerces-arch}
    7 \end{center}
    8 \end{figure}
     2
     3\def \CSG{Stream Generator}
    94
    105\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
     
    1510The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
    1611the majority of the character set encoding validation is performed as a byproduct of this process.
    17 The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text;
    18 it keeps track of the current line/column of the cursor (which is reported to the end user in
    19 the unlikely event that the input contains an error), performs all line-break normalization
    20 and validates context-specific character set issues, such as tokenization of qualified-names and
    21 ensures each character is legal \wrt{} the XML specification.
     12The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
     13It tracks the current line/column position,
     14%(which is reported in the unlikely event that the input contains an error),
     15performs line-break normalization and validates context-specific character set issues,
     16such as tokenization of qualified-names.
    2217The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
    2318of the document; it deals with all issues related to entity expansion, validates
     
    2520be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    2621and normalization of character references, etc.)
    27 The {\it Namespace Binder}, which is a core piece of the element stack, is primarily tasked
    28 with handling namespace scoping issues between different XML vocabularies and facilitates
    29 the scanner with the construction and utilization of Schema grammar structures.
     22The {\it Namespace Binder} is a core piece of the element stack.
     23It handles namespace scoping issues between different XML vocabularies.
     24This allows the scanner to properly select the correct schema grammar structures.
    3025The {\it Validator} takes the IR produced by the Scanner (and
    3126potentially annotated by the Namespace Binder) and assesses whether the final output matches
    32 the user-defined DTD and Schema grammar(s) before passing it to the end-user.
     27the user-defined DTD and schema grammar(s) before passing it to the end-user.
    3328
    34 \begin{figure}
    35 \includegraphics[width=0.47\textwidth]{plots/icxml.pdf}
    36 \caption{\icXML{} Architecture}
    37 \label{fig:icxml-arch}
     29\begin{figure}[h]
     30\begin{center}
     31\includegraphics[height=0.45\textheight,keepaspectratio]{plots/xerces.pdf}
     32\caption{Xerces Architecture}
     33\label{fig:xerces-arch}
     34\end{center}
    3835\end{figure}
    3936
     
    4441mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
    4542set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
    46 These lexical bit streams are later transformed into UTF-16 in the Content Stream Generator,
     43These lexical bit streams are later transformed into UTF-16 in the \CSG{},
    4744after additional processing is performed.
    4845The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
     
    5249Intra-element well-formedness validation is performed as an artifact of this process.
    5350Like Xerces, \icXML{} must provide the Line and Column position of each error.
    54 The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
     51The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
    5552optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
    5653From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
    5754
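As a rough illustration of the Line-Column Tracker idea described above, the following Python sketch derives a line and column from a bit stream that marks line feeds, using a population count. It is only a minimal sketch under stated assumptions: the names `newline_mask` and `line_column` are hypothetical, a Python integer stands in for the lexical bit stream, and the optimized algorithm the text refers to is not reproduced here.

```python
from typing import Tuple


def newline_mask(data: bytes) -> int:
    """Build a bit stream (as an int) with bit i set when data[i] is a line feed."""
    mask = 0
    for i, byte in enumerate(data):
        if byte == 0x0A:
            mask |= 1 << i
    return mask


def line_column(lf_mask: int, pos: int) -> Tuple[int, int]:
    """Return the 1-based (line, column) of byte offset pos."""
    # Keep only the line-feed bits strictly before pos.
    before = lf_mask & ((1 << pos) - 1)
    # Population count: the number of line feeds seen so far gives the line.
    line = 1 + bin(before).count("1")
    # Offset of the last line feed before pos, or -1 if there is none.
    last_lf = before.bit_length() - 1
    column = pos - last_lf
    return line, column
```

For the input `b"<a>\n<b/>\n</a>"`, offset 5 (the `b` of `<b/>`) maps to line 2, column 2.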
    58 \icXML{} represents elements and attributes as distinct data structures, called symbols,
    59 each with their own global identifier (GID).
    60 Using the {\bf symbol marker streams} produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
    61 the raw data to produce a stream (series) of GIDs, called the {\it symbol stream}.
    62 A typical XML file will contain relatively few unique element and attribute names---but each of them will occur
    63 frequently. % throughout the document.
    64 % Grammar information can be associated with each symbol and can help reduce the look-up cost of the later Validation process.
     55A typical XML file contains few unique element and attribute names---but each of them will occur frequently.
     56\icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID).
     57Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
     58the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
    6559
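The mapping from names to GIDs described above can be sketched as a simple interning table. This is an illustrative assumption rather than the icXML data structure: `SymbolTable` and `symbol_stream` are hypothetical names, and the sketch takes already-extracted names as input instead of scanning raw data with symbol marker streams.

```python
class SymbolTable:
    """Assign each distinct element or attribute name a global identifier (GID)."""

    def __init__(self):
        self._gids = {}   # name -> GID
        self.names = []   # GID -> name

    def resolve(self, name):
        """Return the GID for name, allocating a new one on first sight."""
        gid = self._gids.get(name)
        if gid is None:
            gid = len(self.names)
            self._gids[name] = gid
            self.names.append(name)
        return gid


def symbol_stream(names, table):
    """Turn a sequence of names into the corresponding sequence of GIDs."""
    return [table.resolve(name) for name in names]
```

Because a typical document repeats a small set of names, most calls to `resolve` hit the existing entry and only return a stored integer.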
    66 The final components of the \PS{} are the {\it Content Preparation Unit} and {\it Content Stream Generator}.
     60The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
    6761The former takes the (transposed) basis bit streams and selectively filters them, according to the
    6862information provided by the Parallel Markup Parser, and the latter transforms the
     
    7771associated with each symbol occurrence.
    7872This is discussed in Section \ref{section:arch:namespacehandling}.
    79 Finally, the {\it Validation} layer mimics Xerces's validator; however
    80 the majority of the grammar look-ups are performed beforehand and stored within the symbols themselves.
     73Finally, the {\it Validation} layer implements Xerces's validator.
     74However, preprocessing associated with each symbol greatly reduces the work of this stage.
     75
     76\begin{figure}[h]
     77\begin{center}
     78\includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf}
     79\end{center}
     80\caption{\icXML{} Architecture}
     81\label{fig:icxml-arch}
     82\end{figure}
    8183
    8284