Timestamp:
Oct 16, 2012, 5:48:15 PM
Author:
nmedfort
Message:

More work; mostly edits

File:
1 edited

  • docs/Working/icXML/arch-overview.tex

    r2439 r2470  
    11\subsection{Overview}
    22
    3 To better understand the difficulties in re-architecting Xerces, it is important to know
    4 how Xerces and ICXML differ design wise. As shown in Figure \ref{fig:xerces-arch}, Xerces
    5 is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and
    6 validator.
    7 The transcoder converts all input data into UTF16; all text run through this module before
     3ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
     4rearchitected into a pipeline-parallel-ready structure.
     5In this section, we highlight the core differences between the two systems and discuss how they
     6differ design wise.
     7As shown in Figure \ref{fig:xerces-arch}, Xerces
     8is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and validator.
     9The {\it Transcoder} converts all input data into UTF-16; all text runs through this module before
    810being processed as XML. The majority of the character set encoding validation is performed
    911as a byproduct of this process.
    10 The reader is responsible for the streaming and buffering of all raw and transposed text;
    11 it keeps track of the current line/column of the cursor, performs all line-break normalization
    12 and validates context-specific character set issues, such as tokenization and ensuring each
    13 character is legal w.r.t. the XML specification at that position.
    14 The scanner pulls data through the reader and constructs the intermediate (and near-final)
     12The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text;
     13it keeps track of the current line/column of the cursor (which is reported to the end user in
     14the unlikely event that the input file contains an error), performs all line-break normalization
     15and validates context-specific character set issues, such as tokenization of qualified-names and
     16ensuring each character is legal w.r.t. the XML specification.
     17The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final)
    1518representation of the document; it deals with all issues related to entity expansion, validates
    16 the XML wellformedness constraints, and remaining character set encoding issues that cannot
     19the XML wellformedness constraints and any character set encoding issues that cannot
    1720be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    18 and normalization of character references).
    19 The namespace binder is primarily tasked with handling all namespace scoping issues between
     21and normalization of character references, etc.)
     22The {\it Namespace binder} is primarily tasked with handling all namespace scoping issues between
    2023different XML vocabularies and facilitates the scanner with the construction and utilization
    2124of Schema grammar structures.
    22 The validator's job is to take the intermediate representation produced by the scanner (and
    23 potentially annotated by the namespace binder) and assess whether the final output would match
    24 the user-created Schema or DTD grammar specification(s).
     25The {\it Validator} takes the intermediate representation produced by the Scanner (and
     26potentially annotated by the Namespace Binder) and assesses whether the final output matches
     27the user-defined DTD and Schema grammar(s).
    2528
    2629\begin{figure}
     
    3235\end{figure}
    3336
    34 ICXML differs substantially from Xerces in many ways: tasks, as shown in Figure \ref{fig:icxml-arch} were grouped into
    35 logical components, ready for pipeline parallism. Two major categories of functions exist: those in the parabix subsystem, and
    36 those in the markup processor. All tasks in the parabix subsystem use the parabix framework and represent data as bit streams.
    37 The character set adapter closely mirrors Xerces's transcoder in terms of responsibility; however it produces a set of lexical
    38 bit streams, similar to those shown in Figure \ref{fig:parabix1}, from the raw input instead of UTF16.
    39 The line-column tracker uses the lexical information to keep track of the cursor position(s) through the use of an
    40 optimized population count algorithm, which is described in Section \ref{section:arch:errorhandling}.
    41 The parallel markup parser utilizes the same lexical stream to mark key positions within the input data, such as the beginning
    42 and ending of tags, element and attribute names, and content. Intra-element well-formedness validation is performed as an
    43 artifact of this process.
     37In ICXML, tasks are grouped into logical components, as shown in Figure \ref{fig:icxml-arch}.
     38Two major categories of functions exist: those in the parabix subsystem, and
     39those in the markup processor. All tasks in the parabix subsystem use the parabix framework {\bf (citation?)} and represent
     40data as a series of bit streams, which are discussed in Section \ref{background:parabix}.
     41The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
     42 closely mirrors the duties of Xerces's transcoder; however, instead of producing UTF-16, it produces a
     43set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}. These lexical bit streams are later transformed
     44into UTF-16 in the Content Buffer Generator, after additional processing is performed.
     45The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
     46It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
     47significant positions within the input data. One bit stream is created for each critical piece of information, such as
     48the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
     49Intra-element well-formedness validation is performed as an artifact of this process.
     50Like Xerces, ICXML must provide the Line and Column position of each error.
     51The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
     52optimized population count algorithm. This is described in Section \ref{section:arch:errorhandling}.
     53
     54
     55
    4456From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
    4557% The output of both are required by the markup processor.
     
    4860frequently throughout the document.
    4961Each name is represented by a distinct symbol structure and global identifier (GID).
    50 Using the information produced by the parallel markup parser, the {\it symbol resolver} uses a bitscan instruction to
     62Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to
    5163iterate through a symbol bit stream (64 bits at a time) to generate a set of GIDs.
    52 % This size of this set is, at most, the length of the input data $\div$ 2, as every symbol must have a terminal character.
    53 One of the main advantages of this is that grammar information can be associated with the symbol itself and help bypass
     64% It keys each symbol on its raw data representation, which means it can potentially be run in parallel with the content stream generator.
     65One of the main advantages of using GIDs is that grammar information can be associated with the symbol itself and help bypass
    5466the lookup cost in the validation process.
    55 The final component of the parabix subsystem is the {\it content stream generator}. This component has a multitude of
     67The final component of the parabix subsystem is the {\it Content Stream Generator}. This component has a multitude of
    5668responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce
    5769output-ready UTF-16 content for the markup processor.
     
    6173The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly
    6274to perform in bitspace, such as ensuring every start tag has a matching end tag.
    63 The {\it namespace processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
     75The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
    6476this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to
    6577be associated with each instance of a symbol.
    6678This is discussed in Section \ref{section:arch:namespacehandling}.
    67 The final {\it validation} process is responsible for the same tasks as Xerces's validator, however,
    68 the majority of the grammar look up operations is performed beforehand and stored within the symbols themselves.
     79The final {\it Validation} process is responsible for the same tasks as Xerces's validator; however,
     80the majority of the grammar look up operations are performed beforehand and stored within the symbols themselves.
    6981
    7082\begin{figure}
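The revised text describes two bit-stream techniques: the Symbol Resolver iterates through a marker bit stream 64 bits at a time with a bitscan intrinsic, and the Line-Column Tracker derives cursor positions with a population count. The following is a minimal C++ sketch of both ideas, not the ICXML implementation. The function names, the `std::vector<std::uint64_t>` stream layout, and the bit ordering (bit `i` of word `w` stands for input position `64*w + i`) are assumptions made for illustration. `__builtin_ctzll` and `__builtin_popcountll` are GCC/Clang builtins, and the popcount loop shown here is the plain form rather than the optimized algorithm the section refers to.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: return the absolute position of every 1-bit in a
// marker bit stream. Assumed layout: bit i of word w marks position 64*w + i.
std::vector<std::size_t> scan_marker_positions(const std::vector<std::uint64_t>& stream) {
    std::vector<std::size_t> positions;
    for (std::size_t w = 0; w < stream.size(); ++w) {
        std::uint64_t word = stream[w];
        while (word != 0) {
            // Bitscan: index of the lowest set bit (GCC/Clang builtin).
            unsigned bit = static_cast<unsigned>(__builtin_ctzll(word));
            positions.push_back(w * 64 + bit);
            word &= word - 1;  // clear the lowest set bit and continue
        }
    }
    return positions;
}

// Hypothetical helper: count the line breaks that occur strictly before
// position pos, given a bit stream with a 1-bit at every line-feed position.
// The (0-based) line number of pos is this count.
std::size_t count_line_breaks_before(const std::vector<std::uint64_t>& lf_stream,
                                     std::size_t pos) {
    std::size_t count = 0;
    const std::size_t full_words = pos / 64;
    // Whole 64-bit words that lie entirely before pos.
    for (std::size_t w = 0; w < full_words && w < lf_stream.size(); ++w) {
        count += static_cast<std::size_t>(__builtin_popcountll(lf_stream[w]));
    }
    // Partial word: keep only the bits below pos.
    const unsigned rem = static_cast<unsigned>(pos % 64);
    if (rem != 0 && full_words < lf_stream.size()) {
        const std::uint64_t mask = (std::uint64_t{1} << rem) - 1;
        count += static_cast<std::size_t>(__builtin_popcountll(lf_stream[full_words] & mask));
    }
    return count;
}
```

In this sketch, each position returned by `scan_marker_positions` would be the point at which a symbol is looked up and assigned its GID, while `count_line_breaks_before` only needs to be called when an error position has to be reported.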