# Changeset 2470 for docs/Working/icXML/arch-overview.tex

Timestamp:
Oct 16, 2012, 5:48:15 PM
Message:

More work; mostly edits

File:
1 edited

Base revision: r2439

  \subsection{Overview}
- To better understand the difficulties in re-architecting Xerces, it is important to know how Xerces and ICXML differ design wise.
- As shown in Figure \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and validator.
- The transcoder converts all input data into UTF16; all text runs through this module before
+ ICXML is more than an optimized version of Xerces.
+ Many components were grouped, restructured and rearchitected into a pipeline-parallel ready structure.
+ In this section, we highlight the core differences between the two systems and discuss how they differ design wise.
+ As shown in Figure \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and validator.
+ The {\it Transcoder} converts all input data into UTF16; all text runs through this module before
  being processed as XML. The majority of the character set encoding validation is performed as a byproduct of this process.
- The reader is responsible for the streaming and buffering of all raw and transposed text; it keeps track of the current line/column of the cursor, performs all line-break normalization and validates context-specific character set issues, such as tokenization and ensuring each character is legal w.r.t. the XML specification at that position.
- The scanner pulls data through the reader and constructs the intermediate (and near-final)
+ The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text; it keeps track of the current line/column of the cursor (which is reported to the end user in the unlikely event that the input file contains an error), performs all line-break normalization and validates context-specific character set issues, such as tokenization of qualified-names and ensuring each character is legal w.r.t. the XML specification.
+ The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final)
  representation of the document; it deals with all issues related to entity expansion, validates
- the XML wellformedness constraints, and remaining character set encoding issues that cannot
+ the XML wellformedness constraints and any character set encoding issues that cannot
  be completely handled by the reader or transcoder (e.g., surrogate characters, validation
- and normalization of character references). The namespace binder is primarily tasked with handling all namespace scoping issues between
+ and normalization of character references, etc.) The {\it Namespace binder} is primarily tasked with handling all namespace scoping issues between
  different XML vocabularies and facilitates the scanner with the construction and utilization of Schema grammar structures.
- The validator's job is to take the intermediate representation produced by the scanner (and potentially annotated by the namespace binder) and assess whether the final output would match the user-created Schema or DTD grammar specification(s).
+ The {\it Validator} takes the intermediate representation produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and Schema grammar(s).
  \begin{figure}
  \end{figure}
- ICXML differs substantially from Xerces in many ways: tasks, as shown in Figure \ref{fig:icxml-arch} were grouped into logical components, ready for pipeline parallelism.
- Two major categories of functions exist: those in the parabix subsystem, and those in the markup processor.
- All tasks in the parabix subsystem use the parabix framework and represent data as bit streams.
- The character set adapter closely mirrors Xerces's transcoder in terms of responsibility; however it produces a set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}, from the raw input instead of UTF16.
- The line-column tracker uses the lexical information to keep track of the cursor position(s) through the use of an optimized population count algorithm, which is described in Section \ref{section:arch:errorhandling}.
- The parallel markup parser utilizes the same lexical stream to mark key positions within the input data, such as the beginning and ending of tags, element and attribute names, and content.
- Intra-element well-formedness validation is performed as an artifact of this process.
+ In ICXML, tasks, as shown in Figure \ref{fig:icxml-arch} are grouped into logical components.
+ Two major categories of functions exist: those in the parabix subsystem, and those in the markup processor.
+ All tasks in the parabix subsystem use the parabix framework {\bf (citation?)} and represent data as a series of bit streams, which are discussed in Section \ref{background:parabix}.
+ The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter}, closely mirrors Xerces's transcoder duties; however instead of producing UTF16 it produces a set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
+ These lexical bit streams are later transformed into UTF-16 in the Content Buffer Generator, after additional processing is performed.
+ The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
+ It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies significant positions within the input data.
+ One bit stream for each of the critical pieces of information is created, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
+ Intra-element well-formedness validation is performed as an artifact of this process.
+ Like Xerces, ICXML must provide the Line and Column position of each error.
+ The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an optimized population count algorithm.
+ This is described in Section \ref{section:arch:errorhandling}.
  From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
  % The output of both are required by the markup processor.

  frequently throughout the document. Each name is represented by a distinct symbol structure and global identifier (GID).
- Using the information produced by the parallel markup parser, the {\it symbol resolver} uses a bitscan instruction to
+ Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to
  iterate through a symbol bit stream (64 bits at a time) to generate a set of GIDs.
  % The size of this set is, at most, the length of the input data $\div$ 2, as every symbol must have a terminal character.
- One of the main advantages of this is that grammar information can be associated with the symbol itself and help bypass
+ % It keys each symbol on its raw data representation, which means it can potentially be run in parallel with the content stream generator.
+ One of the main advantages of using GIDs is that grammar information can be associated with the symbol itself and help bypass
  the lookup cost in the validation process.
- The final component of the parabix subsystem is the {\it content stream generator}. This component has a multitude of
+ The final component of the parabix subsystem is the {\it Content Stream Generator}. This component has a multitude of
  responsibilities, which will be discussed in Section \ref{sec:parfilter}, but the primary function of this is to produce output-ready UTF-16 content for the markup processor.

  The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly to perform in bitspace, such as ensuring every start tag has a matching end tag.
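The text above mentions two bit-level techniques: scanning a marker bit stream 64 bits at a time with a bit-scan intrinsic, and counting set bits with a population count. The C++ sketch below illustrates both in a minimal form. It is not icXML's code: the function names (`scan_forward`, `count_bits`, `marked_positions`, `count_before`) are hypothetical, the bit order (bit `i` of word `w` stands for input position `w * 64 + i`) is an assumption, and the mapping from marked positions to GIDs is omitted. The population count here is the plain version, not the optimized algorithm the section refers to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the lowest set bit. Precondition: word != 0.
// Uses the GCC/Clang builtin where available, otherwise a portable loop.
inline unsigned scan_forward(std::uint64_t word) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_ctzll(word));
#else
    unsigned i = 0;
    while ((word & 1u) == 0) { word >>= 1; ++i; }
    return i;
#endif
}

// Number of set bits in the word (population count).
inline unsigned count_bits(std::uint64_t word) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_popcountll(word));
#else
    unsigned n = 0;
    for (; word != 0; word &= word - 1) ++n;
    return n;
#endif
}

// Collect the absolute positions of all 1-bits in a marker stream,
// visiting the stream one 64-bit word at a time.
// Assumption: bit i of word w marks input position w * 64 + i.
std::vector<std::size_t> marked_positions(const std::vector<std::uint64_t>& stream) {
    std::vector<std::size_t> positions;
    for (std::size_t w = 0; w < stream.size(); ++w) {
        std::uint64_t word = stream[w];
        while (word != 0) {
            positions.push_back(w * 64 + scan_forward(word));
            word &= word - 1;  // clear the lowest set bit and continue
        }
    }
    return positions;
}

// Count the 1-bits strictly before bit position pos, for example the
// line breaks preceding an error position in a hypothetical newline stream.
std::size_t count_before(const std::vector<std::uint64_t>& stream, std::size_t pos) {
    std::size_t total = 0;
    const std::size_t full_words = pos / 64;
    for (std::size_t w = 0; w < full_words && w < stream.size(); ++w)
        total += count_bits(stream[w]);
    const unsigned rem = static_cast<unsigned>(pos % 64);
    if (rem != 0 && full_words < stream.size()) {
        const std::uint64_t mask = (std::uint64_t{1} << rem) - 1;
        total += count_bits(stream[full_words] & mask);
    }
    return total;
}
```

Under these assumptions, a symbol resolver would call something like `marked_positions` on the symbol stream and look up each position's name, while a line tracker would obtain a zero-based line index from `count_before` applied to a newline stream.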
- The {\it namespace processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
+ The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
  this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to be associated with each instance of a symbol.
  This is discussed in Section \ref{section:arch:namespacehandling}.
- The final {\it validation} process is responsible for the same tasks as Xerces's validator, however, the majority of the grammar look up operations is performed beforehand and stored within the symbols themselves.
+ The final {\it Validation} process is responsible for the same tasks as Xerces's validator, however, the majority of the grammar look up operations are performed beforehand and stored within the symbols themselves.
  \begin{figure}