% Source: docs/Working/icXML/arch-overview.tex, revision 2471 (checked in by nmedfort)
ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
rearchitected with pipeline parallelism in mind.
In this section, we highlight the core differences between the two systems and discuss how their
designs differ.
As shown in Figure \ref{fig:xerces-arch}, Xerces
consists of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
The {\it Transcoder} converts all input data into UTF-16; all text runs through this module before
being processed as XML. The majority of the character set encoding validation is performed
as a byproduct of this process.
The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded text;
it keeps track of the current line/column of the cursor (which is reported to the end user in
the unlikely event that the input file contains an error), performs all line-break normalization,
and validates context-specific character set issues, such as tokenization of qualified names and
ensuring each character is legal w.r.t. the XML specification.
The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final)
representation of the document; it deals with all issues related to entity expansion, validates
the XML well-formedness constraints and any character set encoding issues that cannot
be completely handled by the reader or transcoder (e.g., surrogate characters, validation
and normalization of character references, etc.).
The {\it Namespace Binder}, which is a core piece of the Xerces element stack, is primarily tasked
with handling all namespace scoping issues between different XML vocabularies and assists
the scanner with the construction and utilization of Schema grammar structures.
The {\it Validator} takes the intermediate representation produced by the Scanner (and
potentially annotated by the Namespace Binder) and assesses whether the final output matches
the user-defined DTD and Schema grammar(s) before passing the data to the end user.
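
\begin{figure}
% Figure graphic not recoverable from this extract.
\caption{Xerces Architecture}
\label{fig:xerces-arch}
\end{figure}
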
In ICXML, functions are grouped into logical components.
As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
mirrors Xerces's Transcoder duties; however, instead of producing UTF-16 it produces a
set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
These lexical bit streams are later transformed into UTF-16 in the Content Buffer Generator, after additional processing is performed.
The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
significant positions within the input data. One bit stream is created for each critical piece of information, such as
the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
Intra-element well-formedness validation is performed as an artifact of this process.
Like Xerces, ICXML must provide the line and column position of each error.
The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
optimized population count algorithm. This is described in Section \ref{section:arch:errorhandling}.
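
The idea behind population-count line tracking can be sketched in a few lines of C++.
This is an illustration under stated assumptions, not the ICXML implementation: the helper
names \texttt{newline\_stream}, \texttt{Position} and \texttt{locate} are hypothetical, the newline
marker stream is built with a scalar loop rather than Parabix SIMD logic, and
\texttt{\_\_builtin\_popcountll} is the GCC/Clang builtin (C++20 code could use \texttt{std::popcount}).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper: one bit per input byte, set where the byte is '\n'.
// Bit i of word w corresponds to text[64 * w + i]. A scalar stand-in for a
// marker stream that a bit-parallel framework would compute in bulk.
std::vector<std::uint64_t> newline_stream(std::string_view text) {
    std::vector<std::uint64_t> words((text.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < text.size(); ++i)
        if (text[i] == '\n') words[i / 64] |= std::uint64_t{1} << (i % 64);
    return words;
}

struct Position {
    std::size_t line;    // 1-based
    std::size_t column;  // 1-based
};

// Line number = 1 + number of newline bits strictly before pos,
// counted one 64-bit word at a time with a population count.
Position locate(const std::vector<std::uint64_t>& nl, std::size_t pos) {
    assert(pos <= 64 * nl.size());
    std::size_t line = 1;
    const std::size_t full = pos / 64;
    for (std::size_t w = 0; w < full; ++w)
        line += static_cast<std::size_t>(__builtin_popcountll(nl[w]));
    const unsigned rem = static_cast<unsigned>(pos % 64);
    if (rem != 0) {
        const std::uint64_t below = (std::uint64_t{1} << rem) - 1;
        line += static_cast<std::size_t>(__builtin_popcountll(nl[full] & below));
    }
    // Column: distance back to the previous newline (simple linear scan).
    std::size_t column = 1;
    for (std::size_t i = pos; i > 0; --i) {
        if ((nl[(i - 1) / 64] >> ((i - 1) % 64)) & 1) break;
        ++column;
    }
    return {line, column};
}
```

Because the line number is only needed when an error is reported, the count can be deferred
until then; the error-free path pays only for producing the newline stream.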

From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
% The output of both is required by the \MP{}.
Apart from the use of the Parabix framework, one of the core differences between ICXML and Xerces is the use of symbols.
A typical XML document will contain relatively few unique element and attribute names but each of them will occur
frequently throughout the document.
Each name is represented by a distinct symbol structure and global identifier (GID).
Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to
iterate through a symbol bit stream (64 bits at a time) to generate a set of GIDs.
% It keys each symbol on its raw data representation, which means it can potentially be run in parallel with the content stream generator.
One of the main advantages of using GIDs is that grammar information can be associated with the symbol itself, which helps bypass
the lookup cost in the validation process.
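
A minimal sketch of this scan-and-intern loop follows. Everything in it is an assumption made
for illustration: the types \texttt{Symbol} and \texttt{SymbolTable}, the functions
\texttt{bit\_positions} and \texttt{resolve}, the use of separate start and end marker streams, and the
hash map used for interning are hypothetical and are not taken from ICXML. The bitscan is the GCC/Clang
builtin \texttt{\_\_builtin\_ctzll}; for brevity the positions are collected into vectors rather than
consumed as they are found.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical symbol record: the name plus a slot where grammar information
// could be cached so that validation does not have to look it up again.
struct Symbol {
    std::string name;
    int element_decl = -1;  // placeholder grammar reference; -1 means unresolved
};

struct SymbolTable {
    std::unordered_map<std::string, std::uint32_t> gid_of;
    std::vector<Symbol> symbols;  // indexed by GID

    // Returns the GID for name, allocating a new one on first sight.
    std::uint32_t intern(std::string_view name) {
        const auto next = static_cast<std::uint32_t>(symbols.size());
        auto [it, inserted] = gid_of.try_emplace(std::string(name), next);
        if (inserted) symbols.push_back(Symbol{std::string(name)});
        return it->second;
    }
};

// Positions of the 1-bits in a stream, in increasing order. Each 64-bit word
// is scanned with a count-trailing-zeros intrinsic; the lowest set bit is
// cleared after every hit, so the work is proportional to the number of marks.
std::vector<std::size_t> bit_positions(const std::vector<std::uint64_t>& stream) {
    std::vector<std::size_t> out;
    for (std::size_t w = 0; w < stream.size(); ++w) {
        std::uint64_t bits = stream[w];
        while (bits != 0) {
            out.push_back(64 * w + static_cast<std::size_t>(__builtin_ctzll(bits)));
            bits &= bits - 1;  // clear the lowest set bit
        }
    }
    return out;
}

// starts marks the first byte of each name; ends marks the byte just past it.
std::vector<std::uint32_t> resolve(std::string_view text,
                                   const std::vector<std::uint64_t>& starts,
                                   const std::vector<std::uint64_t>& ends,
                                   SymbolTable& table) {
    const auto s = bit_positions(starts);
    const auto e = bit_positions(ends);
    assert(s.size() == e.size());
    std::vector<std::uint32_t> gids;
    gids.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
        gids.push_back(table.intern(text.substr(s[i], e[i] - s[i])));
    return gids;
}
```

In this sketch a repeated name maps to the same GID, so any grammar reference stored in its
\texttt{Symbol} record is resolved once and then reused for every later occurrence.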
The final component of the \PS{} is the {\it Content Stream Generator}. This component has a multitude of
responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce
output-ready UTF-16 content for the \MP{}.

Everything in the \MP{} uses a compressed representation of the document, generated by the
symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
The {\it WF checker} performs all remaining inter-element well-formedness validation that would be too costly
to perform in bitspace, such as ensuring every start tag has a matching end tag.
The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs) to
be associated with each instance of a symbol.
This is discussed in Section \ref{section:arch:namespacehandling}.
The final {\it Validation} process is responsible for the same tasks as Xerces's validator; however,
the majority of the grammar lookup operations are performed beforehand and stored within the symbols themselves.
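
The start/end tag matching performed by the WF checker is inherently sequential, which is why it
sits in the \MP{}. The sketch below shows one conventional way to do it, a stack of GIDs; the names
\texttt{Kind}, \texttt{TagEvent} and \texttt{tags\_balanced} are hypothetical, and the event layout is
an assumption rather than ICXML's actual compressed representation.

```cpp
#include <cstdint>
#include <vector>

enum class Kind { Start, End };

// Hypothetical compressed tag event: the kind of tag plus the GID of its name.
struct TagEvent {
    Kind kind;
    std::uint32_t gid;
};

// Returns true when every start tag is closed by an end tag with the same GID
// and the tags nest properly. Comparing GIDs replaces string comparison.
bool tags_balanced(const std::vector<TagEvent>& events) {
    std::vector<std::uint32_t> open;  // GIDs of currently open elements
    for (const TagEvent& e : events) {
        if (e.kind == Kind::Start) {
            open.push_back(e.gid);
        } else {
            if (open.empty() || open.back() != e.gid) return false;
            open.pop_back();
        }
    }
    return open.empty();
}
```

Since each name has already been reduced to a GID, the comparison at every end tag is a single
integer test regardless of the length of the element name.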
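
\begin{figure}
% Figure graphic not recoverable from this extract.
\caption{ICXML Architecture}
\label{fig:icxml-arch}
\end{figure}
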
% Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
% - over-reliance on hash tables when domain knowledge dictated none would be needed
% - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
% - abundant use of heap allocated memory
% - text conversions done in multiple areas
% - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk
%
% As the previous section alluded, the greatest difference between sequential parsing methods
% and the Parabix parsing model is how data is processed.
% Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
% tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
% events in sequential order, with all attribute data associated with the correct tag.
%
% The Parabix framework, however, does not allow for this (and would be hindered performance-wise if
% forced to).
% Thus our first question was, ``How can we take full advantage
% of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
% when given an input text.
%
% By analyzing Xerces's internal data structures and its produced output, two major observations were obvious:
% (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
% internally (within the grammar structures and hash tables) and externally (for the end user).
% (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
% declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
% grammar reference in place of the element or attribute string.
%
%   From Xerces to icXML
%
%   - Philosophy:  Maximizing Bit Stream Processing
%
%   - Character Set Adapters vs. Transcoding
%   - Bitstreams 1: Charset Validation and Transcoding equations
%   - Bitstreams 2: Parabix style parsing and validation
%
%   - Bitstreams 3: Parallel filtering and normalization
%           - LB normalization
%           - reference compression -> single code unit speculation
%           - parallel string termination
%
%   - Bitstreams 4: Symbol processing
%
%   - From bit streams to doublebyte streams: the content buffer
%
%   - Namespace Processing: A Bitset approach.