source: docs/Working/icXML/arch-overview.tex @ 2516

Last change on this file since 2516 was 2516, checked in by cameron, 7 years ago

Updates, more streams to discuss

File size: 8.2 KB
5\caption{Xerces Architecture} 
10\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
11rearchitected with pipeline parallelism in mind.
12In this section, we highlight the core differences between the two systems.
13As shown in Figure \ref{fig:xerces-arch}, Xerces
14is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
15The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
16the majority of the character set encoding validation is performed as a byproduct of this process.
17The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text;
18it keeps track of the current line/column of the cursor (which is reported to the end user in
19the unlikely event that the input contains an error), performs all line-break normalization
20and validates context-specific character set issues, such as tokenization of qualified-names and
21ensures each character is legal w.r.t. the XML specification.
22The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
23of the document; it deals with all issues related to entity expansion, validates
24the XML well-formedness constraints and any character set encoding issues that cannot
25be completely handled by the reader or transcoder (e.g., surrogate characters, validation
26and normalization of character references, etc.)
27The {\it Namespace Binder}, which is a core piece of their element stack, is primarily tasked
28with handling namespace scoping issues between different XML vocabularies and faciliates
29the scanner with the construction and utilization of Schema grammar structures.
30The {\it Validator} takes the IR produced by the Scanner (and
31potentially annotated by the Namespace Binder) and assesses whether the final output matches
32the user-defined DTD and Schema grammar(s) before passing it to the end-user.
36\caption{\icXML{} Architecture}
40In \icXML{} functions are grouped into logical components.
41As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
42All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
43The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
44mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
45set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
46These lexical bit streams are later transformed into UTF-16 in the Content Stream Generator,
47after additional processing is performed.
48The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
49It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
50significant positions within the input data. One bit stream for each of the critical piece of information is created, such as
51the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
52Intra-element well-formedness validation is performed as an artifact of this process.
53Like Xerces, \icXML{} must provide the Line and Column position of each error.
54The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
55optimized population count algorithm; this is described in Section \ref{section:arch:errorhandling}.
56From here, two data-independent branches exist: the Symbol Pesolver and Content Preperation Unit.
58\icXML{} represents elements and attributes as distinct data structures, called symbols,
59each with their own global identifier (GID).
60Using the {\bf symbol marker streams} produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
61the raw data to produce a stream (series) of GIDs, called the {\it symbol stream}.
62A typical XML file will contain relatively few unique element and attribute names---but each of them will occur
63frequently. % throughout the document.
64% Grammar information can be associated with each symbol and can help reduce the look-up cost of the later Validation process.
66The final components of the \PS{} are the {\it Content Preperation Unit} and {\it Content Stream Generator}.
67The former takes the (transposed) basis bit streams and selectively filters them, according to the
68information provided by the Parallel Markup Parser, and the latter transforms the
69filtered streams into the tagged UTF-16 {\it content stream}.
70This is discussed in Section \ref{sec:parfilter}.
71Combined, the symbol stream and content stream form \icXML{}'s compressed IR of the XML document.
73% This component has a multitude of
74% responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce
75% near-final UTF-16 content.
77The {\it \MP{}} parses the IR to validate and produces the sequential output for the end user.
78The {\it WF checker} performs inter-element wellformedness validation that would be too costly
79to perform in bitspace, such as ensuring every start tag has a matching end tag.
80Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
81it's performed as a discrete phase, which produces a set of URI identifiers (URI IDs) that is
82associated with symbol occurrence.
83This is discussed in Section \ref{section:arch:namespacehandling}.
84The final {\it Validation} process is responsible for the same tasks as Xerces's validator; however
85the majority of the grammar look-ups are performed beforehand and stored within the symbol themselves.
88% Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
89% - over-reliance on hash tables when domain knowledge dictated none would be needed
90% - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
91% - abundant use of heap allocated memory
92% - text conversions done in multiple areas
93% - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk
95% As the previous section aluded, the greatest difference between sequential parsing methods
96% and the Parabix parsing model is how data is processed.
97% Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
98% tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
99% events in sequential order, with all attribute data associated with the correct tag.
103% The Parabix framework, however, does not allow for this (and would be hindered performance wise if
104% forced to.)
105% Thus our first question was, ``How can we how can we take full advantage
106% of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
107% when given an input text.
109% By analyzing Xerces internal data structures and its produced output, two major observations were obvious:
110% (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
111% internally (within the grammar structures and hash tables) and externally (for the end user).
112% (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
113% declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
114% grammar reference in place of the element or attribute string.
120%   From Xerces to icXML
122%   - Philosophy:  Maximizing Bit Stream Processing
124%   - Character Set Adapters vs. Transcoding
125%   - Bitstreams 1: Charset Validation and Transcoding equations
126%   - Bitstreams 2: Parabix style parsing and validation
128%   - Bitstreams 3: Parallel filtering and normalization
129%           - LB normalization
130%           - reference compression -> single code unit speculation
131%           - parallel string termination
133%   - Bitstreams 4: Symbol processing
135%   - From bit streams to doublebyte streams: the content buffer
137%   - Namespace Processing: A Bitset approach.
Note: See TracBrowser for help on using the repository browser.