% Source: docs/Working/icXML/arch-overview.tex @ 3633 (last changed in revision 2872 by nmedfort)
\subsection{Overview}

\def \CSG{Content Stream Generator}

\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
rearchitected with pipeline parallelism in mind.
In this section, we highlight the core differences between the two systems.
As shown in Figure \ref{fig:xerces-arch}, Xerces
comprises five main modules: the transcoder, reader, scanner, namespace binder, and validator.
The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
the majority of the character set encoding validation is performed as a byproduct of this process.
The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text.
It tracks the current line/column position,
%(which is reported in the unlikely event that the input contains an error),
performs line-break normalization and validates context-specific character set issues,
such as tokenization of qualified names.
The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR)
of the document; it deals with all issues related to entity expansion, and validates
the XML well-formedness constraints and any character set encoding issues that cannot
be completely handled by the reader or transcoder (e.g., surrogate characters and the validation
and normalization of character references).
The {\it Namespace Binder} is a core piece of the element stack.
It handles namespace scoping issues between different XML vocabularies,
which allows the scanner to select the correct schema grammar structures.
The {\it Validator} takes the IR produced by the Scanner (and
potentially annotated by the Namespace Binder) and assesses whether the final output matches
the user-defined DTD and schema grammar(s) before passing it to the end user.

\begin{figure}[h]
\begin{center}
\includegraphics[height=0.45\textheight,keepaspectratio]{plots/xerces.pdf}
\caption{Xerces Architecture}
\label{fig:xerces-arch}
\end{center}
\end{figure}

In \icXML{}, functions are grouped into logical components.
As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
mirrors Xerces's Transcoder duties; however, instead of producing UTF-16, it produces a
set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
after additional processing is performed.
The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
significant positions within the input data. One \bitstream{} is created for each critical piece of information, such as
the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
Intra-element well-formedness validation is performed as an artifact of this process.
Like Xerces, \icXML{} must provide the line and column position of each error.
The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
From here, two data-independent branches exist: the Symbol Resolver and the Content Preparation Unit.
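
To make the idea of a marker \bitstream{} and of population-count-based position tracking concrete, the following is a minimal C++ sketch and not \icXML{}'s implementation. It assumes a byte-oriented input, builds the marker stream with a scalar loop rather than with the SIMD operations of the Parabix Framework, and uses a simple portable population count. The names \texttt{newlineMarkers}, \texttt{popcount64} and \texttt{lineOf} are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: build a marker bitstream in which bit i is 1
// exactly when text[i] is a line feed. Bits are packed 64 per word,
// least significant bit first.
std::vector<std::uint64_t> newlineMarkers(const std::string& text) {
    std::vector<std::uint64_t> bits((text.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\n') {
            bits[i / 64] |= std::uint64_t{1} << (i % 64);
        }
    }
    return bits;
}

// Portable population count: clears the lowest set bit until none remain.
int popcount64(std::uint64_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// 1-based line number of byte offset pos: one plus the number of
// line feeds located strictly before pos.
std::size_t lineOf(const std::vector<std::uint64_t>& markers, std::size_t pos) {
    assert(pos <= markers.size() * 64);
    std::size_t line = 1;
    // Count the markers in every whole word that precedes pos.
    for (std::size_t w = 0; w < pos / 64; ++w) {
        line += popcount64(markers[w]);
    }
    // Count the markers in the partial word, masking off bits at or after pos.
    const std::size_t rem = pos % 64;
    if (rem != 0) {
        const std::uint64_t mask = (std::uint64_t{1} << rem) - 1;
        line += popcount64(markers[pos / 64] & mask);
    }
    return line;
}
```

Under these assumptions, the line number of an error at a given offset is obtained by counting 1-bits in the line-feed marker stream up to that offset, so no per-character bookkeeping is needed while scanning.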

A typical XML file contains few unique element and attribute names, but each of them occurs frequently.
\icXML{} stores these as distinct data structures, called symbols, each with its own global identifier (GID).
Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through
the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
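
The mapping from names to GIDs can be sketched as follows. This is an illustration under stated assumptions and not \icXML{}'s data structure: it assumes that the name occurrences have already been extracted as strings (whereas the Symbol Resolver works from marker streams over the raw data), and it assigns GIDs in order of first appearance. The class \texttt{SymbolTable} and the function \texttt{symbolStream} are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using GID = std::uint32_t;

// Hypothetical symbol table: each distinct name is stored once and
// receives a GID equal to its order of first appearance.
class SymbolTable {
public:
    // Return the GID of name, creating a new symbol on first sight.
    GID resolve(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) {
            return it->second;
        }
        const GID gid = static_cast<GID>(names_.size());
        ids_.emplace(name, gid);
        names_.push_back(name);
        return gid;
    }

    // Look up the name that belongs to a GID (throws if out of range).
    const std::string& name(GID gid) const { return names_.at(gid); }

    // Number of distinct symbols seen so far.
    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, GID> ids_;
    std::vector<std::string> names_;
};

// Turn a sequence of name occurrences into a sequence of GIDs.
std::vector<GID> symbolStream(SymbolTable& table,
                              const std::vector<std::string>& occurrences) {
    std::vector<GID> stream;
    stream.reserve(occurrences.size());
    for (const std::string& occurrence : occurrences) {
        stream.push_back(table.resolve(occurrence));
    }
    return stream;
}
```

Because a document repeats a small vocabulary many times, later stages in such a design can compare and index small integers instead of repeatedly comparing name strings.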

The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
information provided by the Parallel Markup Parser, and the latter transforms the
filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.

Combined, the symbol and content streams form \icXML{}'s compressed IR of the XML document.
The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
to perform in bit space, such as ensuring every start tag has a matching end tag.
Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are
associated with each symbol occurrence.
This is discussed in Section \ref{section:arch:namespacehandling}.
Finally, the {\it Validation} layer implements the functionality of Xerces's validator.
However, preprocessing associated with each symbol greatly reduces the work of this stage.
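
As an illustration of why tag matching is naturally a sequential task, the sketch below checks that every start tag has a matching end tag by walking a sequence of tag events with a stack. It is an assumption-laden example and not the Final WF checker itself: the \texttt{TagEvent} record, the \texttt{TagKind} enumeration and the function \texttt{tagsBalanced} are hypothetical, and the event sequence is assumed to carry one GID per start or end tag.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical event record: a start or end tag together with the
// GID of its element name.
enum class TagKind { Start, End };

struct TagEvent {
    TagKind kind;
    std::uint32_t gid;
};

// Return true if every start tag is closed by an end tag with the
// same GID and the tags are properly nested.
bool tagsBalanced(const std::vector<TagEvent>& events) {
    std::vector<std::uint32_t> open;  // GIDs of currently open elements
    for (const TagEvent& event : events) {
        if (event.kind == TagKind::Start) {
            open.push_back(event.gid);
        } else {
            // An end tag must match the most recently opened element.
            if (open.empty() || open.back() != event.gid) {
                return false;
            }
            open.pop_back();
        }
    }
    // Any element still open at the end was never closed.
    return open.empty();
}
```

Each comparison in this sketch is an integer comparison of GIDs, and the result for one tag depends on all tags before it, which is the kind of inter-element dependency that is awkward to express over parallel \bitstream{}s.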

\begin{figure}[h]
\begin{center}
\includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf}
\end{center}
\caption{\icXML{} Architecture}
\label{fig:icxml-arch}
\end{figure}