# Changeset 2872 for docs/Working/icXML/arch-overview.tex

Timestamp:
Jan 30, 2013, 6:03:41 PM (7 years ago)
Message:

edits

File:
1 edited

In \icXML{} functions are grouped into logical components.
As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
mirrors Xerces's Transcoder duties; however, instead of producing UTF-16, it produces a
set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
after additional processing is performed.
The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
significant positions within the input data. One \bitstream{} is created for each critical piece of information, such as
the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
Intra-element well-formedness validation is performed as an artifact of this process.
The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s)
through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
A typical XML file contains few unique element and attribute names, but each of them will occur frequently.

the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
information provided by the Parallel Markup Parser, and the latter transforms the
filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
Combined, the symbol and content streams form \icXML{}'s compressed IR of the XML document.
The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
to perform in bit space, such as ensuring every start tag has a matching end tag.
Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are

\label{fig:icxml-arch}
\end{figure}

% Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
% - over-reliance on hash tables when domain knowledge dictated none would be needed
% - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
% - abundant use of heap allocated memory
% - text conversions done in multiple areas
% - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk

% As the previous section alluded, the greatest difference between sequential parsing methods
% and the Parabix parsing model is how data is processed.
% Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independently of the end
% tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
% events in sequential order, with all attribute data associated with the correct tag.
%
% The Parabix framework, however, does not allow for this (and would be hindered performance-wise if
% forced to).
% Thus our first question was, ``How can we take full advantage
% of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
% when given an input text.
%
% By analyzing Xerces's internal data structures and its produced output, two major observations were obvious:
% (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
% internally (within the grammar structures and hash tables) and externally (for the end user).
% (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
% declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
% grammar reference in place of the element or attribute string.

%   From Xerces to icXML
%
%   - Philosophy:  Maximizing Bit Stream Processing
%
%   - Character Set Adapters vs. Transcoding
%   - Bitstreams 1: Charset Validation and Transcoding equations
%   - Bitstreams 2: Parabix style parsing and validation
%
%   - Bitstreams 3: Parallel filtering and normalization
%           - LB normalization
%           - reference compression -> single code unit speculation
%           - parallel string termination
%
%   - Bitstreams 4: Symbol processing
%
%   - From bit streams to doublebyte streams: the content buffer
%
%   - Namespace Processing: A Bitset approach.
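
The Line-Column Tracker described in the overview derives document positions from lexical bit streams with a population count. The following is a minimal sketch of that idea only, assuming a bit stream can be modelled as a Python integer whose bit i corresponds to byte i of the input. The function names `lexical_stream` and `line_and_column` are hypothetical and are not taken from icXML, which operates on SIMD registers and uses its own optimized algorithm.

```python
def lexical_stream(data: bytes, byte_value: int) -> int:
    """Return an integer whose bit i is 1 exactly when data[i] == byte_value.

    Hypothetical helper: a plain loop stands in for the parallel
    character-class computation performed by the Parabix framework.
    """
    stream = 0
    for i, b in enumerate(data):
        if b == byte_value:
            stream |= 1 << i
    return stream


def line_and_column(data: bytes, pos: int):
    """Return the 1-based (line, column) of byte offset pos.

    Assumption: lines are terminated by a single LF byte.
    """
    newlines = lexical_stream(data, ord("\n"))
    # Keep only the newline bits strictly before pos.
    before = newlines & ((1 << pos) - 1)
    # Population count: number of newlines preceding pos.
    line = bin(before).count("1") + 1
    # bit_length() is one past the index of the last newline before pos,
    # i.e. the offset at which the current line starts (0 if none).
    column = pos - before.bit_length() + 1
    return line, column
```

For the input `b"<a>\n<b/>\n</a>"`, offset 5 (the `b` of `<b/>`) lies after one newline, so the sketch reports line 2, column 2. The same `lexical_stream` helper can mark other characters, such as `<`, in the spirit of the marker streams produced by the Parallel Markup Parser.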