# Changeset 2439 for docs/Working

Timestamp:
Oct 11, 2012, 5:09:57 PM (7 years ago)
Message:

work on overview

Location:
docs/Working/icXML
Files:
5 edited

• ## docs/Working/icXML/arch-errorhandling.tex

 r2429 \subsection{Error Handling} \label{section:arch:errorhandling} Challenges / Line Col Tracker % Challenges / Line Col Tracker
• ## docs/Working/icXML/arch-namespace.tex

 r2429 \subsection{Namespace Handling} \label{section:arch:namespacehandling} % Xerces stack-oriented vs icXML's bit-field oriented approach
• ## docs/Working/icXML/arch-overview.tex

 r2429 \subsection{Overview}

As the previous section alluded, the greatest difference between sequential parsing methods and the Parabix parsing model is how data is processed. Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independently of the end tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag events in sequential order, with all attribute data associated with the correct tag.

To better understand the difficulties in re-architecting Xerces, it is important to know how Xerces and icXML differ in design. As shown in Figure \ref{fig:xerces-arch}, Xerces comprises five main modules: the reader, transcoder, scanner, namespace binder, and validator.
The transcoder converts all input data into UTF-16; all text is run through this module before being processed as XML. The majority of the character set encoding validation is performed as a byproduct of this process.
The reader is responsible for the streaming and buffering of all raw and transcoded text; it keeps track of the current line/column of the cursor, performs all line-break normalization, and validates context-specific character set issues, such as tokenization and ensuring that each character is legal with respect to the XML specification at that position.
The scanner pulls data through the reader and constructs the intermediate (and near-final) representation of the document; it deals with all issues related to entity expansion, validates the XML well-formedness constraints, and handles any remaining character set encoding issues that cannot be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references).
The namespace binder is primarily tasked with handling all namespace scoping issues between different XML vocabularies and assists the scanner with the construction and use of Schema grammar structures.
The validator's job is to take the intermediate representation produced by the scanner (and potentially annotated by the namespace binder) and assess whether the final output would match the user-created Schema or DTD grammar specification(s).

The Parabix framework, however, does not allow for this kind of sequential processing (and would be hindered performance-wise if forced to). Thus our first question was, ``How can we take full advantage of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced when given an input text.

\begin{figure}
\begin{center}
\includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
\caption{Xerces Architecture}
\label{fig:xerces-arch}
\end{center}
\end{figure}

Analyzing Xerces's internal data structures and the output it produced led to two major observations: (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both internally (within the grammar structures and hash tables) and externally (for the end user). (2) all elements and attributes (both qualified and unqualified) are associated with a unique element declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate grammar reference in place of the element or attribute string.

icXML differs substantially from Xerces: its tasks, as shown in Figure \ref{fig:icxml-arch}, are grouped into logical components, ready for pipeline parallelism. Two major categories of functions exist: those in the Parabix subsystem, and those in the markup processor. All tasks in the Parabix subsystem use the Parabix framework and represent data as bit streams.
The character set adapter closely mirrors Xerces's transcoder in terms of responsibility; however, it produces a set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}, from the raw input instead of UTF-16.
The line-column tracker uses the lexical information to keep track of the cursor position(s) through the use of an optimized population count algorithm, which is described in Section \ref{section:arch:errorhandling}.
The parallel markup parser uses the same lexical stream to mark key positions within the input data, such as the beginning and ending of tags, element and attribute names, and content. Intra-element well-formedness validation is performed as an artifact of this process.
From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
% The output of both is required by the markup processor.

Apart from the use of the Parabix framework, one of the core differences between icXML and Xerces is the use of symbols. A typical XML document will contain relatively few unique element and attribute names, but each of them will occur frequently throughout the document. Each name is represented by a distinct symbol structure and global identifier (GID). Using the information produced by the parallel markup parser, the {\it symbol resolver} uses a bitscan instruction to iterate through a symbol bit stream (64 bits at a time) to generate a set of GIDs.
% The size of this set is, at most, the length of the input data $\div$ 2, as every symbol must have a terminal character.
One of the main advantages of this approach is that grammar information can be associated with the symbol itself, which helps bypass the lookup cost in the validation process.

The final component of the Parabix subsystem is the {\it content stream generator}. This component has a multitude of responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce output-ready UTF-16 content for the markup processor.
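The two bit-stream operations mentioned above, bit scanning over a symbol stream and population counting for line tracking, can be illustrated with a minimal C++ sketch. This is not icXML code: the function names `symbol_positions` and `lines_before` are hypothetical, bit $i$ of word $w$ is assumed to stand for input position $64w + i$, and the plain popcount loop is only a stand-in for the optimized algorithm described in Section \ref{section:arch:errorhandling}. `__builtin_ctzll` and `__builtin_popcountll` are GCC/Clang builtins; C++20 offers `std::countr_zero` and `std::popcount` in `<bit>` as portable equivalents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: list the position of every 1 bit in a bit stream,
// scanning one 64-bit word at a time with a bit-scan (count trailing zeros).
// Assumption: bit i of word w marks input position 64 * w + i.
std::vector<std::size_t> symbol_positions(const std::vector<std::uint64_t>& stream) {
    std::vector<std::size_t> positions;
    for (std::size_t w = 0; w < stream.size(); ++w) {
        std::uint64_t bits = stream[w];
        while (bits != 0) {
            // Index of the lowest set bit in the current word.
            positions.push_back(w * 64 + static_cast<std::size_t>(__builtin_ctzll(bits)));
            bits &= bits - 1;  // clear the lowest set bit and continue
        }
    }
    return positions;
}

// Hypothetical helper: count the line breaks strictly before position pos,
// given a bit stream with a 1 bit at every line-break position.
std::size_t lines_before(const std::vector<std::uint64_t>& newline_stream, std::size_t pos) {
    assert(pos <= newline_stream.size() * 64);
    std::size_t count = 0;
    const std::size_t full_words = pos / 64;
    for (std::size_t w = 0; w < full_words; ++w) {
        count += static_cast<std::size_t>(__builtin_popcountll(newline_stream[w]));
    }
    const std::size_t remainder = pos % 64;
    if (remainder != 0) {
        // Keep only the bits below the cursor in the partially covered word.
        const std::uint64_t mask = (std::uint64_t{1} << remainder) - 1;
        count += static_cast<std::size_t>(__builtin_popcountll(newline_stream[full_words] & mask));
    }
    return count;
}
```

In a symbol resolver of the kind described above, each position returned by `symbol_positions` would presumably be used to locate a name and map it to its GID; that lookup is outside the scope of this sketch.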
Everything in the markup processor uses a compressed representation of the document, generated by the symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
The {\it WF checker} performs all remaining inter-element well-formedness validation that would be too costly to perform in bitspace, such as ensuring every start tag has a matching end tag.
The {\it namespace processor} replaces Xerces's namespace binding functionality. Unlike Xerces, this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to be associated with each instance of a symbol. This is discussed in Section \ref{section:arch:namespacehandling}.
The final {\it validation} process is responsible for the same tasks as Xerces's validator; however, the majority of the grammar lookup operations are performed beforehand and stored within the symbols themselves.

\begin{figure}
\includegraphics[width=0.50\textwidth]{plots/icxml.pdf}
\caption{icXML Architecture}
\label{fig:icxml-arch}
\end{figure}

% Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
% - over-reliance on hash tables when domain knowledge dictated none would be needed
% - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
% - abundant use of heap allocated memory
% - text conversions done in multiple areas
% - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk

% As the previous section alluded, the greatest difference between sequential parsing methods
% and the Parabix parsing model is how data is processed.
% Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
% tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
% events in sequential order, with all attribute data associated with the correct tag.
%
%
%
% The Parabix framework, however, does not allow for this (and would be hindered performance wise if
% forced to.)
% Thus our first question was, ``How can we take full advantage
% of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
% when given an input text.
%
% By analyzing Xerces internal data structures and its produced output, two major observations were obvious:
% (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
% internally (within the grammar structures and hash tables) and externally (for the end user).
% (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
% declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
% grammar reference in place of the element or attribute string.
• ## docs/Working/icXML/background-parabix.tex

 r2429 Using a mixture of Boolean-logic and arithmetic operations, character-class bit streams can be transformed into lexical bit streams, where the presence of a 1 bit identifies a key position in the input data. As an artifact of this process, intra-element well-formedness validation is performed on each block of text.
% Using a mixture of Boolean-logic and arithmetic operations, character-class
% bit streams can be transformed into lexical bit streams, where the presence of
% a 1 bit identifies a key position in the input data. As an artifact of this
% process, intra-element well-formedness validation is performed on each block
% of text.
Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
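As a rough illustration of how Boolean and arithmetic operations can turn character-class bit streams into a lexical bit stream, the sketch below marks where each tag name ends. It relies on the addition-based scan commonly associated with Parabix, $(M + C) \wedge \neg C$, which moves each marker in $M$ past a run of characters in class $C$. Everything here is an assumption made for illustration: the helper names (`class_stream`, `advance`, `scan_thru`, `name_end_stream`) are hypothetical, only a single 64-bit block is handled, bit $i$ stands for byte $i$ of the input, and "name character" is simplified to ASCII alphanumerics.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Build a character-class bit stream for the first 64 bytes of text:
// bit i is 1 exactly when pred(text[i]) holds.
template <typename Pred>
std::uint64_t class_stream(const std::string& text, Pred pred) {
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < text.size() && i < 64; ++i) {
        if (pred(static_cast<unsigned char>(text[i]))) {
            bits |= std::uint64_t{1} << i;
        }
    }
    return bits;
}

// Move every marker one position forward in the text.
inline std::uint64_t advance(std::uint64_t markers) {
    return markers << 1;
}

// Move every marker forward past a run of characters in char_class.
// The carry of the addition ripples through the run of 1 bits and
// lands on the first position after it.
inline std::uint64_t scan_thru(std::uint64_t markers, std::uint64_t char_class) {
    return (markers + char_class) & ~char_class;
}

// Lexical stream with a 1 bit at the first position after each tag name
// (simplified: a name is a run of ASCII alphanumerics following '<').
std::uint64_t name_end_stream(const std::string& text) {
    const std::uint64_t lt = class_stream(text, [](unsigned char c) { return c == '<'; });
    const std::uint64_t name = class_stream(text, [](unsigned char c) { return std::isalnum(c) != 0; });
    const std::uint64_t name_starts = advance(lt);
    return scan_thru(name_starts, name);
}
```

For an input such as `<a><bc/>`, the resulting stream has 1 bits at positions 2 and 6, the `>` after `a` and the `/` after `bc`. All tags in the block are handled by the same few operations, with no per-character branching.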
• ## docs/Working/icXML/icxml-main.tex

 r2429 \input{arch-errorhandling.tex} \begin{figure} \begin{center} \includegraphics[width=0.15\textwidth]{plots/xerces.pdf} \label{fig:xerces-arch} \caption{} \end{center} \end{figure} \begin{figure} \includegraphics[width=0.50\textwidth]{plots/icxml.pdf} \label{fig:icxml-arch} \caption{} \end{figure} \section{Performance}