Changeset 2470 for docs/Working/icXML

Timestamp:
Oct 16, 2012, 5:48:15 PM
Message:

More work; mostly edits

Location:
docs/Working/icXML
Files:
5 edited


 r2429 \subsection{Character Set Adapters} \label{arch:character-set-adapter} The first major difference between Xerces and ICXML is the use of Character Set Adapters (CSAs). In Xerces, all input is transcoded into UTF-16 to simplify the parsing logic of Xerces itself and to provide the end-consumer with a single encoding format.
• docs/Working/icXML/arch-errorhandling.tex

 r2455 One of the CSA's major responsibilities is transcoding an input text from some encoding format to near-output-ready UTF-16. During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected and validated. A {\it line-feed bit stream}, which marks the positions of the normalized new-line characters, is a natural derivative of this process. Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text. % The optimization delays the counting process .... Column position is more difficult to calculate. It is possible to scan backwards through the line-feed bit stream to determine the distance (in code units) between the position at which an error was detected and the last line feed. However, this distance may exceed the actual character position for the reasons discussed in (2). To handle this, the CSA generates a {\it skip mask} bit stream by ORing together many relevant bit streams, such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the normalization process. When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual column number. \begin{figure}[h] \caption{} \end{figure} The Markup Processor is a state-driven machine.
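The line and column arithmetic described above can be sketched sequentially. This is a hypothetical illustration only: the names are invented, and icXML's actual implementation works on SIMD blocks with an optimized population count rather than a bit-at-a-time loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch (not icXML's code): bit i of word i/64 corresponds to code-unit
// position i.  The line number is 1 + the population count of the
// line-feed stream before the error; the column is the distance back to
// the previous line feed, minus the skipped positions in that span.
struct LineCol { uint64_t line; uint64_t column; };

static uint64_t popcountRange(const std::vector<uint64_t>& bits,
                              uint64_t begin, uint64_t end) {
    // Counts 1-bits in positions [begin, end).
    uint64_t count = 0;
    for (uint64_t pos = begin; pos < end; ++pos) {
        if (bits[pos / 64] & (1ULL << (pos % 64))) ++count;
    }
    return count;
}

LineCol locate(const std::vector<uint64_t>& lineFeeds,
               const std::vector<uint64_t>& skipMask,
               uint64_t errorPos) {
    // Line: 1 + number of line feeds strictly before the error position.
    uint64_t line = 1 + popcountRange(lineFeeds, 0, errorPos);
    // Scan backwards for the most recent line feed; the line starts
    // immediately after it.
    uint64_t lineStart = 0;
    for (uint64_t pos = errorPos; pos > 0; --pos) {
        if (lineFeeds[(pos - 1) / 64] & (1ULL << ((pos - 1) % 64))) {
            lineStart = pos;
            break;
        }
    }
    // Raw code-unit distance minus the skip-mask positions in the span
    // gives the 1-based column number.
    uint64_t skipped = popcountRange(skipMask, lineStart, errorPos);
    return LineCol{line, (errorPos - lineStart) - skipped + 1};
}
```
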
As such, error detection within it is very similar to Xerces. However, line/column tracking within it is a much more difficult problem. The Markup Processor parses the content stream, which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol data and unnecessary whitespace are eliminated from the stream. This means it is impossible to directly assess the current location with only the content stream. To calculate this, the Markup Processor borrows three additional pieces of information from the Parabix subsystem: the line-feed, skip mask, and a {\it deletion mask stream}, which is a bit stream that denotes every code unit that was suppressed from the raw data during the production of the content stream. Armed with the cursor position in the content stream,
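A minimal sketch of how a deletion mask can map a content-stream cursor back to a raw-input position follows. The function and variable names are assumptions for illustration, not icXML's API, and the linear walk stands in for whatever block-wise counting the real system uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: the deletion mask marks every raw code unit that was
// suppressed when the content stream was produced.  The raw position of
// content-stream cursor n is the position of the (n+1)-th surviving
// (0-bit) code unit.
uint64_t rawPosition(const std::vector<uint64_t>& deletionMask,
                     uint64_t totalCodeUnits, uint64_t contentCursor) {
    uint64_t kept = 0;
    for (uint64_t pos = 0; pos < totalCodeUnits; ++pos) {
        bool deleted = deletionMask[pos / 64] & (1ULL << (pos % 64));
        if (!deleted) {
            if (kept == contentCursor) return pos;  // cursor maps here
            ++kept;
        }
    }
    return totalCodeUnits;  // cursor at (or past) the end of the stream
}
```
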
• docs/Working/icXML/arch-namespace.tex

 r2449 In both Xerces and ICXML, every URI has a one-to-one mapping to a URI ID. These IDs persist for the lifetime of the application through the use of a global URI pool. Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs in the document. Because a namespace declaration affects the entire element, it must be processed prior to (1) those that declare a set of namespaces upfront and never change them, and (2) those that repeatedly modify the namespace scope within the document in predictable patterns. For that reason, ICXML contains an independent namespace stack and utilizes bit vectors to cheaply perform % speculation and scope resolution options with a single XOR operation---even if many alterations are performed. % performance advantage figure?? average cycles/byte cost? When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps the prefix (which is assigned a Prefix ID in the symbol resolution process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and every prefix contains a bit vector marking every NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding sets of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively. To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope NSID can be found using a bit scan instruction. A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
\begin{table}[h] \end{table} Each unique URI is provided with a URI ID through the use of a global URI pool, similar to Xerces. % PrefixBindings = PrefixBindingTable[prefixID]; % VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces; within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the currently-visible-namespaces vector is XORed with the vector at the top of the stack. This allows any number of changes to be performed at each scope level in constant time. % Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.
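The scope bookkeeping described above can be modelled as follows. This is a simplified sketch with invented names, not icXML's data structures: a single 64-bit word stands in for the visible-namespaces vector, and the count-trailing-zeros intrinsic plays the role of the bit scan instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: bit i of `visible` means NSID i is currently in scope.  Each
// open scope records the bits it flipped, so an end tag undoes all of a
// scope's changes with one XOR.
struct NamespaceScopes {
    uint64_t visible = 0;            // currently visible NSIDs
    std::vector<uint64_t> modified;  // per-scope change masks

    void openScope() { modified.push_back(0); }

    // Declare (or re-declare) a binding: flip the new NSID on and, if
    // this prefix had an in-scope binding, flip the old NSID off.
    void declare(uint64_t oldNsid, bool hadOld, uint64_t newNsid) {
        uint64_t change = (1ULL << newNsid) | (hadOld ? (1ULL << oldNsid) : 0);
        visible ^= change;
        modified.back() ^= change;   // remember what this scope changed
    }

    void closeScope() {              // end tag: undo local changes
        visible ^= modified.back();
        modified.pop_back();
    }

    // AND the prefix's binding set with the visible vector, then
    // bit-scan for the in-scope NSID (-1 if none).
    static int resolve(uint64_t prefixBindings, uint64_t visible) {
        uint64_t inScope = prefixBindings & visible;
        return inScope ? __builtin_ctzll(inScope) : -1;  // GCC/Clang intrinsic
    }
};
```
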
• docs/Working/icXML/arch-overview.tex

 r2439 \subsection{Overview} ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and re-architected into a pipeline-parallel-ready structure. In this section, we highlight the core differences between the two systems and discuss how their designs differ. As shown in Figure \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and validator. The {\it Transcoder} converts all input data into UTF-16; all text runs through this module before being processed as XML. The majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text; it keeps track of the current line/column of the cursor (which is reported to the end user in the unlikely event that the input file contains an error), performs all line-break normalization, and validates context-specific character set issues, such as tokenization of qualified names and ensuring each character is legal w.r.t. the XML specification.
The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final) representation of the document; it deals with all issues related to entity expansion and validates the XML well-formedness constraints and any character set encoding issues that cannot be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.). The {\it Namespace Binder} is primarily tasked with handling all namespace scoping issues between different XML vocabularies and facilitates the scanner with the construction and utilization of Schema grammar structures. The {\it Validator} takes the intermediate representation produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and Schema grammar(s). \begin{figure} \end{figure}
In ICXML, tasks, as shown in Figure \ref{fig:icxml-arch}, are grouped into logical components, ready for pipeline parallelism. Two major categories of functions exist: those in the Parabix subsystem and those in the Markup Processor. All tasks in the Parabix subsystem use the Parabix framework {\bf (citation?)} and represent data as a series of bit streams, which are discussed in Section \ref{background:parabix}. The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter}, closely mirrors Xerces's transcoder duties; however, instead of producing UTF-16 it produces a set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}. These lexical bit streams are later transformed into UTF-16 in the Content Buffer Generator, after additional processing is performed. The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase. It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies significant positions within the input data. One bit stream is created for each critical piece of information, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process. Like Xerces, ICXML must provide the line and column position of each error.
The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}. From here, two major data-independent branches remain: the {\it Symbol Resolver} and the {\it Content Stream Generator}. % The output of both are required by the markup processor. frequently throughout the document. Each name is represented by a distinct symbol structure and global identifier (GID). Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to iterate through a symbol bit stream (64 bits at a time) to generate a set of GIDs. % The size of this set is, at most, the length of the input data $\div$ 2, as every symbol must have a terminal character. % It keys each symbol on its raw data representation, which means it can potentially be run in parallel with the content stream generator. One of the main advantages of using GIDs is that grammar information can be associated with the symbol itself and help bypass the lookup cost in the validation process. The final component of the Parabix subsystem is the {\it Content Stream Generator}. This component has a multitude of responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce output-ready UTF-16 content for the Markup Processor. The {\it WF Checker} performs all remaining inter-element well-formedness validation that would be too costly to perform in bitspace, such as ensuring every start tag has a matching end tag.
The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces, this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs) to be associated with each instance of a symbol. This is discussed in Section \ref{section:arch:namespacehandling}. The final {\it Validation} process is responsible for the same tasks as Xerces's validator; however, the majority of the grammar lookup operations are performed beforehand and stored within the symbols themselves. \begin{figure}
• docs/Working/icXML/background-parabix.tex

 r2439 \subsection{The Parabix Framework} \label{background:parabix} \begin{figure*}[tbhp] \begin{center}