Changeset 2470 for docs


Timestamp:
Oct 16, 2012, 5:48:15 PM (7 years ago)
Author:
nmedfort
Message:

More work; mostly edits

Location:
docs/Working/icXML
Files:
5 edited

  • docs/Working/icXML/arch-charactersetadapters.tex

    r2429 r2470  
    11\subsection{Character Set Adapters}
     2\label{arch:character-set-adapter}
     3
     4The first major difference between Xerces and ICXML is the use of Character Set Adapters (CSAs). In Xerces, all input
     5is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and to provide the end-consumer with a single
     6encoding format.
  • docs/Working/icXML/arch-errorhandling.tex

    r2455 r2470  
    2929One of the CSA's major responsibilities is transcoding an input text from some encoding format to near-output-ready UTF-16.
    3030During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
    31 and validated. Bit streams marking the positions of the normalized new lines is a natural derivative of this process.
     31and validated.
     32A {\it line-feed bit stream}, which marks the positions of the normalized new line characters, is a natural derivative of
     33this process.
    3234Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
    3335% The optimization delays the counting process ....
    3436Column position is more difficult to calculate.
    3537It is possible to scan backwards through the bit stream of new line characters to determine the distance (in code-units)
    36 between the position between which an error was detected and the last line feed. However, as some of these code-units
    37 are skipped over when tallying up the position, the CSA must generate a {\it skip mask} bit stream to represent those
    38 characters. This mask ORs together many relevant bit streams, such as all trailing multi-code-unit and surrogate
    39 characters, and any characters that were removed during the normalization process.
     38between the position at which an error was detected and the last line feed. However, this distance may exceed
     39the actual character position for the reasons discussed in (2).
     40To handle this, the CSA generates a {\it skip mask} bit stream by ORing together many relevant bit streams,
     41such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
     42normalization process.
    4043When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
    41 column position.
     44column number.
    4245
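The line and column calculation described above can be sketched as follows. This is a minimal illustration, not the ICXML implementation: the names \verb|locate|, \verb|Position| and \verb|testBit| are hypothetical, bit streams are assumed to be stored as 64-bit blocks with the least significant bit first, lines and columns are assumed to be 1-based, and a simple portable population count stands in for the optimized algorithm mentioned in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable population count (Kernighan's method); a stand-in for the
// optimized algorithm referred to in the text.
inline unsigned popcount64(std::uint64_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Hypothetical result type; lines and columns are 1-based by assumption.
struct Position { std::size_t line; std::size_t column; };

// Test bit i of a bit stream stored as 64-bit blocks, LSB first (assumed layout).
inline bool testBit(const std::vector<std::uint64_t>& s, std::size_t i) {
    return (s[i / 64] >> (i % 64)) & 1u;
}

Position locate(const std::vector<std::uint64_t>& lineFeeds,
                const std::vector<std::uint64_t>& skipMask,
                std::size_t errorPos) {
    assert(lineFeeds.size() == skipMask.size());
    assert(errorPos < lineFeeds.size() * 64);

    // Line number: one plus the number of line feeds strictly before errorPos.
    std::size_t line = 1;
    const std::size_t fullBlocks = errorPos / 64;
    for (std::size_t b = 0; b < fullBlocks; ++b) line += popcount64(lineFeeds[b]);
    const std::uint64_t below = (1ULL << (errorPos % 64)) - 1;
    line += popcount64(lineFeeds[fullBlocks] & below);

    // Scan backwards to the code unit that follows the last line feed.
    std::size_t start = 0;
    for (std::size_t i = errorPos; i > 0; --i) {
        if (testBit(lineFeeds, i - 1)) { start = i; break; }
    }

    // Distance in code units, minus the code units flagged in the skip mask.
    const std::size_t distance = errorPos - start;
    std::size_t skipped = 0;
    for (std::size_t i = start; i < errorPos; ++i) skipped += testBit(skipMask, i);

    return { line, distance - skipped + 1 };
}
```

For example, with a line feed at position 2 and the trailing half of a surrogate pair flagged at position 5, an error at position 6 sits three code units past the line feed but only two characters past it, so one skipped position is subtracted from the distance.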
    4346\begin{figure}[h]
     
    4750\caption{}
    4851\end{figure}
     52
     53The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces.
     54However, line/column tracking within it is a much more difficult problem. The Markup Processor parses the content stream,
     55which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol
     56data and unnecessary whitespace are eliminated from the stream.
     57This means it is impossible to directly assess the current location with only the content stream.
     58To calculate this, the Markup Processor borrows three additional pieces of information from the Parabix subsystem:
     59the line-feed, skip mask, and a {\it deletion mask stream}, which is a bit stream that denotes every code-unit that
     60was suppressed from the raw data during the production of the content stream.
     61
     62
     63Armed with the cursor position in
     64the content stream,
     65
  • docs/Working/icXML/arch-namespace.tex

    r2449 r2470  
    3232
    3333
    34 In Xerces, every URI is mapped to a unique URI ID number.
    35 These IDs persist throughout the lifetime of the application.
     34In both Xerces and ICXML, every URI has a one-to-one mapping to a URI ID.
     35These persist for the lifetime of the application through the use of a global URI pool.
    3636Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs
    3737in the document. Because a namespace declaration affects the entire element, it must be processed prior to
     
    4040(1) those that declare a set of namespaces upfront and never change them, and
    4141(2) those that repeatedly modify the namespace scope within the document in predictable patterns.
     42
     43For that reason, ICXML contains an independent namespace stack and utilizes bit vectors to cheaply perform
     44% speculation and
     45scope resolution operations with a single XOR operation---even if many alterations are performed.
     46% performance advantage figure?? average cycles/byte cost?
     47When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps
     48the prefix (which is assigned a Prefix ID in the symbol resolution process) to the URI.
     49Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
     50NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
     51prefix binding set of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively.
     52To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
     53maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
     54NSID can be found using a bit scan instruction.
     55A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
    4256
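The lookup described above (AND the prefix bit vector with the currently visible namespaces, bit scan for the NSID, then consult the binding table) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than ICXML's code: \verb|NamespaceResolver|, its members and \verb|bitScanForward| are hypothetical names, at most 64 NSIDs are assumed so that one machine word holds a bit vector, and the portable loop stands in for a hardware bit scan instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Index of the lowest set bit, or -1 if no bit is set.
// A portable stand-in for a hardware bit scan instruction.
inline int bitScanForward(std::uint64_t x) {
    if (x == 0) return -1;
    int i = 0;
    while ((x & 1u) == 0) { x >>= 1; ++i; }
    return i;
}

// Hypothetical container for the tables named in the text.
struct NamespaceResolver {
    std::vector<std::uint64_t> prefixBindings;  // per Prefix ID: every NSID ever bound to it
    std::vector<unsigned> bindingUriId;         // per NSID: the URI ID of that binding
    std::uint64_t visible;                      // currently visible NSIDs

    // Returns the URI ID in scope for prefixId, or -1 if the prefix is unbound.
    // Assumes at most one binding per prefix is visible at any time.
    int resolve(unsigned prefixId) const {
        assert(prefixId < prefixBindings.size());
        const std::uint64_t inScope = prefixBindings[prefixId] & visible;
        const int nsid = bitScanForward(inScope);
        if (nsid < 0) return -1;
        assert(static_cast<std::size_t>(nsid) < bindingUriId.size());
        return static_cast<int>(bindingUriId[nsid]);
    }
};
```

The cost of a lookup is independent of how many bindings a prefix has accumulated over the document, since the visibility mask selects the single in-scope bit.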
    4357\begin{table}[h]
     
    5468\end{table}
    5569
    56 For that reason, ICXML contains an independent namespace stack and utilizes bit vectors to cheaply perform
    57 % speculation and
    58 scope resolution options with a single XOR operation---even if many alterations are performed.
    59 % performance advantage figure?? average cycles/byte cost?
    60 When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps
    61 the prefix, which are assigned prefix ids in the symbol resolution process, to the URI.
    62 Each unique URI is provided with an URI ID through the use of a global URI pool, similar to Xerces.
    63 Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every
    64 NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the
    65 prefix binding set of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively.
    66 To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
    67 maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
    68 NSID can be found using a bit scan instruction. A namespace binding table, similar to Table \ref{tbl:namespace1},
    69 provides the actual URI ID.
    70 
    7170% PrefixBindings = PrefixBindingTable[prefixID];
    7271% VisiblePrefixBinding = PrefixBindings & CurrentlyVisibleNamespaces;
     
    7877within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the
    7978vector of currently visible namespaces is XORed with the vector at the top of the stack.
     79This allows any number of changes to be performed at each scope level in constant time.
    8080% Speculation can be handled by probing the historical information within the stack but that goes beyond the scope of this paper.
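The XOR-based scope handling can be sketched as below. The start-tag side is an assumption, since the text shown here only states what happens at an end tag: the sketch records, for each element, the set of NSID bits that its declarations flip (bindings hidden plus bindings introduced) and pushes that set on the stack. \verb|ScopeStack| and its members are hypothetical names, and at most 64 NSIDs are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scope tracker over a single-word NSID bit vector.
struct ScopeStack {
    std::uint64_t visible;                 // currently visible NSIDs
    std::vector<std::uint64_t> modified;   // per open element: bits flipped locally

    ScopeStack() : visible(0) {}

    // hidden:   visible bindings that this element's declarations override.
    // declared: bindings that this element's declarations make visible.
    void startTag(std::uint64_t hidden, std::uint64_t declared) {
        assert((hidden & visible) == hidden);   // can only hide what is visible
        assert((declared & visible) == 0);      // can only reveal what is hidden
        const std::uint64_t delta = hidden | declared;
        visible ^= delta;                       // apply all changes at once
        modified.push_back(delta);
    }

    // A single XOR undoes every change made by the matching start tag.
    void endTag() {
        assert(!modified.empty());
        visible ^= modified.back();
        modified.pop_back();
    }
};
```

Because XOR is its own inverse, restoring the enclosing scope costs one operation regardless of how many declarations the element carried.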
  • docs/Working/icXML/arch-overview.tex

    r2439 r2470  
    11\subsection{Overview}
    22
    3 To better understand the difficulties in re-architecting Xerces, it is important to know
    4 how Xerces and ICXML differ design wise. As shown in Figure \ref{fig:xerces-arch}, Xerces
    5 is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and
    6 validator.
    7 The transcoder converts all input data into UTF16; all text run through this module before
     3ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
      4rearchitected into a pipeline-parallel-ready structure.
      5In this section, we highlight the core differences between the two systems and discuss how their
      6designs differ.
     7As shown in Figure \ref{fig:xerces-arch}, Xerces
     8is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and validator.
      9The {\it Transcoder} converts all input data into UTF-16; all text runs through this module before
    810being processed as XML. The majority of the character set encoding validation is performed
    911as a byproduct of this process.
    10 The reader is responsible for the streaming and buffering of all raw and transposed text;
    11 it keeps track of the current line/column of the cursor, performs all line-break normalization
    12 and validates context-specific character set issues, such as tokenization and ensuring each
    13 character is legal w.r.t. the XML specification at that position.
    14 The scanner pulls data through the reader and constructs the intermediate (and near-final)
     12The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text;
     13it keeps track of the current line/column of the cursor (which is reported to the end user in
     14the unlikely event that the input file contains an error), performs all line-break normalization
     15and validates context-specific character set issues, such as tokenization of qualified-names and
     16ensuring each character is legal w.r.t. the XML specification.
     17The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final)
    1518representation of the document; it deals with all issues related to entity expansion, validates
    16 the XML wellformedness constraints, and remaining character set encoding issues that cannot
     19the XML wellformedness constraints and any character set encoding issues that cannot
    1720be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    18 and normalization of character references).
    19 The namespace binder is primarily tasked with handling all namespace scoping issues between
     21and normalization of character references).
     22The {\it Namespace binder} is primarily tasked with handling all namespace scoping issues between
    2023different XML vocabularies and facilitates the scanner's construction and utilization
    2124of Schema grammar structures.
    22 The validator's job is to take the intermediate representation produced by the scanner (and
    23 potentially annotated by the namespace binder) and assess whether the final output would match
    24 the user-created Schema or DTD grammar specification(s).
     25The {\it Validator} takes the intermediate representation produced by the Scanner (and
     26potentially annotated by the Namespace Binder) and assesses whether the final output matches
     27the user-defined DTD and Schema grammar(s).
    2528
    2629\begin{figure}
     
    3235\end{figure}
    3336
    34 ICXML differs substantially from Xerces in many ways: tasks, as shown in Figure \ref{fig:icxml-arch} were grouped into
    35 logical components, ready for pipeline parallism. Two major categories of functions exist: those in the parabix subsystem, and
    36 those in the markup processor. All tasks in the parabix subsystem use the parabix framework and represent data as bit streams.
    37 The character set adapter closely mirrors Xerces's transcoder in terms of responsibility; however it produces a set of lexical
    38 bit streams, similar to those shown in Figure \ref{fig:parabix1}, from the raw input instead of UTF16.
    39 The line-column tracker uses the lexical information to keep track of the cursor position(s) through the use of an
    40 optimized population count algorithm, which is described in Section \ref{section:arch:errorhandling}.
    41 The parallel markup parser utilizes the same lexical stream to mark key positions within the input data, such as the beginning
    42 and ending of tags, element and attribute names, and content. Intra-element well-formedness validation is performed as an
    43 artifact of this process.
     37In ICXML, tasks, as shown in Figure \ref{fig:icxml-arch}, are grouped into logical components.
     38Two major categories of functions exist: those in the parabix subsystem, and
     39those in the markup processor. All tasks in the parabix subsystem use the parabix framework {\bf (citation?)} and represent
     40data as a series of bit streams, which are discussed in Section \ref{background:parabix}.
     41The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
      42 closely mirrors Xerces's transcoder duties; however, instead of producing UTF-16 it produces a
     43set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}. These lexical bit streams are later transformed
     44into UTF-16 in the Content Buffer Generator, after additional processing is performed.
     45The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
     46It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
     47significant positions within the input data. One bit stream is created for each critical piece of information, such as
     48the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
     49Intra-element well-formedness validation is performed as an artifact of this process.
     50Like Xerces, ICXML must provide the Line and Column position of each error.
     51The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
     52optimized population count algorithm. This is described in Section \ref{section:arch:errorhandling}.
     53
     54
     55
    4456From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
    4557% The output of both are required by the markup processor.
     
    4860frequently throughout the document.
    4961Each name is represented by a distinct symbol structure and global identifier (GID).
    50 Using the information produced by the parallel markup parser, the {\it symbol resolver} uses a bitscan instruction to
     62Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to
    5163iterate through a symbol bit stream (64-bits at a time) to generate a set of GIDs.
    52 % This size of this set is, at most, the length of the input data $\div$ 2, as every symbol must have a terminal character.
    53 One of the main advantages of this is that grammar information can be associated with the symbol itself and help bypass
     64% It keys each symbol on its raw data representation, which means it can potentially be run in parallel with the content stream generator.
     65One of the main advantages of using GIDs is that grammar information can be associated with the symbol itself and help bypass
    5466the lookup cost in the validation process.
    55 The final component of the parabix subsystem is the {\it content stream generator}. This component has a multitude of
     67The final component of the parabix subsystem is the {\it Content Stream Generator}. This component has a multitude of
    5668responsibilities, which will be discussed in Section \ref{sec:parfilter}, but the primary function of this is to produce
    5769output-ready UTF-16 content for the markup processor.
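The bit scan iteration used by the Symbol Resolver can be sketched as follows. This only illustrates how marked positions are pulled out of a bit stream 64 bits at a time; the mapping from each position to a GID is not shown. The function name \verb|markedPositions| is hypothetical, blocks are assumed to be stored least significant bit first, and the portable loop stands in for a bit scan intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the lowest set bit of a non-zero word.
// A portable stand-in for a hardware bit scan intrinsic.
inline unsigned lowestSetBit(std::uint64_t x) {
    assert(x != 0);
    unsigned i = 0;
    while ((x & 1u) == 0) { x >>= 1; ++i; }
    return i;
}

// Collect the absolute positions of all 1-bits in a marker bit stream,
// processing one 64-bit block at a time.
std::vector<std::size_t> markedPositions(const std::vector<std::uint64_t>& stream) {
    std::vector<std::size_t> positions;
    for (std::size_t b = 0; b < stream.size(); ++b) {
        std::uint64_t block = stream[b];
        while (block != 0) {
            positions.push_back(b * 64 + lowestSetBit(block));
            block &= block - 1;  // clear the bit that was just reported
        }
    }
    return positions;
}
```

Blocks that contain no markers are skipped with a single comparison, so the work is proportional to the number of symbols rather than the length of the input.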
     
    6173The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly
    6274to perform in bitspace, such as ensuring every start tag has a matching end tag.
    63 The {\it namespace processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
     75The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
    6476this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to
    6577be associated with each instance of a symbol.
    6678This is discussed in Section \ref{section:arch:namespacehandling}.
    67 The final {\it validation} process is responsible for the same tasks as Xerces's validator, however,
    68 the majority of the grammar look up operations is performed beforehand and stored within the symbols themselves.
     79The final {\it Validation} process is responsible for the same tasks as Xerces's validator; however,
     80the majority of the grammar look up operations are performed beforehand and stored within the symbols themselves.
    6981
    7082\begin{figure}
  • docs/Working/icXML/background-parabix.tex

    r2439 r2470  
    11\subsection{The Parabix Framework}
     2\label{background:parabix}
     3
    24\begin{figure*}[tbhp]
    35\begin{center}