Timestamp:
Jan 30, 2013, 6:03:41 PM
Author:
nmedfort
Message:

edits

File:
1 edited

  • docs/Working/icXML/arch-overview.tex

    r2871 r2872  
    37 37 In \icXML{} functions are grouped into logical components.
    38 38 As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
    39 All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
       39 All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
    40 40 The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
    41 41 mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
    42 set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
    43 These lexical bit streams are later transformed into UTF-16 in the \CSG{},
       42 set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
       43 These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
    44 44 after additional processing is performed.
    45 45 The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
    46 It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
    47 significant positions within the input data. One bit stream for each of the critical piece of information is created, such as
       46 It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
       47 significant positions within the input data. One \bitstream{} for each of the critical piece of information is created, such as
    48 48 the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
    49 49 Intra-element well-formedness validation is performed as an artifact of this process.
     
    51 51 The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
    52 52 optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
    53 From here, two data-independent branches exist: the Symbol Pesolver and Content Preperation Unit.
       53 From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.
    54 54
    55 55 A typical XML file contains few unique element and attribute names---but each of them will occur frequently.
     
    58 58 the raw data to produce a sequence of GIDs, called the {\it symbol stream}.
    59 59
    60 The final components of the \PS{} are the {\it Content Preperation Unit} and {\it \CSG{}}.
    61 The former takes the (transposed) basis bit streams and selectively filters them, according to the
       60 The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
       61 The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
    62 62 information provided by the Parallel Markup Parser, and the latter transforms the
    63 63 filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
     
    65 65 Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document.
    66 66 The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
    67 The {\it Final WF checker} performs inter-element wellformedness validation that would be too costly
    68 to perform in bitspace, such as ensuring every start tag has a matching end tag.
       67 The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
       68 to perform in bit space, such as ensuring every start tag has a matching end tag.
    69 69 Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
    70 70 it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are
     
    81 81 \label{fig:icxml-arch}
    82 82 \end{figure}
    83 
    84 
    85 % Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
    86 % - over-reliance on hash tables when domain knowledge dictated none would be needed
    87 % - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
    88 % - abundant use of heap allocated memory
    89 % - text conversions done in multiple areas
    90 % - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk
    91 
    92 % As the previous section aluded, the greatest difference between sequential parsing methods
    93 % and the Parabix parsing model is how data is processed.
    94 % Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
    95 % tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
    96 % events in sequential order, with all attribute data associated with the correct tag.
    97 %
    98 %
    99 
    100 % The Parabix framework, however, does not allow for this (and would be hindered performance wise if
    101 % forced to.)
    102 % Thus our first question was, ``How can we how can we take full advantage
    103 % of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
    104 % when given an input text.
    105 %
    106 % By analyzing Xerces internal data structures and its produced output, two major observations were obvious:
    107 % (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
    108 % internally (within the grammar structures and hash tables) and externally (for the end user).
    109 % (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
    110 % declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
    111 % grammar reference in place of the element or attribute string.
    112 
    113 
    114 
    115 
    116 
    117 %   From Xerces to icXML
    118 %
    119 %   - Philosophy:  Maximizing Bit Stream Processing
    120 %
    121 %   - Character Set Adapters vs. Transcoding
    122 %   - Bitstreams 1: Charset Validation and Transcoding equations
    123 %   - Bitstreams 2: Parabix style parsing and validation
    124 %
    125 %   - Bitstreams 3: Parallel filtering and normalization
    126 %           - LB normalization
    127 %           - reference compression -> single code unit speculation
    128 %           - parallel string termination
    129 %
    130 %   - Bitstreams 4: Symbol processing
    131 %
    132 %   - From bit streams to doublebyte streams: the content buffer
    133 %     
    134 %   - Namespace Processing: A Bitset approach.
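
Illustrative sketch (not part of r2872): the changed text describes marker \bitstream{}s in which a 1-bit identifies a significant position, and a Line-Column Tracker built on population counts. The Python below is a minimal sketch of both ideas, assuming a stream can be modelled as an arbitrary-precision integer whose bit i corresponds to input position i. The names marker_stream and line_number are hypothetical. icXML itself is written in C++ on top of Parabix, and its optimized population count algorithm is not shown in this changeset, so this only demonstrates the principle.

```python
from typing import Callable


def marker_stream(text: str, predicate: Callable[[str], bool]) -> int:
    """Return an integer whose bit i is 1 iff predicate(text[i]) is true."""
    bits = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            bits |= 1 << i
    return bits


def line_number(newline_marks: int, pos: int) -> int:
    """1-based line of position pos: one plus the newlines strictly before pos."""
    before = newline_marks & ((1 << pos) - 1)  # keep only bits 0 .. pos-1
    return bin(before).count("1") + 1          # population count of the prefix


# Hypothetical three-line input, used only for illustration.
doc = "<a>\n<b/>\n</a>"
open_angle = marker_stream(doc, lambda c: c == "<")   # 1-bits at positions 0, 4, 9
newlines = marker_stream(doc, lambda c: c == "\n")    # 1-bits at positions 3, 8

# Line of each '<' found in the marker stream.
lines_of_tags = [line_number(newlines, i)
                 for i in range(len(doc)) if (open_angle >> i) & 1]
```

Here lines_of_tags holds the line on which each '<' occurs. A production implementation would presumably count bits a block at a time with a hardware popcount rather than scanning a prefix per query.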