Changeset 2471 for docs/Working


Timestamp:
Oct 17, 2012, 2:54:12 PM (7 years ago)
Author:
nmedfort
Message:

some edits

Location:
docs/Working/icXML
Files:
6 edited

  • docs/Working/icXML/arch-errorhandling.tex

    r2470 r2471  
    22\label{section:arch:errorhandling}
    33
    4 % Challenges / Line Col Tracker
    5 
     4% XML errors are rare but they do happen, especially with untrustworthy data sources.
    65Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
    76As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal or not;
    87if not, the current state determines the type and severity of the error.
    9 ICXML emits errors in the similar manner---but how it discovers them differs substantially.
    10 Recall that in Figure \ref{fig:icxml-arch}, ICXML is divided into two sections: the Parabix subsystem and
    11 the markup processor. Each section has its own system for producing the error messages, geared towards the type
     8ICXML emits errors in a similar manner---but how it discovers them is substantially different.
     9
     10Recall that in Figure \ref{fig:icxml-arch}, ICXML is divided into two sections: the \PS{} and
     11the \MP{}. Each section has its own system for producing the error messages, geared towards the type
    1212of processing handled by the module.
    1313
    14 Within the Parabix subsystem, all computations are performed in parallel, a block at a time.
     14Within the \PS{}, all computations are performed in parallel, a block at a time.
    1515Errors are derived as artifacts of bit stream calculations, with a 1-bit marking the byte-position of an error within a block,
    1616and the type of error is determined by the equation that discovered it.
     
    5151\end{figure}
    5252
    53 The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces.
    54 However, line/column tracking within it is a much more difficult problem. The Markup Processor parses the content stream,
    55 which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol
    56 data and unnecessary whitespace is eliminated from the stream.
    57 This means it is impossible to directly assess the current location with only the content stream.
    58 To calculate this, the Markup Processor borrows three additional pieces of information from the Parabix subsystem:
    59 the line-feed, skip mask, and a {\it deletion mask stream}, which is a bit stream that denotes every code-unit that
    60 was surpressed from the raw data during the production of the content stream.
    61 
    62 
    63 Armed with the cursor position in
    64 the content stream,
    65 
     53The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces.
     54However, reporting the correct line/column is a much more difficult problem.
     55The \MP{} parses the content stream, which is a series of tagged UTF-16 strings.
     56Each string is normalized in accordance with the XML specification.
     57All symbol data and unnecessary whitespace is eliminated from the stream.
     58This means it is impossible to directly assess the current location using only the cursor position within the content stream.
     59To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
     60the line-feed, skip mask, and a {\it deletion mask stream}, which is a bit stream denoting the (code-unit) position of every
     61datum that was suppressed from the source during the production of the content stream.
     62Armed with these, it is possible to calculate the actual line/column using
     63the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the cursor position.
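
The location recovery described in the added lines above can be sketched as follows. This is only an illustration under stated assumptions, not the ICXML implementation: the function name `locate`, the use of Python lists of booleans in place of bit streams, the 0-based cursor, and the 1-based line/column convention are all hypothetical, and the skip mask is left out because the text does not define it here.

```python
def locate(line_feed, deletion_mask, cursor):
    """Map a content-stream cursor back to a 1-based (line, column) in the source.

    line_feed[i]     -- True if source code unit i is a line feed (assumed layout)
    deletion_mask[i] -- True if source code unit i was removed from the content stream
    cursor           -- 0-based index of the target code unit in the content stream
    """
    line, column, kept = 1, 1, 0
    for is_lf, deleted in zip(line_feed, deletion_mask):
        if not deleted:
            # 'kept' is the running sum of the negated deletion mask.
            if kept == cursor:
                return line, column
            kept += 1
        if is_lf:
            line, column = line + 1, 1
        else:
            column += 1
    return line, column  # cursor lies past the last retained code unit


# Toy source: only the text "hi" (indices 6 and 7) survives into the content stream.
source = "<a>\n  hi</a>"
line_feed = [c == "\n" for c in source]
deletion_mask = [i not in (6, 7) for i in range(len(source))]

print(locate(line_feed, deletion_mask, 1))  # prints (2, 4)
```

The sketch walks the source one code unit at a time for readability; the text implies the real system would work on whole blocks of the bit streams instead.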
  • docs/Working/icXML/arch-namespace.tex

    r2470 r2471  
    3939in one of two forms:
    4040(1) those that declare a set of namespaces upfront and never change them, and
    41 (2) those that repeatidly modify the namespace scope within the document in predictable patterns.
     41(2) those that repeatedly modify the namespaces in predictable patterns.
    4242
    4343For that reason, ICXML contains an independent namespace stack and utilizes bit vectors to cheaply perform
  • docs/Working/icXML/arch-overview.tex

    r2470 r2471  
    22
    33ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
    4 rearchitected into pipeline-parallel ready structure.
     4rearchitected with pipeline parallelism in mind.
    55In this section, we highlight the core differences between the two systems and discuss how they
    66differ design wise.
    77As shown in Figure \ref{fig:xerces-arch}, Xerces
    8 is comprised of five main modules: the reader, transcoder, scanner, namespace binder, and validator.
     8is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
    99The {\it Transcoder} converts all input data into UTF-16; all text runs through this module before
    1010being processed as XML. The majority of the character set encoding validation is performed
     
    2020be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    2121and normalization of character references, etc.)
    22 The {\it Namespace binder} is primarily tasked with handling all namespace scoping issues between
    23 different XML vocabularies and faciliates the scanner with the construction and utilization
    24 of Schema grammar structures.
     22The {\it Namespace Binder}, which is a core piece of their element stack, is primarily tasked
     23with handling all namespace scoping issues between different XML vocabularies and facilitates
     24the scanner with the construction and utilization of Schema grammar structures.
    2525The {\it Validator} takes the intermediate representation produced by the Scanner (and
    2626potentially annotated by the Namespace Binder) and assesses whether the final output matches
    27 the user-defined DTD and Schema grammar(s).
     27the user-defined DTD and Schema grammar(s) before passing the data to the end-user.
    2828
    2929\begin{figure}
     
    3535\end{figure}
    3636
    37 In ICXML, tasks, as shown in Figure \ref{fig:icxml-arch} are grouped into logical components.
    38 Two major categories of functions exist: those in the parabix subsystem, and
    39 those in the markup processor. All tasks in the parabix subsystem use the parabix framework {\bf (citation?)} and represent
    40 data as a series of bit streams, which are discussed in Section \ref{background:parabix}.
    41 The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
    42  closely mirrors Xerces's transcoder duties; however instead of producing UTF16 it produces a
    43 set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}. These lexical bit streams are later transformed
    44 into UTF-16 in the Content Buffer Generator, after additional processing is performed.
     37In ICXML, functions are grouped into logical components.
     38As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
     39All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
     40The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
     41mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
     42set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
     43These lexical bit streams are later transformed into UTF-16 in the Content Buffer Generator, after additional processing is performed.
    4544The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
    4645It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
     
    5554
    5655From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
    57 % The output of both are required by the markup processor.
     56% The output of both are required by the \MP{}.
    5857Apart from the use of the Parabix framework, one of the core differences between ICXML and Xerces is the use of symbols.
    5958A typical XML document will contain relatively few unique element and attribute names but each of them will occur
     
    6564One of the main advantages of using GIDs is that grammar information can be associated with the symbol itself and help bypass
    6665the lookup cost in the validation process.
    67 The final component of the parabix subsystem is the {\it Content Stream Generator}. This component has a multitude of
     66The final component of the \PS{} is the {\it Content Stream Generator}. This component has a multitude of
    6867responsibilities, which will be discussed in Section \ref{sec:parfilter}, but the primary function of this is to produce
    69 output-ready UTF-16 content for the markup processor.
     68output-ready UTF-16 content for the \MP{}.
    7069
    71 Everything in the markup processor uses a compressed representation of the document, generated by the
     70Everything in the \MP{} uses a compressed representation of the document, generated by the
    7271symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
    7372The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly
  • docs/Working/icXML/background-fundemental-differences.tex

    r2429 r2471  
    2323phases. Each layer is pipeline parallel, as they require no speculation nor
    2424pre-parsing stages\cite{HPCA2012}.
    25 The disadvantage of this approach is that, taken individually, the resultant lexical
     25The disadvantage of this approach is that, taken individually, the resultant parallel
    2626bit streams may be out-of-order w.r.t. the source document and must be amalgamated and
    2727iterated through to produce sequential output.
     28% The end user should not be expected to work with out-of-order data ...
    2829
    2930% a block of input
  • docs/Working/icXML/background-parabix.tex

    r2470 r2471  
    105105often just computing a single bit of information per iteration:
    106106is the scan complete at this position yet?  Rather than
    107 computing these bits one at a time, an approach that computes
    108 many of them in parallel (e.g., 128 with SSE registers) should
    109 provide substantial {\bf benefit}.
    110 Previous studies have shown the performance {\bf benefits} of the
    111 Parabix approach in many aspects of XML processing, including transcoding\cite{Cameron2008},
    112 character classification and validation, tag parsing and well-formedness
    113 checking.  The first Parabix parser used processor bit scan instructions
    114 to considerably accelerate sequential scanning loops for individual
    115 characters \cite{CameronHerdyLin2008}.
     107computing these individual decision-bits, an approach that computes
     108many of them in parallel (e.g., 128) should provide substantial benefit.
     109
     110Previous studies have shown that the Parabix approach improves many aspects of XML processing,
     111including transcoding \cite{Cameron2008}, character classification and validation,
     112tag parsing and well-formedness checking. 
     113The first Parabix parser used processor bit scan instructions to considerably accelerate
     114sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
    116115Recent work has incorporated a method of parallel
    117116scanning using bitstream addition \cite{cameron-EuroPar2011}, as
    118117well as combining SIMD methods with 4-stage pipeline parallelism to further improve
    119118throughput \cite{HPCA2012}.
    120 
    121119Although these research prototypes handle the full syntax of
    122 DTD-less XML documents, including well-formedness checking, they fall
    123 short of the functionality required in full XML parser for several reasons. Namely,
    124 commercial XML processors, such as Xerces, include a number of additional facilities such
     120DTD-less XML documents, they lacked the functionality required by full XML parsers.
     121Namely, commercial XML processors, such as Xerces, include additional facilities such
    125122as support for transcoding of multiple character sets,
    126123the ability to parse and validate against DTDs, both internal and external,
    127124facilities for handling different XML vocabularies through namespace
    128 processing, as well validation against XML schema.  In addition,
    129 commercial parsers can be expected to provide a number of API
     125processing, as well as validation against XML Schema grammars. 
     126Additionally, commercial parsers can be expected to provide a number of API
    130127facilities beyond those found in research prototypes, including
    131128full implementations of the widely used SAX, SAX2 and DOM interfaces.
  • docs/Working/icXML/icxml-main.tex

    r2455 r2471  
    5252\maketitle
    5353
     54\def \icXML {icXML}
     55\def \PS {Parabix Subsystem}
     56\def \MP {Markup Processor}
     57
    5458\begin{abstract}
    5559\input{abstract.tex}
    5660\end{abstract}
    57 
    58 \def \icXML {icXML}
    5961
    6062\category{CR-number}{subcategory}{third-level}