# Changeset 2866

Ignore:
Timestamp:
Jan 30, 2013, 4:12:43 PM (6 years ago)
Message:

edits

Location:
docs/Working/icXML
Files:
8 edited

### Legend:

Unmodified
Removed

 r2522 additional cost. In short, the cost of individual character transcoding is avoided whenever they constitute lexical items, whenever a block of input is confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name. The cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name. Furthermore, when transcoding is required, the parallel bit stream representation generally supports efficient transcoding operations.   In the important supports efficient transcoding operations.   In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
• ## docs/Working/icXML/arch-errorhandling.tex

 r2505 datum that was surpressed from the source during the production of the content stream. Armed with these, it is possible to calculate the actual line/column using the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the cursor position. the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
• ## docs/Working/icXML/arch-overview.tex

 r2532 \subsection{Overview} \begin{figure} \begin{center} \includegraphics[width=0.15\textwidth]{plots/xerces.pdf} \caption{Xerces Architecture} \label{fig:xerces-arch} \end{center} \end{figure} \def \CSG{Stream Generator} \icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML; the majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text; it keeps track of the current line/column of the cursor (which is reported to the end user in the unlikely event that the input contains an error), performs all line-break normalization and validates context-specific character set issues, such as tokenization of qualified-names and ensures each character is legal \wrt{} the XML specification. The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current line/column position, %(which is reported in the unlikely event that the input contains an error), performs line-break normalization and validates context-specific character set issues, such as tokenization of qualified-names. The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR) of the document; it deals with all issues related to entity expansion, validates be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.) The {\it Namespace Binder}, which is a core piece of the element stack, is primarily tasked with handling namespace scoping issues between different XML vocabularies and faciliates the scanner with the construction and utilization of Schema grammar structures. The {\it Namespace Binder} is a core piece of the element stack. It handles namespace scoping issues between different XML vocabularies. This allows the scanner to properly select the correct schema grammar structures. The {\it Validator} takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and Schema grammar(s) before passing it to the end-user. the user-defined DTD and schema grammar(s) before passing it to the end-user. \begin{figure} \includegraphics[width=0.47\textwidth]{plots/icxml.pdf} \caption{\icXML{} Architecture} \label{fig:icxml-arch} \begin{figure}[h] \begin{center} \includegraphics[height=0.45\textheight,keepaspectratio]{plots/xerces.pdf} \caption{Xerces Architecture} \label{fig:xerces-arch} \end{center} \end{figure} mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}. These lexical bit streams are later transformed into UTF-16 in the Content Stream Generator, These lexical bit streams are later transformed into UTF-16 in the \CSG{}, after additional processing is performed. The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase. Intra-element well-formedness validation is performed as an artifact of this process. Like Xerces, \icXML{} must provide the Line and Column position of each error. The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}. From here, two data-independent branches exist: the Symbol Pesolver and Content Preperation Unit. \icXML{} represents elements and attributes as distinct data structures, called symbols, each with their own global identifier (GID). Using the {\bf symbol marker streams} produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through the raw data to produce a stream (series) of GIDs, called the {\it symbol stream}. A typical XML file will contain relatively few unique element and attribute names---but each of them will occur frequently. % throughout the document. % Grammar information can be associated with each symbol and can help reduce the look-up cost of the later Validation process. A typical XML file contains few unique element and attribute names---but each of them will occur frequently. \icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID). Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through the raw data to produce a sequence of GIDs, called the {\it symbol stream}. The final components of the \PS{} are the {\it Content Preperation Unit} and {\it Content Stream Generator}. The final components of the \PS{} are the {\it Content Preperation Unit} and {\it \CSG{}}. The former takes the (transposed) basis bit streams and selectively filters them, according to the information provided by the Parallel Markup Parser, and the latter transforms the associated with each symbol occurrence. This is discussed in Section \ref{section:arch:namespacehandling}. Finally, the {\it Validation} layer mimics the Xerces's validator; however the majority of the grammar look-ups are performed beforehand and stored within the symbol themselves. Finally, the {\it Validation} layer implements the Xerces's validator. However, preprocessing associated with each symbol greatly reduces the work of this stage. \begin{figure}[h] \begin{center} \includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf} \end{center} \caption{\icXML{} Architecture} \label{fig:icxml-arch} \end{figure}
• ## docs/Working/icXML/background-fundemental-differences.tex

 r2522 XML-specific markup, such as a left angle bracket \verb<'', and the content held within the document. As the parser moves its cursor through the document, it alternates between markup scanning, validation, and content processing operations.  In other words, Xerces is a complex finite-state machine that use byte comparisons to transition between data and metadata states. Each state transition indicates the context for subsequent characters. Unfortunately, textual data tends to consist of variable-length items sequenced in generally unpredictable patterns; thus any character could be a state transition until deemed otherwise. As the parser progresses through the document, it alternates between markup scanning, validation and content processing modes. In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{ Herein FSM applications are software systems whose behavior is defined by the inputs, current state and the events associated with transitions of states.}. Each state transition indicates the processing context of subsequent characters. Unfortunately, textual data tends to be unpredictable and any character could induce a state transition. % Unfortunately, textual data tends to consist of variable-length strings sequenced in % unpredictable patterns. % Each character must be examined in sequence because any character could be a state transition until deemed otherwise. % Parallel: blocks/segments/buffers through layers Parabix-style XML parsers utilize a concept of layers: as each block of source text is transformed into a set of lexical bit streams, it undergoes a series of operations that can be grouped together in logical layers, such as transposition, character classification, and the lexical analysis phases. Each layer is pipeline parallel, requiring no speculation nor pre-parsing\cite{HPCA2012}. In adapting to the requirements of the Xerces sequential parsing API, however, the resultant parallel bit streams, taken individually, may out-of-order \wrt{} the source document.  Hence they must be amalgamated and iterated through to produce sequential output. % The end user should not be expected to work with out-of-order data ... % a block of input % text is transformed into a set of lexical bit streams. Operations are then % performed on these streams to identify key positions in the input data and % perform intra-element well-formedness validation (as an artifact of the % identification process.) Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical bit streams, which undergo a series of operations that can be grouped into logical layers, e.g., transposition, character classification, and lexical analysis. Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}. % In adapting to the requirements of the Xerces sequential parsing API, % however, the resultant parallel bit streams may out-of-order \wrt{} the source document. % Hence they must be amalgamated and iterated through to produce sequential output. To meet the API requirements of the document-ordered Xerces output, the results of the Parabix processing layers must be interleaved to produce the equivalent behavior.
• ## docs/Working/icXML/background-parabix.tex

 r2530 \subsection{The Parabix Framework} \label{background:parabix} \begin{figure*}[tbh] \begin{center} \begin{tabular}{cr}\\ Source Data & \verbfeefum\\ Tag Openers & \verb1____________1____________________________1____________1__________\\ Start Tag Marks & \verb_1____________1___________________________________________________\\ End Tag Marks & \verb___________________________________________1____________1_________\\ Empty Tag Marks & \verb__________________________________________________________________\\ Element Names & \verb_11111111_____1111111_____________________________________________\\ Attribute Names & \verb______________________11_______11_________________________________\\ Attribute Values & \verb__________________________111________111__________________________\\ % String Ends & \verb1____________1_______________1__________1_1____________1__________\\ % Markup Identifiers & \verb_________1______________1_________1______1_1____________1_________\\ % Deletion Mask & \verb_11111111_____1111111111_1____1111_11_______11111111_____111111111\\ % Undeleted Data & \verb{\tt\it 0}\verb________>fee{\tt\it 0}\verb__________=_fie{\tt\it 0}\verb____=__foe{\tt\it 0}\verb>{\tt\it 0}\verb/________fum{\tt\it 0}\verb/_________ \end{tabular} \end{center} \caption{XML Source Data and Derived Parallel Bit Streams} \label{fig:parabix1} \end{figure*} The Parabix (parallel bit stream) framework is a transformative approach to XML parsing are used to classify the input bits into a set of {\it character-class bit streams}, which identify key characters (or groups of characters) with a $1$. For example, one of the fundemental characters in XML is a left-angle bracket. For example, one of the fundamental characters in XML is a left-angle bracket. A character is an \verb<' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. The second is that byte-at-a-time loop scanning loops are actually often just computing a single bit of information per iteration: is the scan complete at this position yet?  Rather than computing these individual decision-bits, an approach that computes is the scan complete yet? Rather than computing these individual decision-bits, an approach that computes many of them in parallel (e.g., 128 bytes at a time using 128-bit registers) should provide substantial benefit. Previous studies have shown Parabix approach improves many aspects of XML processing, Previous studies have shown that the Parabix approach improves many aspects of XML processing, including transcoding \cite{Cameron2008}, character classification and validation, tag parsing and well-formedness checking. well as combining SIMD methods with 4-stage pipeline parallelism to further improve throughput \cite{HPCA2012}. Although these research prototypes handled the full syntax of {\bf grammarless} XML documents, they lacked the functionality required by full XML parsers. Namely, commercial XML processors, such as Xerces, as support for transcoding of multiple character sets, the ability to parse and validate against DTD grammars, both internal and external, facilities for handling different XML vocabularies through namespace processing, as well validation against XML Schema grammars. Additionally, commercial parsers can be expected to provide a number of API facilities beyond those found in research prototypes, including full implementations of the widely used SAX, SAX2 and DOM interfaces. Although these research prototypes handled the full syntax of schema-less XML documents, they lacked the functionality required by full XML parsers. Commercial XML processors support transcoding of multiple character sets and can parse and validate against multiple document vocabulaties. Additionally, they provide API facilities beyond those found in research prototypes, including the widely used SAX, SAX2 and DOM interfaces.
• ## docs/Working/icXML/background-xerces.tex

 r2530 of structure and data validation through multiple grammars declared using either legacy DTDs (document type definitions) or modern XML schema facilities. definitions) or modern XML Schema facilities. Xerces also supports several APIs for accessing parser services, including event-based parsing
• ## docs/Working/icXML/icxml-main.tex

 r2862 \input{abstract.tex} \section{Introduction} \section{Introduction} Parallelization and acceleration of XML parsing is a widely end-to-end acceleration of XML processing. To achieve the best results possible, we undertook a comprehensive restructuring of the Xerces-C++ parser, a nine-month comprehensive restructuring of the Xerces-C++ parser, seeking to expose as many critical aspects of XML parsing as possible for parallelization.   Overall, we have employed Parabix-style methods in transcoding, tokenization and tag parsing,  parallel string comparison methods in symbol resolution, bit parallel methods in namespace processing, as well as staged processing with pipeline parallelism to take advantage of as possible for parallelization, the result of which we named icXML. Overall, we employed Parabix-style methods of transcoding, tokenization and tag parsing, parallel string comparison methods in symbol resolution, bit parallel methods in namespace processing, as well as staged processing using pipeline parallelism to take advantage of multiple cores. applying the techniques discussed herein in other application domains. \section{Background} \label{background} \input{background-xerces} \begin{figure*}[th] \begin{center} \begin{tabular}{rr}\\ Source Data & \verbfeefum\\ Tag Openers & \verb1____________1____________________________1____________1__________\\ Start Tag Marks & \verb_1____________1___________________________________________________\\ End Tag Marks & \verb___________________________________________1____________1_________\\ Empty Tag Marks & \verb__________________________________________________________________\\ Element Names & \verb_11111111_____1111111_____________________________________________\\ Attribute Names & \verb______________________11_______11_________________________________\\ Attribute Values & \verb__________________________111________111__________________________\\ % String Ends & \verb1____________1_______________1__________1_1____________1__________\\ % Markup Identifiers & \verb_________1______________1_________1______1_1____________1_________\\ % Deletion Mask & \verb_11111111_____1111111111_1____1111_11_______11111111_____111111111\\ % Undeleted Data & \verb{\tt\it 0}\verb________>fee{\tt\it 0}\verb__________=_fie{\tt\it 0}\verb____=__foe{\tt\it 0}\verb>{\tt\it 0}\verb/________fum{\tt\it 0}\verb/_________ \end{tabular} \end{center} \caption{XML Source Data and Derived Parallel Bit Streams} \label{fig:parabix1} \end{figure*} \input{background-parabix} \input{background-fundemental-differences.tex}
• ## docs/Working/icXML/parfilter.tex

 r2530 \begin{figure*}[tbh] \begin{center} \begin{tabular}{cr}\\ \begin{tabular}{rr}\\ Source Data & \verbfeefum\\ % Tag Openers & \verb1____________1____________________________1____________1__________`\\
Note: See TracChangeset for help on using the changeset viewer.