# Changeset 2496

Timestamp:
Oct 19, 2012, 3:01:59 PM (7 years ago)
Message:

temp checkin

Location:
docs/Working/icXML
Files:
5 edited

r2470
  \label{arch:character-set-adapter}
- The first major difference between Xerces and ICXML is the use of Character Set Adapters (CSAs). In Xerces, all input
+ The first major difference between Xerces and \icXML{} is the use of Character Set Adapters (CSAs). In Xerces, all input
  is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and to provide the end-consumer with a single encoding format.
• ## docs/Working/icXML/arch-errorhandling.tex

r2471
  %
  XML errors are rare but they do happen, especially with untrustworthy data sources. Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
- As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal or not;
+ As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
  if not, the current state determines the type and severity of the error.
- ICXML emits errors in the similar manner---but how it discovers them is substantially different. Recall that in Figure \ref{fig:icxml-arch}, ICXML is divided into two sections: the \PS{} and the \MP{}. Each section has its own system for producing the error messages, geared towards the type of processing handled by the module.
+ \icXML{} emits errors in the similar manner---but how it discovers them is substantially different. Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and \MP{}, each with its own system for detecting and producing error messages.
  Within the \PS{}, all computations are performed in parallel, a block at a time.

  (2) column position is counted in characters, not bytes or code units; thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
- Exacerbating these problems is the fact that typical XML documents are error-free but the calculation of the line/column position is a constant overhead in Xerces that must be maintained in the case that one occurs. To reduce this overhead, ICXML pushes the bulk cost of the line/column calculation to the occurence of the error and performs the minimal amount of book-keeping necessary to facilitate the function.
- ICXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
+ Note that typical XML documents are error-free but the calculation of the line/column position is a constant overhead in Xerces.
+ % that must be maintained in the case that one occurs.
+ To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and performs the minimal amount of book-keeping necessary to facilitate it.
+ \icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
  within the Line Column Tracker (LCT).
- One of the CSA's major responsibilities is transcoding an input text from some encoding format to near-output-ready UTF-16.
+ One of the CSA's major responsibilities is transcoding an input text.
+ % from some encoding format to near-output-ready UTF-16.
  During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected and validated.

  column number.
- \begin{figure}[h]
+ \begin{figure}[ht]
  {\bf TODO: An example of a skip mask, error mask, and the raw data and transcoded data for it. Should a multi-byte character be used and/or some CRLFs to show the difficulties?}
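The deferred line/column calculation described in arch-errorhandling.tex can be illustrated with a small sketch. It is not the icXML implementation: it assumes UTF-8 input, models each bit stream as a Python integer (bit `i` describes byte `i`) rather than as SIMD registers, ignores CRLF normalization, and the names `build_masks`, `popcount` and `line_column` are hypothetical. The point it shows is that only two masks are kept while parsing, and the position is computed by population count only when an error is actually reported.

```python
def build_masks(data: bytes) -> tuple[int, int]:
    """Return (newline_mask, continuation_mask); bit i describes byte i."""
    newline_mask = 0
    continuation_mask = 0
    for i, b in enumerate(data):
        if b == 0x0A:
            newline_mask |= 1 << i
        elif b & 0xC0 == 0x80:
            # UTF-8 continuation byte: part of a character, not a new column
            continuation_mask |= 1 << i
    return newline_mask, continuation_mask


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def line_column(newline_mask: int, continuation_mask: int, pos: int) -> tuple[int, int]:
    """1-based (line, column) of the character starting at byte offset pos."""
    below = (1 << pos) - 1                      # bits 0 .. pos-1
    prior_newlines = newline_mask & below
    line = 1 + popcount(prior_newlines)
    # Index just past the last newline before pos (0 if there is none).
    line_start = prior_newlines.bit_length()
    span = below & ~((1 << line_start) - 1)     # bits line_start .. pos-1
    # Count only bytes that start a character, so a multi-byte
    # character advances the column by exactly one.
    column = 1 + popcount(span & ~continuation_mask)
    return line, column
```

For example, with `nl, cont = build_masks("a\u00e9\nbc<".encode("utf-8"))`, calling `line_column(nl, cont, 6)` locates the `<` on the second line, and the two-byte character on the first line counts as a single column.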
• ## docs/Working/icXML/arch-overview.tex

r2483
  \subsection{Overview}
- ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
+ \icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
  rearchitected with pipeline parallelism in mind. In this section, we highlight the core differences between the two systems. As shown in Figure \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
- The {\it Transcoder} converts all input data into UTF16; all text run through this module before being processed as XML. The majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text;
+ The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML; the majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text;
  it keeps track of the current line/column of the cursor (which is reported to the end user in
- the unlikely event that the input file contains an error), performs all line-break normalization
+ the unlikely event that the input contains an error), performs all line-break normalization
  and validates context-specific character set issues, such as tokenization of qualified-names and
- ensuring each character is legal w.r.t. the XML specification.
+ ensures each character is legal w.r.t. the XML specification.
  The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final) representation of the document; it deals with all issues related to entity expansion, validates
- the XML wellformedness constraints and any character set encoding issues that cannot
+ the XML well-formedness constraints and any character set encoding issues that cannot
  be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.) The {\it Namespace Binder}, which is a core piece of their element stack, is primarily tasked
- with handling all namespace scoping issues between different XML vocabularies and faciliates
+ with handling namespace scoping issues between different XML vocabularies and faciliates
  the scanner with the construction and utilization of Schema grammar structures. The {\it Validator} takes the intermediate representation produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches
- the user-defined DTD and Schema grammar(s) before passing the data to the end-user.
+ the user-defined DTD and Schema grammar(s) before passing the information to the end-user.
  \begin{figure}
  \begin{center}
  \includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
+ \caption{Xerces Architecture}
  \label{fig:xerces-arch}
- \caption{Xerces Architecture}
  \end{center}
  \end{figure}
- In ICXML functions are grouped into logical components.
+ In \icXML{} functions are grouped into logical components.
  As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.

  the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process.
- Like Xerces, ICXML must provide the Line and Column position of each error.
+ Like Xerces, \icXML{} must provide the Line and Column position of each error.
  The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
- optimized population count algorithm. This is described in Section \ref{section:arch:errorhandling}.
+ optimized population count algorithm; this is described in Section \ref{section:arch:errorhandling}.
  From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
  % The output of both are required by the \MP{}.
- Apart from the use of the Parabix framework, one of the core differences between ICXML and Xerces is the use of symbols. A typical XML document will contain relatively few unique element and attribute names but each of them will occur frequently throughout the document. Each name is represented by a distinct symbol structure and global identifier (GID).
+ Apart from the Parabix framework, another core difference between Xerces and \icXML{} is the use of symbols. A typical XML document will contain relatively few unique element and attribute names---but each of them will occur frequently throughout the document. In \icXML{}, names are represented by distinct symbol structures and global identifiers (GIDs).
  Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to iterate through a symbol bit stream (64-bits at a time) to generate a set of GIDs.

  the lookup cost in the validation process. The final component of the \PS{} is the {\it Content Stream Generator}. This component has a multitude of
- responsibilities, which will be discussed in Section \ref{sec:parfilter}, but the primary function of this is to produce output-ready UTF-16 content for the \MP{}.
+ responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce near-final UTF-16 content.
- Everything in the \MP{} uses a compressed representation of the document, generated by the symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
+ The {\it \MP{}} parses a compressed representation of the XML document, generated by the symbol resolver and content stream generator, to validate and produce the final (sequential) output for the end user.
  The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly to perform in bitspace, such as ensuring every start tag has a matching end tag. The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
- this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to be associated with each instance of a symbol.
+ this is performed as a discrete phase and simply produces a set of URI identifiers (URI IDs), to be associated with each occurrence of a symbol.
  This is discussed in Section \ref{section:arch:namespacehandling}. The final {\it Validation} process is responsible for the same tasks as Xerces's validator, however,

  \begin{figure}
  \includegraphics[width=0.50\textwidth]{plots/icxml.pdf}
+ \caption{\icXML{} Architecture}
  \label{fig:icxml-arch}
- \caption{ICXML Architecture}
  \end{figure}
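The Symbol Resolver step in arch-overview.tex (a bitscan over a 64-bit word of the symbol bit stream, producing GIDs) can be sketched roughly as below. The details are assumptions made for illustration only: the text does not say how name boundaries are encoded, so this sketch uses two hypothetical masks (`starts` and `ends`), handles a single 64-bit block, and uses a plain dictionary as the symbol table. The function names are likewise hypothetical.

```python
def scan_bits(word: int):
    """Yield the indices of the set bits of a word, lowest first."""
    while word:
        lowest = word & -word           # isolate the lowest set bit
        yield lowest.bit_length() - 1   # its index, as a bitscan-forward would report
        word &= word - 1                # clear that bit and continue


def resolve_symbols(text: str, starts: int, ends: int, table: dict[str, int]) -> list[int]:
    """Map each marked name in one block to its GID, assigning new GIDs on first sight."""
    gids = []
    for s, e in zip(scan_bits(starts), scan_bits(ends)):
        name = text[s:e]
        # A name seen before reuses its GID; a new name gets the next free one.
        gids.append(table.setdefault(name, len(table)))
    return gids
```

Because a document has few unique names that recur often, most iterations hit an existing table entry, which is what makes replacing names with small integer identifiers attractive for later stages.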
• ## docs/Working/icXML/icxml-main.tex

r2490
  \usepackage{graphicx}
  \usepackage{CJKutf8}
  \usepackage{morefloats}
  \begin{document}
  \preprintfooter{short description of paper}   % 'preprint' option specified.
- \title{ICXML:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
+ \def \icXML {icXML}
+ \def \PS {Parabix Subsystem}
+ \def \MP {Markup Processor}
+ \title{\icXML{}:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
  %\subtitle{Subtitle Text, if any}
  \authorinfo{Anonymous Hackers} {} {}
  % \authorinfo{Nigel Medforth \and Dan Lin \and Kenneth S. Herdy \and Arrvindh Shriraman \and Robert D. Cameron }
  %            {International Characters, Inc., and Simon Fraser University}
  \maketitle
- \def \icXML {icXML}
- \def \PS {Parabix Subsystem}
- \def \MP {Markup Processor}
  \begin{abstract}

  the structure of the Xerces and Parabix XML parsers and the fundamental differences between the two parsing models.   Section 3 then presents
- the icXML design based on a restructured Xerces architecture to
+ the \icXML{} design based on a restructured Xerces architecture to
  incorporate SIMD parallelism using Parabix methods.   Section 4 presents a performance study demonstrating substantial end-to-end acceleration of a GML-to-SVG translation application written against the Xerces API.
- Section 5 moves on to consider the multithreading of the icXML architecture
+ Section 5 moves on to consider the multithreading of the \icXML{} architecture
  using the pipeline parallelism model.  Section 6 concludes the paper with a discussion of future work and the potential for
r2473
  \section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
- \subsection{Pipeline Strategy for ICXML}
- As discussed in section \ref{}, Xerces can be considered as a complex finite-state machine. Finite-state machine belongs to the hardest application class to parallelize and process efficiently among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}. However, ICXML reconstructs Xerces and provides logical layers between modules,
+ \subsection{Pipeline Strategy for \icXML{}}
+ % As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine
+ % Finite-state machine belongs to the hardest application class to parallelize and process efficiently
+ % among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}.
+ % However, \icXML{} reconstructs Xerces and provides logical layers between modules,
+ % which naturally enables pipeline parallel processing.
+ As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine, the hardest type of application to parallelize and process efficiently \cite{Asanovic:EECS-2006-183}. However, \icXML{} provides logical layers between modules,
  which naturally enables pipeline parallel processing.

  In this case, the first thread $T_1$ will read 16k of XML input $I$ at a time and process all the modules in Parabix Subsystem to generates
- content buffer, symbol array, URI array, context ID array and store them to a pre-allocated shared data structure $S$. The second thread $T_2$ reads the shared data provided by the first thread and
+ % content buffer, symbol array, URI array, context ID array and store them to a pre-allocated shared data structure $S$.
+ content buffer, symbol array, URI array, and store them to a pre-allocated shared data structure $S$. The second thread $T_2$ consumes the data provided by the first thread and
  goes through all the modules in Markup Processor and writes output $O$.
  \begin{figure}
- \includegraphics[width=0.50\textwidth]{plots/threads_timeline1.pdf}
+ \includegraphics[width=0.45\textwidth]{plots/threads_timeline1.pdf}
  \caption{}
  \label{threads_timeline1}
  \end{figure}
  \clearpage
  \begin{figure}
- \includegraphics[width=0.50\textwidth]{plots/threads_timeline2.pdf}
+ \includegraphics[width=0.45\textwidth]{plots/threads_timeline2.pdf}
  \caption{}
  \label{threads_timeline2}
  \end{figure}
  \clearpage
  \subsection{Performance Comparison}
  \begin{figure}
  \begin{center}
- \includegraphics[width=0.50\textwidth]{plots/single-multi-thread.pdf}
+ \includegraphics[width=0.45\textwidth]{plots/single-multi-thread.pdf}
  \caption{Performance comparison of single-thread vs. multithread without namespace}
  \label{single-multi-thread}
  \end{figure}
  \begin{figure}
- \includegraphics[width=0.50\textwidth]{plots/single-multi-thread_ns.pdf}
+ \includegraphics[width=0.45\textwidth]{plots/single-multi-thread_ns.pdf}
  \caption{Performance comparison of single-thread vs. multithread with namespace}
  \label{single-multi-thread_ns}

  \begin{figure}
  \begin{center}
- \includegraphics[width=0.50\textwidth]{plots/threads_comp.pdf}
+ \includegraphics[width=0.45\textwidth]{plots/threads_comp.pdf}
  \caption{Performance comparison of the two threads without namespace}
  \label{threads_comp}
  \end{figure}
  \begin{figure}
- \includegraphics[width=0.50\textwidth]{plots/threads_comp_ns.pdf}
+ \includegraphics[width=0.45\textwidth]{plots/threads_comp_ns.pdf}
  \caption{Performance comparison of the two threads with namespace}
  \label{threads_comp_ns}
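The two-thread hand-off described in the pipeline strategy text ($T_1$ reads 16k of input at a time and fills the shared structure $S$; $T_2$ consumes it and writes the output $O$) can be sketched as a bounded producer/consumer pair. This is only an illustration of the hand-off pattern, not the icXML code: the bounded `queue.Queue` stands in for the pre-allocated structure $S$, `stage1` and `stage2` are hypothetical placeholders for the Parabix Subsystem and the Markup Processor, and Python threads will not show the speedup a native implementation could.

```python
import queue
import threading

CHUNK = 16 * 1024   # T1 reads 16k of input at a time
_DONE = object()    # sentinel marking the end of the input


def run_pipeline(data: bytes, stage1, stage2, slots: int = 4) -> list:
    """Run stage1 on each chunk in one thread and stage2 on its results in another."""
    shared = queue.Queue(maxsize=slots)   # bounded buffer standing in for S
    out = []                              # stands in for the output O

    def t1():
        try:
            for off in range(0, len(data), CHUNK):
                # Blocks when all slots are full, so T1 cannot run far ahead of T2.
                shared.put(stage1(data[off:off + CHUNK]))
        finally:
            shared.put(_DONE)             # always release the consumer

    def t2():
        while (item := shared.get()) is not _DONE:
            out.append(stage2(item))

    threads = [threading.Thread(target=t1), threading.Thread(target=t2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The bounded queue is a design choice of this sketch: it caps memory use and keeps the two stages within a few chunks of each other, which mirrors the role a fixed-size shared structure would play.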