Changeset 2496


Ignore:
Timestamp:
Oct 19, 2012, 3:01:59 PM (7 years ago)
Author:
nmedfort
Message:

temp checkin

Location:
docs/Working/icXML
Files:
5 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icXML/arch-charactersetadapters.tex

    r2470 r2496  
    22\label{arch:character-set-adapter}
    33
    4 The first major difference between Xerces and ICXML is the use of Character Set Adapters (CSAs). In Xerces, all input
     4The first major difference between Xerces and \icXML{} is the use of Character Set Adapters (CSAs). In Xerces, all input
    55is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and to provide the end-consumer with a single
    66encoding format.
  • docs/Working/icXML/arch-errorhandling.tex

    r2471 r2496  
    44% XML errors are rare but they do happen, especially with untrustworthy data sources.
    55Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
    6 As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal or not;
     6As Xerces parses a file, it uses context-dependant logic to assess whether the next character is legal;
    77if not, the current state determines the type and severity of the error.
    8 ICXML emits errors in the similar manner---but how it discovers them is substantially different.
    9 
    10 Recall that in Figure \ref{fig:icxml-arch}, ICXML is divided into two sections: the \PS{} and
    11 the \MP{}. Each section has its own system for producing the error messages, geared towards the type
    12 of processing handled by the module.
     8\icXML{} emits errors in the similar manner---but how it discovers them is substantially different.
     9Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and \MP{},
     10each with its own system for detecting and producing error messages.
    1311
    1412Within the \PS{}, all computations are performed in parallel, a block at a time.
     
    2119(2) column position is counted in characters, not bytes or code units;
    2220thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
    23 Exacerbating these problems is the fact that typical XML documents are error-free but the calculation of the
    24 line/column position is a constant overhead in Xerces that must be maintained in the case that one occurs.
    25 To reduce this overhead, ICXML pushes the bulk cost of the line/column calculation to the occurence of the error and
    26 performs the minimal amount of book-keeping necessary to facilitate the function.
    27 ICXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
     21Note that typical XML documents are error-free but the calculation of the
     22line/column position is a constant overhead in Xerces. % that must be maintained in the case that one occurs.
     23To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and
     24performs the minimal amount of book-keeping necessary to facilitate it.
     25\icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
    2826within the Line Column Tracker (LCT).
    29 One of the CSA's major responsibilities is transcoding an input text from some encoding format to near-output-ready UTF-16.
     27One of the CSA's major responsibilities is transcoding an input text. % from some encoding format to near-output-ready UTF-16.
    3028During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
    3129and validated.
     
    4442column number.
    4543
    46 \begin{figure}[h]
     44\begin{figure}[ht]
    4745{\bf TODO: An example of a skip mask, error mask, and the raw data and transcoded data for it.
    4846Should a multi-byte character be used and/or some CRLFs to show the difficulties?}
  • docs/Working/icXML/arch-overview.tex

    r2483 r2496  
    11\subsection{Overview}
    22
    3 ICXML is more than an optimized version of Xerces. Many components were grouped, restructured and
     3\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and
    44rearchitected with pipeline parallelism in mind.
    55In this section, we highlight the core differences between the two systems.
    66As shown in Figure \ref{fig:xerces-arch}, Xerces
    77is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator.
    8 The {\it Transcoder} converts all input data into UTF16; all text run through this module before
    9 being processed as XML. The majority of the character set encoding validation is performed
    10 as a byproduct of this process.
    11 The {\it Reader} is responsible for the streaming and buffering of all raw and transposed text;
     8The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML;
     9the majority of the character set encoding validation is performed as a byproduct of this process.
     10The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text;
    1211it keeps track of the current line/column of the cursor (which is reported to the end user in
    13 the unlikely event that the input file contains an error), performs all line-break normalization
     12the unlikely event that the input contains an error), performs all line-break normalization
    1413and validates context-specific character set issues, such as tokenization of qualified-names and
    15 ensuring each character is legal w.r.t. the XML specification.
     14ensures each character is legal w.r.t. the XML specification.
    1615The {\it Scanner} pulls data through the reader and constructs the intermediate (and near-final)
    1716representation of the document; it deals with all issues related to entity expansion, validates
    18 the XML wellformedness constraints and any character set encoding issues that cannot
     17the XML well-formedness constraints and any character set encoding issues that cannot
    1918be completely handled by the reader or transcoder (e.g., surrogate characters, validation
    2019and normalization of character references, etc.)
    2120The {\it Namespace Binder}, which is a core piece of their element stack, is primarily tasked
    22 with handling all namespace scoping issues between different XML vocabularies and faciliates
     21with handling namespace scoping issues between different XML vocabularies and faciliates
    2322the scanner with the construction and utilization of Schema grammar structures.
    2423The {\it Validator} takes the intermediate representation produced by the Scanner (and
    2524potentially annotated by the Namespace Binder) and assesses whether the final output matches
    26 the user-defined DTD and Schema grammar(s) before passing the data to the end-user.
     25the user-defined DTD and Schema grammar(s) before passing the information to the end-user.
    2726
    2827\begin{figure}
    2928\begin{center}
    3029\includegraphics[width=0.15\textwidth]{plots/xerces.pdf}
     30\caption{Xerces Architecture}
    3131\label{fig:xerces-arch}
    32 \caption{Xerces Architecture}
    3332\end{center}
    3433\end{figure}
    3534
    36 In ICXML functions are grouped into logical components.
     35In \icXML{} functions are grouped into logical components.
    3736As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
    3837All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
     
    4645the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
    4746Intra-element well-formedness validation is performed as an artifact of this process.
    48 Like Xerces, ICXML must provide the Line and Column position of each error.
     47Like Xerces, \icXML{} must provide the Line and Column position of each error.
    4948The {\it Line-Column Tracker} uses the lexical information to keep track of the cursor position(s) through the use of an
    50 optimized population count algorithm. This is described in Section \ref{section:arch:errorhandling}.
    51 
    52 
    53 
     49optimized population count algorithm; this is described in Section \ref{section:arch:errorhandling}.
    5450From here, two major data-independent branches remain: the {\bf symbol resolver} and the {\bf content stream generator}.
    5551% The output of both are required by the \MP{}.
    56 Apart from the use of the Parabix framework, one of the core differences between ICXML and Xerces is the use of symbols.
    57 A typical XML document will contain relatively few unique element and attribute names but each of them will occur
    58 frequently throughout the document.
    59 Each name is represented by a distinct symbol structure and global identifier (GID).
     52Apart from the Parabix framework, another core difference between Xerces and \icXML{} is the use of symbols.
     53A typical XML document will contain relatively few unique element and attribute names---but each of them will occur frequently throughout the document.
     54In \icXML{}, names are represented by distinct symbol structures and global identifiers (GIDs).
    6055Using the information produced by the parallel markup parser, the {\it Symbol Resolver} uses a bitscan intrinsic to
    6156iterate through a symbol bit stream (64-bits at a time) to generate a set of GIDs.
     
    6459the lookup cost in the validation process.
    6560The final component of the \PS{} is the {\it Content Stream Generator}. This component has a multitude of
    66 responsibilities, which will be discussed in Section \ref{sec:parfilter}, but the primary function of this is to produce
    67 output-ready UTF-16 content for the \MP{}.
     61responsibilities, which will be discussed in Section \ref{sec:parfilter}, but its primary function is to produce
     62near-final UTF-16 content.
    6863
    69 Everything in the \MP{} uses a compressed representation of the document, generated by the
    70 symbol resolver and content stream generator, to produce and validate the sequential (state-dependent) output.
     64The {\it \MP{}} parses a compressed representation of the XML document, generated by the
     65symbol resolver and content stream generator, to validate and produce the final (sequential) output for the end user.
    7166The {\it WF checker} performs all remaining inter-element wellformedness validation that would be too costly
    7267to perform in bitspace, such as ensuring every start tag has a matching end tag.
    7368The {\it Namespace Processor} replaces Xerces's namespace binding functionality. Unlike Xerces,
    74 this is performed as a discrete phase and simply produces a set of URI identifiers (URIIDs), to
    75 be associated with each instance of a symbol.
     69this is performed as a discrete phase and simply produces a set of URI identifiers (URI IDs), to
     70be associated with each occurrence of a symbol.
    7671This is discussed in Section \ref{section:arch:namespacehandling}.
    7772The final {\it Validation} process is responsible for the same tasks as Xerces's validator, however,
     
    8075\begin{figure}
    8176\includegraphics[width=0.50\textwidth]{plots/icxml.pdf}
     77\caption{\icXML{} Architecture}
    8278\label{fig:icxml-arch}
    83 \caption{ICXML Architecture}
    8479\end{figure}
    8580
  • docs/Working/icXML/icxml-main.tex

    r2490 r2496  
    3232\usepackage{graphicx}
    3333\usepackage{CJKutf8}
     34\usepackage{morefloats}
    3435\begin{document}
    3536
     
    4142\preprintfooter{short description of paper}   % 'preprint' option specified.
    4243
    43 \title{ICXML:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
     44\def \icXML {icXML}
     45\def \PS {Parabix Subsystem}
     46\def \MP {Markup Processor}
     47
     48\title{\icXML{}:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
    4449%\subtitle{Subtitle Text, if any}
    4550\authorinfo{Anonymous Hackers}
    46 {}
    47 {}
     51
    4852% \authorinfo{Nigel Medforth \and Dan Lin \and Kenneth S. Herdy \and Arrvindh Shriraman \and Robert D. Cameron }
    4953%            {International Characters, Inc., and Simon Fraser University}
     
    5155
    5256\maketitle
    53 
    54 \def \icXML {icXML}
    55 \def \PS {Parabix Subsystem}
    56 \def \MP {Markup Processor}
    5757
    5858\begin{abstract}
     
    118118the structure of the Xerces and Parabix XML parsers and the fundamental
    119119differences between the two parsing models.   Section 3 then presents
    120 the icXML design based on a restructured Xerces architecture to
     120the \icXML{} design based on a restructured Xerces architecture to
    121121incorporate SIMD parallelism using Parabix methods.   Section 4 presents a performance
    122122study demonstrating substantial end-to-end acceleration of
    123123a GML-to-SVG translation application written against the Xerces API.
    124 Section 5 moves on to consider the multithreading of the icXML architecture
     124Section 5 moves on to consider the multithreading of the \icXML{} architecture
    125125using the pipeline parallelism model.  Section 6 concludes the
    126126paper with a discussion of future work and the potential for
  • docs/Working/icXML/multithread.tex

    r2473 r2496  
    11\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
    2 \subsection{Pipeline Strategy for ICXML}
    3 As discussed in section \ref{}, Xerces can be considered as a complex finite-state machine.
    4 Finite-state machine belongs to the hardest application class to parallelize and process efficiently
    5 among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}.
    6 However, ICXML reconstructs Xerces and provides logical layers between modules,
     2\subsection{Pipeline Strategy for \icXML{}}
     3% As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine
     4% Finite-state machine belongs to the hardest application class to parallelize and process efficiently
     5% among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}.
     6% However, \icXML{} reconstructs Xerces and provides logical layers between modules,
     7% which naturally enables pipeline parallel processing.
     8
     9As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine,
     10the hardest type of application to parallelize and process efficiently \cite{Asanovic:EECS-2006-183}.
     11However, \icXML{} provides logical layers between modules,
    712which naturally enables pipeline parallel processing.
    813
     
    1116In this case, the first thread $T_1$ will read 16k of XML input $I$ at a time
    1217and process all the modules in Parabix Subsystem to generates
    13 content buffer, symbol array, URI array, context ID array and store them to a pre-allocated shared data structure $S$.
    14 The second thread $T_2$ reads the shared data provided by the first thread and
     18% content buffer, symbol array, URI array, context ID array and store them to a pre-allocated shared data structure $S$.
     19content buffer, symbol array, URI array, and store them to a pre-allocated shared data structure $S$.
     20The second thread $T_2$ consumes the data provided by the first thread and
    1521goes through all the modules in Markup Processor and writes output $O$.
    1622
     
    2632
    2733\begin{figure}
    28 \includegraphics[width=0.50\textwidth]{plots/threads_timeline1.pdf}
     34\includegraphics[width=0.45\textwidth]{plots/threads_timeline1.pdf}
    2935\caption{}
    3036\label{threads_timeline1}
    3137\end{figure}
     38\clearpage
    3239
    3340\begin{figure}
    34 \includegraphics[width=0.50\textwidth]{plots/threads_timeline2.pdf}
     41\includegraphics[width=0.45\textwidth]{plots/threads_timeline2.pdf}
    3542\caption{}
    3643\label{threads_timeline2}
    3744\end{figure}
     45\clearpage
    3846
    3947\subsection{Performance Comparison}
     
    4250\begin{figure}
    4351\begin{center}
    44 \includegraphics[width=0.50\textwidth]{plots/single-multi-thread.pdf}
     52\includegraphics[width=0.45\textwidth]{plots/single-multi-thread.pdf}
    4553\caption{Performance comparison of single-thread vs. multithread without namespace}
    4654\label{single-multi-thread}
     
    4957\end{figure}
    5058\begin{figure}
    51 \includegraphics[width=0.50\textwidth]{plots/single-multi-thread_ns.pdf}
     59\includegraphics[width=0.45\textwidth]{plots/single-multi-thread_ns.pdf}
    5260\caption{Performance comparison of single-thread vs. multithread with namespace}
    5361\label{single-multi-thread_ns}
     
    5664\begin{figure}
    5765\begin{center}
    58 \includegraphics[width=0.50\textwidth]{plots/threads_comp.pdf}
     66\includegraphics[width=0.45\textwidth]{plots/threads_comp.pdf}
    5967\caption{Performance comparison of the two threads without namespace}
    6068\label{threads_comp}
     
    6371\end{figure}
    6472\begin{figure}
    65 \includegraphics[width=0.50\textwidth]{plots/threads_comp_ns.pdf}
     73\includegraphics[width=0.45\textwidth]{plots/threads_comp_ns.pdf}
    6674\caption{Performance comparison of the two threads with namespace}
    6775\label{threads_comp_ns}
Note: See TracChangeset for help on using the changeset viewer.