 r2471 ICXML is more than an optimized version of Xerces. Many components were grouped, restructured, and rearchitected with pipeline parallelism in mind. In this section, we highlight the core differences between the two systems and discuss how their designs differ. As shown in Figure \ref{fig:xerces-arch}, Xerces comprises five main modules: the transcoder, reader, scanner, namespace binder, and validator.
 r2471 \subsection{The Parabix Framework} \label{background:parabix}

\begin{figure*}[tbhp]
\begin{center}
\begin{tabular}{cr}\\
Source Data & \verbtextmore\\
Tag Openers & \verb`1_____1_______1____1___________________________1____1_______________1______`\\
Start Tag Marks & \verb`_1_____1____________1________________________________1_____________________`\\
End Tag Marks & \verb`_______________1________________________________1____________________1_____`\\
Element Names & \verb`_1111__11___________11_______________________________1111__________________`\\
Att Names & \verb`_______________________11_______11________________________1111_____________`\\
Att Values & \verb`__________________________11111______11111_____________________111_________`
\end{tabular}
\end{center}
\caption{XML Source Data and Derived Parallel Bit Streams}
\label{fig:parabix1}
\end{figure*}

The Parabix (parallel bit stream) framework is a transformative approach to XML parsing that exploits the wide (e.g., 128-bit) SIMD registers in commodity processors to represent long blocks of input data, using one register bit per input byte. To facilitate this, the input data is first transposed into a set of basis bit streams. In Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb|<|A}'' is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$.
% The bits used to construct $\tt b_7$ have been highlighted in this example.
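The transposition step can be illustrated with a minimal Python sketch. It is not the Parabix implementation, which works on SIMD registers; the function name \verb|transpose_to_basis_streams| is hypothetical, and the sketch assumes that $\tt b_0$ holds the most significant bit of each byte, the convention suggested by the character-class formulas in this section.

```python
def transpose_to_basis_streams(data):
    """Return eight bit strings b_0..b_7, one character per input byte.

    Assumption: b_0 is the most significant bit of each byte and
    b_7 the least significant bit.
    """
    streams = []
    for k in range(8):
        shift = 7 - k  # b_0 reads bit 7 of each byte, b_7 reads bit 0
        streams.append("".join(str((byte >> shift) & 1) for byte in data))
    return streams


# The four-character example string used in the text.
basis = transpose_to_basis_streams(b"b7<A")
for k, stream in enumerate(basis):
    print("b_%d = %s" % (k, stream))
```

Each output row has one bit per input byte, so a 128-byte block would yield eight 128-bit streams, each of which fits in a single 128-bit SIMD register.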
Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.} are used to classify the input bits into a set of {\it character-class bit streams}, which identify key characters (or groups of characters) with a $1$. Character-class bit streams allow us to perform 128 comparisons in parallel by using a series of boolean-logic operations to merge multiple basis bit streams into a single character-class stream. For example, one of the fundamental characters in XML is the left-angle bracket. A character is a ``\verb|<|'' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. Classes of characters can be found with similar formulas. For example, a character is numeric ({\tt [0-9]}) if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6)) = 1$.

\begin{figure}[hp]
\begin{center}
\begin{tabular}{r c c c c }
\end{tabular}
\end{center}
\end{figure}

An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute than individual characters.
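The two formulas above can be checked with a short Python sketch. This is an illustration under stated assumptions rather than the Parabix code: arbitrary-precision integers stand in for SIMD registers, bit $i$ of each integer corresponds to input byte $i$, $\tt b_0$ is taken to be the most significant bit of each byte, and all function names are hypothetical.

```python
def basis_bit_streams(data):
    """Pack bit k of every byte into the integer b[k].

    Assumptions: bit i of b[k] belongs to data[i], and b[0] holds the
    most significant bit of each byte.
    """
    b = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                b[k] |= 1 << i
    return b


def char_class_streams(data):
    """Return (left_angle, digit) streams computed from the two formulas."""
    b = basis_bit_streams(data)
    mask = (1 << len(data)) - 1  # confines NOT to the positions in the block

    def NOT(x):
        return ~x & mask

    left_angle = NOT(b[0] | b[1]) & (b[2] & b[3] & b[4] & b[5]) & NOT(b[6] | b[7])
    digit = NOT(b[0] | b[1]) & (b[2] & b[3]) & NOT(b[4] & (b[5] | b[6]))
    return left_angle, digit


def show(stream, n):
    """Render a stream left to right, with '_' for 0 as in the figures."""
    return "".join("1" if (stream >> i) & 1 else "_" for i in range(n))


text = b"<a1 x='42'>"
left_angle, digit = char_class_streams(text)
print(text.decode("ascii"))
print(show(left_angle, len(text)))  # 1__________
print(show(digit, len(text)))       # __1____11__
```

Every boolean operation here acts on all positions of the block at once, which is the property the SIMD implementation relies on.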
Finding optimal solutions to all character classes is non-trivial and goes beyond the scope of this paper.

% Using a mixture of boolean-logic and arithmetic operations, character-class
% bit streams can be transformed into lexical bit streams, where the presence of
% process, intra-element well-formedness validation is performed on each block
% of text.

Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}.
 r2429 \subsection{Xerces C++ Structure} \label{background:xerces}

The Xerces C++ parser is a widely-used standards-conformant XML parser produced as open-source software by the Apache Software Foundation. It features comprehensive support for a variety of character encodings, both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), and support for multiple XML vocabularies through the XML namespace mechanism. Xerces also supports several APIs for accessing parser services, including event-based parsing using either pull parsing or SAX/SAX2 push-style parsing, as well as a DOM tree-based parsing interface.
 r2478 \section{Background} \label{background} \input{background-xerces}
