Changeset 1003 for docs/PACT2011

Timestamp:
Mar 25, 2011, 4:59:09 PM (8 years ago)
Message:

Edits.

Location:
docs/PACT2011
Files:
2 edited

 r991 \section{Background} \label{section:background} This section provides a brief overview of XML and traditional and parallel XML processing technology, and describes the key design and performance aspects of successive generations of the Parabix parallel XML processing technology. % clunky sounding... This section provides a brief overview of XML and traditional and parallel XML processing technology. Section \ref{section:reserach} describes the key design and performance aspects of both generations of the Parabix parallel XML processing technology. \subsection{XML} In 1998, the W3C officially adopted XML as a standard. XML is a platform-independent data interchange format. The defining characteristics of XML are that it can represent virtually any type of information through the use of self-describing markup tags and can easily store semi-structured data in a descriptive fashion. XML markup encodes a description of an XML document's storage layout and logical structure. Because XML was intended to be human-readable, XML markup tags are often verbose by design \cite{TR:XML}. For example, a typical XML file could be: \end{figure} % éšä»¶ % can't represent in verbose and not really sure if the google auto-translator is correct % éšä»¶ % can't represent in verbose and not really sure if the google auto-translator is correct XML files can be classified as ``document-oriented'' or ``data-oriented'' \cite{DuCharme04}. Document-oriented XML is designed to be human-readable, as in Figure \ref{fig:sample_xml}; data-oriented XML files are intended to be parsed by machines and omit any ``human-friendly'' formatting techniques, such as the use of whitespace and descriptive ``natural language'' naming schemes.  Although the XML specification does not distinguish between ``XML for documents'' and ``XML for data'' \cite{TR:XML}, the latter often requires the use of an XML parser to extract the information within. 
The role of an XML parser is to transform the text-based XML data into an application-ready format. XML files can be classified as ``document-oriented'' or ``data-oriented'' \cite{DuCharme04}. Document-oriented XML is designed to be human-readable, as in Figure \ref{fig:sample_xml}; data-oriented XML files are intended to be parsed by machines and omit any ``human-friendly'' formatting techniques, such as the use of whitespace and descriptive ``natural language'' naming schemes.  Although the XML specification itself does not distinguish between ``XML for documents'' and ``XML for data'' \cite{TR:XML}, the latter often requires the use of an XML parser to extract the information within. The role of an XML parser is to transform the text-based XML data into an application-ready format. %For example, an XML parser for a web browser may take an XML file, apply a style sheet to it, and display it to the end user in an attractive yet informative way; an XML database parser may take an XML file and construct indexes and/or compress the tree into a proprietary format to provide the end user with efficient relational, hierarchical, and/or object-based query access to it. % However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}. Traditional XML parsers process XML sequentially, a single byte at a time. Following this approach, an XML parser processes a source document serially, from the first to the last byte in the source file, in a top-down manner. Each character of text is examined to distinguish between the XML-specific markup, such as an opening angle bracket `<', and the content held within the document. As the parser moves through the source document, it alternates between markup scanning and data validation and processing operations. At each processing step, the parser scans the source document and either locates the expected markup or reports an error condition and terminates. 
% not happy with the phrasing of this line Traditional XML parsers process XML sequentially, a single byte at a time. Following this approach, an XML parser processes a source document serially, from the first to the last byte in the source file, in a top-down manner. Each character of text is examined to distinguish between the XML-specific markup, such as an opening angle bracket `<', and the content held within the document. The current character that the parser is processing is referred to as its cursor position. As the parser moves the cursor through the source document, it alternates between markup scanning and data validation and processing operations. At each processing step, the parser scans the source document and either locates the expected markup or reports an error condition and terminates. In other words, traditional XML parsers are complex finite-state machines that use byte comparisons to transition between data and metadata states. Each state transition indicates the context in which to interpret the subsequent characters. Unfortunately, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}; thus any character could trigger a state transition until determined otherwise. Expat and Xerces-C are popular byte-at-a-time sequential parsers. Both are C/C++-based open-source XML parsers. Expat was originally released in 1998; it is currently used in Mozilla Firefox and Open Office \cite{expat}. Xerces-C was released in 1999 and is the foundation of the Apache XML project \cite{xerces}. The major disadvantage of the byte-at-a-time sequential approach to XML parsing is that each character incurs at least one conditional branch. The cumulative effect of branch misprediction penalties is known to degrade parsing performance in proportion to the markup density of the source document \cite{CameronHerdyLin2008} (i.e., the proportion of XML markup to XML data). 
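To make the per-character cost concrete, the following is a minimal Python sketch of a byte-at-a-time scanner. It is not taken from Expat or Xerces-C: the function name \texttt{classify\_bytes} and the two-state model (inside markup versus character data) are assumptions made purely for illustration, and a real parser tracks many more states.

```python
def classify_bytes(data: bytes) -> str:
    """Label each byte 'M' (inside markup) or 'D' (character data).

    Hypothetical two-state scanner used only to illustrate the
    byte-at-a-time style; it performs no well-formedness checking.
    """
    in_markup = False
    labels = []
    for byte in data:              # the cursor advances one byte per iteration
        if byte == ord("<"):       # data-dependent branch: markup opens here
            in_markup = True
        labels.append("M" if in_markup else "D")
        if byte == ord(">"):       # data-dependent branch: markup closes here
            in_markup = False
    return "".join(labels)


labels = classify_bytes(b"<t1>abc</t1><t2/>")
```

Every iteration executes at least one comparison whose outcome depends on the input byte, which is the source of the branch misprediction cost discussed above.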
Expat and Xerces-C are popular byte-at-a-time sequential parsers. Both are C/C++-based open-source XML parsers. Expat was originally released in 1998; it is currently used in Mozilla Firefox and Open Office \cite{expat}. Xerces-C was released in 1999 and is the foundation of the Apache XML project \cite{xerces}. For example, the main loop of the Xerces-C well-formedness scanner contains: \subsection{Parallel XML Parsing} \begin{verbatim} XXXXXXXXXX   XERCES CODE   XXXXXXXXXX \end{verbatim} In general, parallel XML acceleration methods come in one of two forms: multithreaded approaches and SIMDized techniques. Multithreaded XML parsers take advantage of multiple cores by first quickly preparsing the XML file to locate key partitioning points. The XML workload is then divided and processed independently across the available cores \cite{ZhangPanChiu09}. A join step typically follows. SIMD XML parsers leverage the SIMD registers to overcome the performance limitations of the byte-at-a-time sequential paradigm and inherent data-dependent branch misprediction rates \cite{Cameron2010}. The SIMDized XML parsers, Parabix1 and Parabix2, both utilize parallel bit stream processing technology. With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit, 256-bit, etc.). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel \cite{CameronLin2009}. The major disadvantage of the byte-at-a-time sequential approach to XML parsing is that each character incurs at least one conditional branch. 
The cumulative effect of branch misprediction penalties is known to degrade parsing performance in proportion to the markup density of the source document \cite{CameronHerdyLin2008} (i.e., the proportion of XML markup to XML data). \subsubsection{Parabix1} \subsection {Parallel XML Parsing} Our first generation parallel bitstream XML parser, Parabix1, takes a less conventional approach, using SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one correspondence with the bytes of a character stream. As mentioned, a transposition step first transforms sequential byte stream data into eight basis bitstreams for the bits of each byte. Bitwise logical combinations of these basis bitstreams are then used to classify bytes in various ways, while the bit scan operations common to commodity processors are used for fast sequential scanning. At a high level, Parabix1 processes source XML in a manner functionally equivalent to that of a traditional processor. That is, Parabix1 moves sequentially through the source document, maintaining a single cursor scanning position throughout the parse. However, this scanning operation itself is accelerated significantly, which leads to dramatic performance improvements, since bit scan operations can perform up to general register width (32-bit, 64-bit) finite state transitions per clock cycle. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, CameronLin2009, Cameron2010}. In general, parallel XML acceleration methods come in one of two forms: multithreaded approaches and SIMD-ized techniques. Multithreaded XML parsers take advantage of multiple cores by first quickly preparsing the XML file to locate key partitioning points. 
The XML workload is then divided and processed independently across the available cores \cite{ZhangPanChiu09}. A serial join step typically follows. SIMD XML parsers leverage the SIMD registers to overcome the performance limitations of the byte-at-a-time sequential processing paradigm and inherent data-dependent branch misprediction rates \cite{Cameron2010}. SIMD instructions allow the processor to perform the same operation on multiple pieces of data simultaneously. To our knowledge, the only SIMD-based XML parsers are Parabix1 and Parabix2, both of which were designed and developed by Cameron et al. \cite{CameronHerdyLin2008}. We discuss both versions of Parabix in Section \ref{section:reserach}. \subsubsection{Parabix2} \subsection {SIMD Operations} In our second generation XML parser, Parabix2, we replace sequential parsing based on bit scan instructions with a parallel parsing method based on bitstream addition. Unlike the single-cursor approach of Parabix1 (and, conceptually, of the traditional sequential approach), Parabix2 processes multiple cursor positions in parallel. To deal with these parallel cursors, three additional categories of bitstreams are introduced. Marker bitstreams are used to represent positions of interest in the parsing of a source data stream \cite{Cameron2010}. The appearance of a 1 at a position in a marker bitstream could, for example, denote the starting position of an XML tag in the data stream. In general, the set of bit positions in a marker bitstream may be considered to be the current parsing positions of multiple parses taking place in parallel throughout the source data stream. A further aspect of the parallel method is that the conditional branch statements used to identify syntax errors at each parsing position are eliminated. Instead, error bitstreams are used to identify the position of parsing or well-formedness errors during the parsing process. 
Error positions are gathered and processed in a final post-processing step. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple-cursor parsing and significantly reduces branch misprediction penalties. % Two such SIMD XML parsers, Parabix1 and Parabix2, utilize parallel bit stream processing technology. % Extract section 2.2 and merge into 3.   Add a new subsection % in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical % operations and then mention the pack operations that allow % us to implement transposition efficiently in parallel. % Also note that the SIMD registers support bitwise logic across % their full width and that this is extensively used in our work. % \subsection{Parallel XML Parsing} % % Parallel XML processing generally comes in one of two forms: multithreading and SIMD. Multithreaded XML parsers take advantage of parallelism by first quickly preparsing the XML file to locate the key markup entities and determine the best workload distribution with which to process the XML file using $n$ cores \cite{ZhangPanChiu09}. SIMD XML parsers leverage the SIMD registers to overcome the performance limitations of the sequential paradigm and inherently data-dependent branch misprediction rates \cite{Cameron2010}. Two such SIMD XML parsers, Parabix1 and Parabix2, utilize parallel bit stream processing technology. With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit, 256-bit, etc.). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel \cite{CameronLin2009}. 
% % \subsubsection{Parabix1} % % Our first generation parallel bitstream XML parser, Parabix1, employs a less conventional approach, using SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one correspondence with the bytes of a character stream. A transposition step first transforms sequential byte stream data into eight basis bitstreams for the bits of each byte. Bitwise logical combinations of these basis bitstreams can then be used to classify bytes in various ways, while the bit scan operations common to commodity processors can be used for fast sequential scanning. At a high level, Parabix1 processes source XML in a manner functionally equivalent to that of a traditional processor. That is, Parabix1 moves sequentially through the source document, maintaining a single cursor scanning position throughout the parse. However, this scanning operation itself is accelerated significantly, which leads to dramatic performance improvements, since bit scan operations can perform up to general register width (32-bit, 64-bit) finite state transitions per clock cycle. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, CameronLin2009, Cameron2010}. % % \subsubsection{Parabix2} % % In our second generation XML parser, Parabix2, we replace sequential parsing based on bit scan instructions with a parallel parsing method based on bitstream addition. Unlike the single-cursor approach of Parabix1 (and, conceptually, of the traditional sequential approach), Parabix2 processes multiple cursor positions in parallel. To deal with these parallel cursors, three additional categories of bitstreams are introduced. Marker bitstreams are used to represent positions of interest in the parsing of a source data stream \cite{Cameron2010}. 
The appearance of a 1 at a position in a marker bitstream could, for example, denote the starting position of an XML tag in the data stream. In general, the set of bit positions in a marker bitstream may be considered to be the current parsing positions of multiple parses taking place in parallel throughout the source data stream. A further aspect of the parallel method is that the conditional branch statements used to identify syntax errors at each parsing position are eliminated. Instead, error bitstreams are used to identify the position of parsing or well-formedness errors during the parsing process. Error positions are gathered and processed in a final post-processing step. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple-cursor parsing and significantly reduces branch misprediction penalties. %
 r954 \section{Parabix} \label{section:reserach} Describe key technology behind Parabix Introduce SIMD; Talk about SSE Highlight which SSE instructions are important Talk about each pass in the parser; How SSE is used in every phase... Benefits of SSE in each phase. The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0. Our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte. This expectation is based on the relatively branchless code composition of Parabix2 and the more efficient utilization of last-level cache resources. The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption. Hence, one of the foci in our study is the manner in which straight-line SIMD code influences energy usage. %Describe key technology behind Parabix %Introduce SIMD; %Talk about SSE %Highlight which SSE instructions are important %Talk about each pass in the parser; How SSE is used in every phase... %Benefits of SSE in each phase. % Extract section 2.2 and merge into 3.   Add a new subsection % in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical % operations and then mention the pack operations that allow % us to implement transposition efficiently in parallel. % Also note that the SIMD registers support bitwise logic across % their full width and that this is extensively used in our work. % % Also, it could be good to have a small excerpt of a byte-at-a-time % scanning loop for XML, e.g., extracted from Xerces in section 2.1. % Just a few lines showing the while loop - Linda can tell you the file. 
% % This section focuses on the % With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit, 256-bit, etc.). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel \cite{CameronLin2009}. % The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0. % Our expectation is that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte. % This expectation is based on the relatively branchless code composition of Parabix2 and the more efficient utilization of last-level cache resources. % The authors of \cite {bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption. % Hence, one of the foci in our study is the manner in which straight-line SIMD code influences energy usage. \subsection{Parabix1} % Our first generation parallel bitstream XML parser, Parabix1, employs a less conventional approach, using SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one correspondence with the bytes of a character stream. A transposition step first transforms sequential byte stream data into eight basis bitstreams for the bits of each byte. At a high level, Parabix1 processes source XML in a manner functionally equivalent to that of a traditional processor. That is, Parabix1 moves sequentially through the source document, maintaining a single cursor position throughout the parsing process. 
Where Parabix1 differs from the traditional parser is that it scans for key markup characters using a series of basis bitstreams. A bitstream is simply a sequence of $0$s and $1$s, with one bit in the bitstream for each character in a source data stream. A basis bitstream is a bitstream that consists of only transposed textual XML data. In other words, a source character consisting of $M$ bits can be represented with $M$ bitstreams, and by utilizing $M$ SIMD registers of width $W$, it is possible to scan through $W$ characters in parallel. The register width $W$ is 64 bits for MMX, 128 bits for SSE, and 256 bits for AVX. Figure \ref{fig:inputstreams} presents an example of how we represent 8-bit ASCII characters using eight bitstreams. $B_0 \ldots B_7$ are the individual bitstreams. The $0$ bits in the bitstreams are represented by periods, so that the $1$ bits stand out.

\begin{figure}[h]
\begin{center}
\begin{tabular}{cr}\\
source data $\vartriangleright$ & \verb`<t1>abc</t1><t2/>`\\
$B_0$ & \verb`..1.1.1.1.1....1.`\\
$B_1$ & \verb`...1.11.1..1..111`\\
$B_2$ & \verb`11.1...111.111.11`\\
$B_3$ & \verb`1..1...11..11..11`\\
$B_4$ & \verb`1111...1.111111.1`\\
$B_5$ & \verb`11111111111111111`\\
$B_6$ & \verb`.1..111..1...1...`\\
$B_7$ & \verb`.................`\\
\end{tabular}
\end{center}
\caption{Parallel Bitstream Example}
\label{fig:inputstreams}
\end{figure}

In order to represent the byte-oriented character data as parallel bitstreams, the source data is first loaded in sequential order and converted into its transposed representation through a series of packs, shifts, and bitwise operations. Using the SIMD capabilities of current commodity processors, this transposition of source data to bitstreams incurs an amortized overhead of about 1 CPU cycle per byte \cite{CameronHerdyLin2008}. When parsing, we need to consider multiple properties of characters at different stages during the process. 
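As an illustration of what the transposition produces, the sketch below computes the eight basis bitstreams for the 17-character text \verb|<t1>abc</t1><t2/>|, whose bits match the rows of Figure \ref{fig:inputstreams}. It is a scalar model only, not the SIMD pack-based transposition used by Parabix: the helper names \texttt{transpose} and \texttt{show} are hypothetical, Python integers stand in for SIMD registers, and the numbering of $B_0$ as the least significant bit of each byte is inferred from the figure rather than stated in the text.

```python
def transpose(data: bytes) -> list[int]:
    """Return eight integers; bit i of basis[k] equals bit k of data[i].

    Scalar stand-in for the SIMD transposition: one loop iteration per
    bit instead of packs and shifts over whole registers.
    """
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                basis[k] |= 1 << i
    return basis


def show(stream: int, length: int) -> str:
    """Render a bitstream with position 0 on the left and '.' for 0 bits."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(length))


text = b"<t1>abc</t1><t2/>"
basis = transpose(text)
rows = [show(b, len(text)) for b in basis]  # rows[k] corresponds to B_k
```

Each entry of \texttt{rows} can be compared directly with the corresponding row of the figure.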
Using the basis bitstreams, it is possible to combine them using bitwise logic in order to compute character-class bitstreams; that is, streams that identify the positions at which characters belonging to a specific character class occur. For example, an ASCII character is an open angle bracket `<' if and only if $B_2 \land \ldots \land B_5 = 1$ and the other basis bitstreams are $0$ at the same position within the basis bitstreams. Once these character-class bitstreams are created, bit-scan operations, common to commodity processors, can be used for sequential markup scanning and data validation operations. A common operation in all XML parsers is identifying the start tags (`<') and their accompanying end tags (either ``/>'' or ``>'', depending on whether the element tag is an empty-element tag or not, respectively).

\begin{figure}[h]
\begin{center}
\begin{tabular}{lr}\\
source data $\vartriangleright$ & \verb`<t1>abc</t1><t2/>`\\
% $N =$ name chars & \verb`.11.111.11...11..`\\
$M_0 = 1$ & \verb`1................`\\
$M_1 = advance(M_0)$ & \verb`.1...............`\\
$M_2 = bitscan('>')$ & \verb`...1.............`\\
$M_3 = advance(M_2)$ & \verb`....1............`\\
$M_4 = bitscan('<')$ & \verb`.......1.........`\\
$M_5 = bitscan('/')$ & \verb`..........1......`\\
$M_6 = advance(M_5)$ & \verb`...........1.....`\\
$M_7 = bitscan('<')$ & \verb`.............1...`\\
$M_8 = bitscan('/')$ & \verb`...............1.`\\
$M_9 = advance(M_8)$ & \verb`................1`\\
% $M_2 \lor M_6 \lor M_9$    & \verb`...1.......1....1`\\
\end{tabular}
\end{center}
\caption{Parabix1 Start and End Tag Identification}
\label{fig:Parabix1StarttagExample}
\end{figure}

Unlike traditional parsers, these sequential operations are accelerated significantly since bit scan operations can perform up to $W$ finite state transitions per clock cycle. 
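The two steps just described, computing a character-class bitstream and then scanning it with a single cursor, can be sketched as follows. This is an illustrative model under stated assumptions, not the Parabix1 implementation: \texttt{transpose}, \texttt{left\_angle\_stream}, \texttt{bitscan} and \texttt{all\_positions} are hypothetical helpers, Python integers stand in for registers, and $B_0$ is taken to be the least significant bit of each byte.

```python
TEXT = b"<t1>abc</t1><t2/>"


def transpose(data: bytes) -> list[int]:
    """Scalar stand-in for transposition: bit i of basis[k] is bit k of data[i]."""
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                basis[k] |= 1 << i
    return basis


def left_angle_stream(basis: list[int], length: int) -> int:
    """Character-class stream for '<': B2..B5 are 1 and B0, B1, B6, B7 are 0."""
    mask = (1 << length) - 1
    must_be_one = basis[2] & basis[3] & basis[4] & basis[5]
    must_be_zero = basis[0] | basis[1] | basis[6] | basis[7]
    return must_be_one & ~must_be_zero & mask


def bitscan(stream: int, cursor: int) -> int:
    """Return the first set position at or after cursor, or -1 if none."""
    remaining = stream >> cursor
    if remaining == 0:
        return -1
    lowest_set = remaining & -remaining      # isolate the lowest 1 bit
    return cursor + lowest_set.bit_length() - 1


def all_positions(stream: int) -> list[int]:
    """Walk a single cursor through the stream, one bit scan per hit."""
    found = []
    cursor = bitscan(stream, 0)
    while cursor != -1:
        found.append(cursor)
        cursor = bitscan(stream, cursor + 1)
    return found


lt = left_angle_stream(transpose(TEXT), len(TEXT))
lt_positions = all_positions(lt)
```

The loop in \texttt{all\_positions} iterates once per `<' rather than once per byte, which is where the single-cursor approach gains over byte-at-a-time scanning.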
This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, CameronLin2009, Cameron2010}. % In section 3, we should try to explain a bit more detail of the % operation.   Under Parabix 1, a little bit on transposition % and calculation of the [<] bitstream would be good, perhaps % using the examples from the 2010 Technical Report or EuroPar submission. \subsection{Parabix2} % Under Parabix 2 a little discussion of bitwise addition for % scanning, perhaps again excerpted from the TR/EuroPar submission % would be good. %In Parabix2, we replace the sequential single-cursor parsing using bit scan instructions with a parallel parsing method using bitstream addition. Unlike the single-cursor approach of Parabix1 (and, conceptually, of the traditional sequential approach), Parabix2 processes multiple cursor positions in parallel. In Parabix2, we replace the sequential single-cursor parsing using bit scan instructions with a parallel parsing method using bitstream addition. Unlike the single-cursor approach of Parabix1 (and conceptually of all sequential XML parsers), Parabix2 processes multiple cursors in parallel. For example, using the source data from Figure \ref{fig:Parabix1StarttagExample}, Figure \ref{fig:Parabix2StarttagExample} shows how Parabix2 identifies and moves each of the start tag markers forward to the corresponding end tag. 
As with Parabix1, we assume that $N$ (the name chars) has been computed using the basis bitstreams.

\begin{figure}[h]
\begin{center}
\begin{tabular}{lr}\\
source data $\vartriangleright$ & \verb`<t1>abc</t1><t2/>`\\
$N =$ name chars & \verb`.11.111.11...11..`\\
$M_0 = [<]$ & \verb`1......1....1....`\\
$M_1 = \texttt{advance}(M_0)$ & \verb`.1......1....1...`\\
$M_2 = \texttt{scanto}('/','>')$ & \verb`...1......1....1.`\\
$M_3 = \texttt{scanto}('>')$ & \verb`...1.......1....1`
\end{tabular}
\end{center}
\caption{Parabix2 Start and End Tag Identification}
\label{fig:Parabix2StarttagExample}
\end{figure}

In general, the set of bit positions in a marker bitstream may be considered to be the current parsing positions of multiple parses taking place in parallel throughout the source data stream. A further aspect of the parallel method is that the conditional branch statements used to identify syntax errors at each parsing position are eliminated. Although we do not show it in the prior examples, error bitstreams can be used to identify any well-formedness errors found during the parsing process. Error positions are gathered and processed in a final post-processing step. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple-cursor parsing and significantly reduces branch misprediction penalties.
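To show how a single addition can move several cursors at once, the sketch below advances all three start-tag markers of \verb|<t1>abc</t1><t2/>| in parallel. It rests on several assumptions: the scan is written in one common formulation, $(M + C) \land \lnot C$ for a marker stream $M$ and a character-class stream $C$; the helpers \texttt{char\_class}, \texttt{advance}, \texttt{scan\_thru}, \texttt{scan\_to} and \texttt{show} are hypothetical; the name-character test is simplified to ASCII letters and digits; and Python integers, with bit $i$ standing for text position $i$, replace SIMD registers.

```python
TEXT = b"<t1>abc</t1><t2/>"
LENGTH = len(TEXT)
MASK = (1 << LENGTH) - 1  # keeps every stream within the text length


def char_class(predicate) -> int:
    """Build a bitstream with bit i set when predicate(TEXT[i]) holds."""
    stream = 0
    for i, byte in enumerate(TEXT):
        if predicate(byte):
            stream |= 1 << i
    return stream


def advance(markers: int) -> int:
    """Move every cursor forward by one position."""
    return (markers << 1) & MASK


def scan_thru(markers: int, char_set: int) -> int:
    """Move every cursor past a run of char_set positions.

    The carry of the addition ripples through each run of 1 bits, so
    all cursors move with one add, whatever the run lengths are.
    """
    return (markers + char_set) & ~char_set & MASK


def scan_to(markers: int, char_set: int) -> int:
    """Move every cursor forward to the next char_set position."""
    return scan_thru(markers, ~char_set & MASK)


def show(stream: int) -> str:
    """Render a bitstream with position 0 on the left and '.' for 0 bits."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(LENGTH))


name_chars = char_class(lambda b: chr(b).isalnum())  # simplified name test
right_angle = char_class(lambda b: b == ord(">"))

m0 = char_class(lambda b: b == ord("<"))  # one cursor on every '<'
m1 = advance(m0)                          # first byte after each '<'
m2 = scan_thru(m1, name_chars)            # skip over any tag name
m3 = scan_to(m2, right_angle)             # reach the '>' closing each tag
```

No step contains a branch that depends on the input text; the three cursors travel different distances because the carry propagates a different number of positions in each run.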