# Changeset 895 for docs/EuroPar2011/europar-cameron.tex

Ignore:
Timestamp:
Feb 5, 2011, 5:05:29 PM (8 years ago)
Message:

Trim intro and tighten section 2.

File:
1 edited

### Legend:

Unmodified
 r894 \section{Introduction} Traditional byte-at-a-time parsing technology is increasingly mismatched to the capabilities of modern processors.   Current commodity processors generally possess 64-bit general purpose registers as well as 128-bit SIMD registers, with 256-bit registers now appearing.   General purpose processing on graphics processors can make available 512-bit or wider registers.   Parsing models based on the traditional loading and processing of 8 bits at a time would seem to be greatly underutilizing processor resources. Unfortunately, parsing is hard to parallelize.   Indeed, in their seminal report outlining the landscape of parallel computing research, researchers from Berkeley identified the finite state machine methods underlying parsing and lexical processing as the hardest of the "13 dwarves" to parallelize, concluding at one point that "nothing helps." \cite{Asanovic:EECS-2006-183}   SIMD methods, in particular, would seem to be ill-suited to parsing, because textual data streams are seldom organized in convenient 16-byte blocks, tending to consist instead of variable-length items in generally unpredictable patterns. Nevertheless, there have been some notable works such as that of Scarpazza in applying the multicore and SIMD capabilities of the Cell/BE processor to regular expression matching \cite{Scarpazza:2009}. Intel has also signalled the importance of accelerated string processing to its customers through the introduction of new string processing instructions in the SSE 4.2 instruction set extension, demonstrating how those features may be used to advantage in activities such as XML parsing \cite{XMLSSE42}. Our research has been exploring a promising alternative approach, however, based on the concept of {\em parallel bitstreams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}. In this approach, byte streams are first sliced into eight basis bitstreams, one for each bit position within the byte.  Bit stream $i$ thus comprises the $i$th bit of each byte.   Using 128-bit SIMD registers, then, bitwise logic operations on these basis bitstreams allows byte classification operations to be carried out in parallel for 128 bytes at a time.  For example, consider a character class bitstream \verb:[<]: using a 1 bit to mark the position of opening angle brackets in a byte stream.  This stream may computed as logical combination of the basis bitstreams using only seven bitwise logical operations per 128 bytes. Based on this approach, our prior work has shown how parallel bitstreams may be used to accelerate XML parsing by further taking advantage of processor {\em bit scan} instructions, commonly found within commodity processors \cite{CameronHerdyLin2008}. On current Intel or AMD processors, for example, these instructions allow one to determine the position of the first 1 bit in a group of 64 in a single instruction.   Using these techniques, our Parabix 1 parser demonstrated offered considerable accelaration of XML parsing in statistics gathering \cite{CameronHerdyLin2008} as well as GML to SVG conversion \cite{Herdy2008}. Although the finite state machine methods used in the scanning and parsing of text streams is considered to be the hardest of the 13 dwarves'' to parallelize \cite{Asanovic:EECS-2006-183}, parallel bitstream technology shows considerable promise for these types of applications\cite{PPoPP08,CameronHerdyLin2008,Green2009}. In this approach, character streams are processed $N$ positions at a time using the $N$-bit SIMD registers commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by first slicing the byte streams into eight separate basis bitstreams, one for each bit position within the byte. These basis bitstreams are then combined with bitwise logic and shifting operations to compute further parallel bit streams of interest, such as the \verb:[<]: bit stream marking the position of all opening angle brackets in an XML document. Using these techniques as well as the {\em bit scan} instructions also available on commodity processors, the Parabix 1 XML parser was shown to considerably accelerate XML parsing in comparison with conventional byte-at-a-time parser in applications such as statistics gathering \cite{CameronHerdyLin2008} and as GML to SVG conversion \cite{Herdy2008}. Other efforts to accelerate XML parsing include the use of custom XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and In this paper, we further increase the parallelism in our methods by introducing a new parallel scanning primitive using bitstream addition.   In essence, multiple 1 bits in a marker stream identify current scanning positions for multiple instances of a particular syntactic context within a byte stream. These multiple marker positions may each be independently advanced in parallel using addition and masking. The net result is a new scanning primitive that allows multiple instances of syntactic elements to be parsed simultaneously.   For example, in dense XML markup, one might find several instances of particular types of markup tags within a given 64-byte block of text; parallel addition on 64-bit words allows all such instances to be processed at once. addition.   In essence, this primitive replaces the sequential bit scan operations underlying Parabix 1 with a new approach that independently advances multiple marker bits in parallel using simple addition and logic operations.   This paper documents the technique and evaluates it in application to the problem of XML parsing and well-formedness checking. Section 2 reviews the basics of parallel bitstream technology and introduces our new parallel scanning primitive.  Section 3 goes on to show how this primitive may be used in XML scanning and parsing, while Section 4 discusses the construction of a complete XML well-formedness checker based on these techniques. Section 5 then briefly describes the compiler technology used to generate the low level code for our approach.  A performance study in Section 6 shows that the new Parabix 2 parser is dramatically faster than traditional byte-at-a-time parsers as well as the original Parabix 1 parser, particularly for dense XML markup.  Section 7 concludes the paper. % The remainder of this paper is organized as follows. A bitstream is simply a sequence of $0$s and $1$s, where there is one such bit in the bitstream for each character in a source data stream. For parsing, and other text processing tasks, we need to consider multiple properties of characters at different stages during the parsing process. A bitstream can be associated with each of these properties, and hence there will be multiple (parallel) bitstreams associated with a source data stream of characters \cite{Cameron2009,PPoPP08}. For parsing, and other text processing tasks, we need to consider multiple properties of characters at different stages during the parsing process. A bitstream can be associated with each of these properties, and hence there will be multiple (parallel) bitstreams associated with a source data stream of characters. The starting point for bitstream methods are \emph{basis} bitstreams transposition and less than 1 CPU cycle per byte for all the character classes needed for XML parsing \cite{CameronHerdyLin2008}. Improved instruction sets using parallel extract operations or inductive doubling techniques may further reduce this overhead significantly \cite{CameronLin2009,HilewitzLee2006}. %Improved instruction sets using parallel extract operations or %inductive doubling techniques may further reduce this overhead significantly \cite{CameronLin2009,HilewitzLee2006}. Beyond the bitwise logic needed for character class determination, Figure \ref{fig:scan1} illustrates the basic concept underlying parallel parsing with bitstream addition. As with the previous figures, all streams are shown in little-endian All streams are shown in little-endian representation, with streams reading from right-to-left. The first row shows a source data stream that includes three spans of digits, $13840$, $1139845$, and $127$, with other nondigit characters shown The first row shows a source data stream that includes several spans of digits, together with other nondigit characters shown as hyphens.  The second row specifies the parsing problem using a marker bitstream $M_0$ to mark three initial marker positions at the start of each span of digits. using a marker bitstream $M_0$ to mark four initial marker positions.  In three instances, these markers are at the beginning (i.e., little end) of a span, while one is in the middle of a span. The parallel parsing task is to move each of the three markers forward through the corresponding spans of of the four markers forward through the corresponding spans of digits to the immediately following positions. \begin{figure}[tbh] \begin{center} \begin{tabular}{l@{}lr}\\ \multicolumn{2}{l}{source data $\vartriangleleft$} & \verb--721----5489311-----04831------\\ $M_0$ &                   & \verb....1..........1.........1......\\ $D$   & $= \verb[0..9]$ & \verb..111....1111111.....11111......\\ $M_1$ & $= M_0 + D$      & \verb.1......1...........1........... \begin{tabular}{cr}\\ source data $\vartriangleleft$ & \verb----173942---654----1----49731----321--\\ $M_0 =$ & \verb.........1.....1....1......1...........\\ $D =$\verb:[0-9]: & \verb....111111...111....1....11111....111..\\ $M_0 + D$ & \verb...1........1......1....1...11....111..\\ $M_1 = (M_0 + D) \wedge \neg D$ & \verb...1........1......1....1.............. \end{tabular} \end{center} \caption{Bitstream addition} \caption{Parallel Scan Using Bitstream Addition and Mask} \label{fig:scan1} \end{figure} class bitstreams.  As a marker 1 bit is combined using binary addition to a span of 1s, each 1 in the span becomes 0, generating a carry to add to the next position to the left. For each span, the process terminates at the left end a carry to add to the next position to the left. For each such span, the process terminates at the left end of the span, generating a 1 bit in the immediately following position.   In this way, binary addition produces the marker bitstream $M_1$, with each of the three markers moved independently through their respective spans of digits to the position at the end. However, the simple addition technique shown in Figure \ref{fig:scan1} does not account for digits in the source stream that do not play a role in a particular scanning operation. Figure \ref{fig:scan2} shows an example and how this may be resolved.   The source data stream is again shown in row 1, and the marker bitstream defining the initial marker positions for the the parallel parsing tasks shown in row 2. Row 3 again contains the character class bitstream for digits $D$. Row 4 shows the result of bitstream addition, in which marker bits are advanced, but additional bits not involved in the scan operation are included in the result. However, these are easily removed in row 5, by masking following position.   These generated 1 bits represent the moved marker bits.   However, the result of the addition also produces some additional bits that are not involved in the scan operation. However, these are easily removed as shown in the fifth row, by applying bitwise logic to mask off any bits from the digit bitstream; these can never be marker positions resulting from a scan. (conflict-free) set of initial markers specified in $M_0$. \begin{figure}[tbh] \begin{center} \begin{tabular}{l@{}lr}\\ \multicolumn{2}{l}{source data $\vartriangleleft$} & \verb--134--31--59127---3--3474--\\ $M_0$ &                                                 & \verb....1.........1..........1..\\ $D$   & $= \verb[0..9]$ & \verb..111..11..11111...1..1111..\\ $M_1$ & $= M_0 + D$       & \verb.1.....11.1....1...1.1......\\ $M_2$ & $= (M_0 + D) \wedge \neg D$ & \verb.1........1..........1...... \end{tabular} \end{center} \caption{Parallel Scan Using Addition and Mask} \label{fig:scan2} \end{figure} % The addition and masking technique allows matching of including both document-oriented and data-oriented XML files. The jawiki.xml and dewiki.xml XML files are document-oriented XML instances of Wikimedia books, written in Japanese and German, respectively. The remaining files are data-oriented.  The roads.gml file is an instance of Geography Markup Language (GML), a modeling language for geographic information systems as well as an open interchange format for geographic transactions on the Internet \cite{GML04}.  The po.xml file is an example of purchase order data, while the soap.xml file contains a large SOAP message. a modeling language for geographic information systems as well as an open interchange format for geographic transactions on the Internet.  The po.xml file is an example of purchase order data, while the soap.xml file contains a large SOAP message. Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size. This metric is reported for each document.