% Source: docs/HPCA2012/03-research.tex, revision 1331

%Describe key technology behind Parabix
%Introduce SIMD;
%Talk about SSE
%Highlight which SSE instructions are important
%Talk about each pass in the parser; How SSE is used in every phase...
%Benefits of SSE in each phase.

% Extract section 2.2 and merge into 3.   Add a new subsection
% in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
% operations and then mention the pack operations that allow
% us to implement transposition efficiently in parallel.
% Also note that the SIMD registers support bitwise logic across
% their full width and that this is extensively used in our work.

% Also, it could be good to have a small excerpt of a byte-at-a-time
% scanning loop for XML, e.g., extracted from Xerces in section 2.1.
% Just a few lines showing the while loop - Linda can tell you the file.

% This section focuses on the

% With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel \cite{CameronLin2009}.

% The results of \cite{CameronHerdyLin2008} showed that Parabix, the predecessor of Parabix2, was dramatically faster than both Expat 2.0.1 and Xerces-C++ 2.8.0.
% It is our expectation that Parabix2 will outperform both Expat 2.0.1 and Xerces-C++ 3.1.1 in terms of energy consumption per source XML byte.
% This expectation is based on the relatively-branchless code composition of Parabix2 and the more-efficient utilization of last-level cache resources.
% The authors of \cite{bellosa2001, bircher2007, bertran2010} indicate that such factors have a considerable effect on overall energy consumption.
% Hence, one of the foci in our study is the manner in which straight line SIMD code influences energy usage.

This section provides an overview of the SIMD-based parallel bit stream XML parsers, Parabix1 and Parabix2. A comprehensive study of Parabix2 can be found in the technical report ``Parallel Parsing with Bitstream Addition: An XML Case Study'' \cite{Cameron2010}.

% Our first generation parallel bitstream XML parser, Parabix1, employs a less conventional approach of SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one-correspondence with the bytes of a character stream. A transposition step first transforms sequential byte stream data into eight basis bitstreams for the bits of each byte.

Parabix1 processes source XML in a manner functionally equivalent to a traditional recursive descent XML parser. That is, Parabix1 moves sequentially through the source document, maintains a single parser cursor position, and parses recursively and depth-first. Where Parabix1 differs from the traditional parser is that it scans for key markup characters using a series of bit streams. A bit stream is simply a sequence of $0$s and $1$s. A $1$-bit marks the position of each key character in the corresponding source data stream. A single stream is generated for each of the key markup characters.

In Parabix1, basis bit streams are used to generate character-class streams for key markup characters. Basis bit streams are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other words, $M$-bit source characters are represented in transposed form using $M$ basis bit streams. Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented by periods to emphasize the $1$ bits.

\begin{figure}[h]
\begin{center}
\begin{tabular}{ll}
source data & \verb`<t1>abc</t1><tag2/>`\\
$B_0$ & \verb`..1.1.1.1.1...11.1.`\\
$B_1$ & \verb`...1.11.1..1...1111`\\
$B_2$ & \verb`11.1...111.111.1.11`\\
$B_3$ & \verb`1..1...11..11....11`\\
$B_4$ & \verb`1111...1.11111..1.1`\\
$B_5$ & \verb`1111111111111111111`\\
$B_6$ & \verb`.1..111..1...111...`\\
$B_7$ & \verb`...................`\\
\end{tabular}
\end{center}
\caption{Example 8-bit ASCII Character Basis Bit Streams}
\label{fig:BitstreamsExample}
\end{figure}

To transform byte-oriented character data to the parallel bit stream representation, source data is first loaded into SIMD registers in sequential order. It is then converted to the transposed basis bit stream representation through a series of parallel SIMD pack, shift, and bitwise logical operations. Using the SIMD capabilities of current commodity processors, the transposition of source data to basis bit stream format incurs an amortized cost of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
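
The layout of the transposed data can be illustrated with a minimal Python sketch. It sets one bit at a time, so it shows only the resulting representation and not the SIMD pack-based transposition described above. The helper names \verb|basis_bit_streams| and \verb|render| are our own for this illustration and are not part of Parabix. Each stream is held in an integer in which bit $j$ corresponds to source position $j$, counting from 0.

```python
def basis_bit_streams(data):
    """Return [B_0, ..., B_7]; bit j of B_k is bit k of byte data[j]."""
    streams = [0] * 8
    for j, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                streams[k] |= 1 << j
    return streams


def render(stream, length):
    """Show position 0 first, with '.' for 0 bits, as in the figure."""
    return "".join("1" if (stream >> j) & 1 else "." for j in range(length))


source = b"<t1>abc</t1><tag2/>"
B = basis_bit_streams(source)
rows = [render(stream, len(source)) for stream in B]
for k, row in enumerate(rows):
    print(f"B_{k}  {row}")
```

Under these assumptions, the printed rows correspond to the streams $B_0 \ldots B_7$ for the example source data.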

Throughout the XML parsing process we must identify key XML characters, such as the opening angle bracket character `<'. For this purpose, we combine the basis bit streams using bitwise logic to compute character-class bit streams. For example, the $j$-th character is an open angle bracket `<' if and only if the $j$-th bits of $B_2$, $B_3$, $B_4$, and $B_5$ are all $1$ and the $j$-th bits of $B_0$, $B_1$, $B_6$, and $B_7$ are all $0$. A character-class stream marks each occurrence of its character with a single $1$-bit, and each bit position in the computed stream is in one-to-one correspondence with its source byte position. Once generated, single-cycle built-in {\em bitscan} operations are used to locate the positions of key XML characters throughout the parsing process. Utilizing $M$ SIMD registers of width $W$, it is possible to scan through $W$ characters in parallel. The register width $W$ is processor dependent and ranges from 64 bits for MMX, to 128 bits for SSE, and 256 bits for AVX.
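
As a hedged illustration of the character-class computation, the following Python sketch derives the stream for `<' from the basis bit streams and then walks its $1$-bits with a software stand-in for the hardware {\em bitscan}. The function names are hypothetical, unbounded Python integers replace SIMD registers, and positions are counted from 0.

```python
def basis_bit_streams(data):
    """Return [B_0, ..., B_7]; bit j of B_k is bit k of byte data[j]."""
    streams = [0] * 8
    for j, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                streams[k] |= 1 << j
    return streams


def bitscan(stream, start):
    """Position of the first 1 bit at or after start, or -1 if none."""
    rest = stream >> start
    if rest == 0:
        return -1
    # rest & -rest isolates the lowest set bit of rest.
    return start + (rest & -rest).bit_length() - 1


source = b"<t1>abc</t1><tag2/>"
B = basis_bit_streams(source)
all_positions = (1 << len(source)) - 1

# '<' is 0x3C: bits 2..5 are set and bits 0, 1, 6, 7 are clear.
lt = B[2] & B[3] & B[4] & B[5] & ~(B[0] | B[1] | B[6] | B[7]) & all_positions

positions = []
pos = bitscan(lt, 0)
while pos != -1:
    positions.append(pos)
    pos = bitscan(lt, pos + 1)
print(positions)
```

The whole character-class stream is produced by a fixed sequence of bitwise operations with no per-character comparison, which is the property the SIMD implementation exploits across $W$ positions at once.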

A common operation in XML parsing is XML start tag validation. Start tags begin with `<' and end with either ``/>'' or ``>'' (depending on whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} conceptually demonstrates start tag validation as performed in Parabix1 using character-class streams together with the processor built-in $bitscan$ operation. We proceed as follows. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket `<', starting at position 1. Since the source data begins with `<', $M_0$ is assigned a cursor position of 1. The $advance$ operation then shifts $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and scans forward until it locates a `>'. It finds a `>' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the next opening angle bracket. This process continues in sequence until all start tags are validated. Unlike in traditional parsers, these sequential operations are accelerated significantly, since the {\em bitscan} operation can skip up to $w$ positions at a time, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}.

\begin{figure}[h]
\begin{center}
\begin{tabular}{ll}
source data & \verb`<t1>abc</t1><tag2/>`\\
$M_0 = 1$ & \verb`1..................`\\
$M_1 = advance(M_0)$ & \verb`.1.................`\\
$M_2 = bitscan('>')$ & \verb`...1...............`\\
$M_3 = advance(M_2)$ & \verb`....1..............`\\
$M_4 = bitscan('<')$ & \verb`.......1...........`\\
$M_5 = advance(M_4)$ & \verb`........1..........`\\
$M_6 = advance(M_5)$ & \verb`.........1.........`\\
$M_7 = bitscan('<')$ & \verb`............1......`\\
$M_{8} = advance(M_7)$ & \verb`.............1.....`\\
$M_{9} = bitscan('/')$ & \verb`.................1.`\\
$M_{10} = advance(M_{9})$ & \verb`..................1`\\
\end{tabular}
\end{center}
\caption{Parabix1 Start Tag Validation}
\label{fig:Parabix1StarttagExample}
\end{figure}
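
The single-cursor sequence of Figure \ref{fig:Parabix1StarttagExample} can be traced with a small Python sketch. This is an illustration under our own assumptions, not the Parabix1 implementation: \verb|char_class| builds each character-class stream by direct comparison rather than from basis bit streams, \verb|bitscan| is a software stand-in for the processor instruction, and cursor positions are counted from 0 rather than from 1 as in the text.

```python
def char_class(data, ch):
    """Bit stream with a 1 bit at every position holding character ch."""
    stream = 0
    for j, byte in enumerate(data):
        if byte == ord(ch):
            stream |= 1 << j
    return stream


def advance(cursor):
    """Move the single cursor forward by one position."""
    return cursor + 1


def bitscan(stream, cursor):
    """First marked position at or after the cursor."""
    rest = stream >> cursor
    if rest == 0:
        raise ValueError("no marked position at or after the cursor")
    return cursor + (rest & -rest).bit_length() - 1


source = b"<t1>abc</t1><tag2/>"
lt, gt, slash = (char_class(source, c) for c in "<>/")

# None means advance; a stream means bitscan in that stream (M_1 .. M_10).
steps = [None, gt, None, lt, None, None, lt, None, slash, None]
trace = [0]  # cursor position of M_0
for stream in steps:
    cursor = trace[-1]
    trace.append(advance(cursor) if stream is None else bitscan(stream, cursor))
print(trace)
```

Each entry of \verb|trace| is the 0-based cursor position of the corresponding stream $M_0 \ldots M_{10}$; only one cursor is live at any time, which is the limitation that Parabix2 removes.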

In Parabix2, the sequential single-cursor parsing approach using {\em bitscan} instructions is replaced by a parallel parsing approach that uses multiple cursors when possible and bit stream addition operations to advance multiple cursor positions in parallel. This contrasts with the single-cursor approach of Parabix1 (and, conceptually, of all sequential XML parsers). For example, using the source data from
Figure \ref{fig:Parabix1StarttagExample}, Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2 begins scanning by creating two character-class bit streams: $N$, denoting the positions of the alphanumeric characters that form tag names, and $M_0$, marking the position of every potential start tag in the bit stream. $M_0$ is advanced to create $M_1$, which is fed into the $scanthru$ operation along with $N$. To handle variable-length tag names, $scanthru$ locates the ends of all tag names in parallel by adding $M_1$ to $N$ and then taking the bitwise AND of the sum with the negation of $N$, which leaves a single cursor immediately after each tag name marked in $M_1$. Because a start tag may end with `/' or `>', a scan operation is then applied again to advance any cursor from `/' to `>'. For additional details, refer to the technical report \cite{Cameron2010}.

\begin{figure}[h]
\begin{center}
\begin{tabular}{ll}
source data & \verb`<t1>abc</t1><tag2/>`\\
$N = $ Tag Names & \verb`.11......11..1111..`\\
$M_0 = \texttt{[<]}$ & \verb`1...........1......`\\
$M_1 = advance(M_0)$ & \verb`.1...........1.....`\\
$M_2 = scanthru(M_1, N)$ & \verb`...1.............1.`\\
\end{tabular}
\end{center}
\caption{Parabix2 Start Tag Validation}
\label{fig:Parabix2StarttagExample}
\end{figure}
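
The addition-based scan of Figure \ref{fig:Parabix2StarttagExample} can be reproduced with a few lines of Python. This is a sketch under our own assumptions: the streams $N$ and $M_0$ are copied from the figure rather than computed from the source data, unbounded Python integers stand in for SIMD registers, and bit $j$ of each integer corresponds to position $j$ so that carries propagate toward later positions. The helpers \verb|parse| and \verb|render| are hypothetical.

```python
def parse(row):
    """Convert a figure row ('.' = 0, '1' = 1, position 0 first) to an int."""
    return sum(1 << j for j, c in enumerate(row) if c == "1")


def render(stream, length):
    """Inverse of parse for the first `length` positions."""
    return "".join("1" if (stream >> j) & 1 else "." for j in range(length))


def advance(markers):
    """Move every cursor forward by one position at once."""
    return markers << 1


def scanthru(markers, run):
    """Move each cursor past the run of 1 bits it sits on.

    The addition carries each cursor through its run; the AND with the
    negated run stream then clears the bits of runs that had no cursor.
    """
    return (markers + run) & ~run


N = parse(".11......11..1111..")   # tag-name characters
M0 = parse("1...........1......")  # potential start tags
M1 = advance(M0)
M2 = scanthru(M1, N)
print(render(M1, 19))
print(render(M2, 19))
```

Both cursors are moved by a single addition, regardless of how long each tag name is, and the run of name characters in the end tag is discarded because no cursor entered it.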

In general, the set of bit positions in a bit stream may be considered to be the current parsing
positions of multiple parses taking place in parallel throughout the source data stream. Although it is not explicitly shown in the preceding examples, error bit streams can be used to identify any well-formedness errors found during the parsing process. Error positions are gathered and
processed in a final post-processing step. A further benefit of the parallel cursor method with bit stream addition is that the conditional branch statements used to identify syntax errors at each parsing position are eliminated. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple-cursor parsing and further reduces branch misprediction penalties.
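
To make the idea of an error bit stream concrete, the following Python sketch applies one hypothetical well-formedness rule: after a tag name has been scanned, the cursor must rest on `>' or `/'. The rule, the function names, and the malformed input are our own illustrative assumptions and are not taken from Parabix2; the streams $N$ and $M_0$ are copied from Figure \ref{fig:Parabix2StarttagExample} and happen to be valid for both inputs used here.

```python
def parse(row):
    """Convert a figure row ('.' = 0, '1' = 1, position 0 first) to an int."""
    return sum(1 << j for j, c in enumerate(row) if c == "1")


def char_class(data, chars):
    """Bit stream marking every position whose byte is in chars."""
    stream = 0
    for j, byte in enumerate(data):
        if byte in chars:
            stream |= 1 << j
    return stream


def scanthru(markers, run):
    """Move each cursor past the run of 1 bits it sits on."""
    return (markers + run) & ~run


def name_end_errors(source):
    """1 bits mark cursors that follow a tag name but are not on '>' or '/'."""
    N = parse(".11......11..1111..")   # tag-name characters (from the figure)
    M0 = parse("1...........1......")  # potential start tags (from the figure)
    cursors = scanthru(M0 << 1, N)
    allowed = char_class(source, b">/")
    # One AND-NOT checks every cursor at once; no per-cursor branch is taken.
    return cursors & ~allowed


ok = name_end_errors(b"<t1>abc</t1><tag2/>")
bad = name_end_errors(b"<t1>abc</t1><tag2!>")
print(ok, bad)
```

The error stream is empty for the well-formed input and has a single $1$-bit at the offending position for the malformed one; in either case the check itself runs without a data-dependent branch, and the stream only needs to be inspected once at the end.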