Changeset 1102 for docs


Ignore:
Timestamp:
Apr 8, 2011, 9:00:21 PM (8 years ago)
Author:
ksherdy
Message:

Edits.

Location:
docs/PACT2011
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/PACT2011/03-research.tex

    r1101 r1102  
    3838% Our first generation parallel bitstream XML parser---Parabix1---uses employs a less conventional approach of SIMD technology to represent text in parallel bitstreams. Bits of each stream are in one-to-one-correspondence with the bytes of a character stream. A transposition step first transforms sequential byte stream data into eight basis bitstreams for the bits of each byte.
    3939
    40 Parabix1 processes source XML in a functionally equivalent manner as a traditional recursive descent XML parser. That is, Parabix1 moves sequentially through the source document, maintaining a single cursor position throughout the parsing process, and parsers recursively, depth first. Where Parabix1 differs from the traditional parser is that it scans for key markup characters using a series of basis bit streams.
    41 A bit stream is simply a sequence of $0$s and $1$s, where there is one such bit in the bit stream for each character in a source data stream. A basis bit stream is a bit stream that consists of only transposed textual XML data.
    42 In other words, a source character consisting of $M$ bits can be represented with $M$ bit streams and
    43 by utilizing $M$ SIMD registers of width $W$, it is possible to scan through $W$ characters in parallel.
    44 The register width $W$ varies between 64-bit for MMX, 128-bit for SSE, and 256-bit for AVX.
    45 Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented by periods to emphasize the $1$ bits.
     40Parabix1 processes source XML in a functionally equivalent manner as a traditional recursive descent XML parser. That is, Parabix1 moves sequentially through the source document, maintains a single parser cursor position, and parsers recursively and depth-first. Where Parabix1 differs from the traditional parser is that it scans for key markup characters using a series of bit streams. A bit stream is simply a sequence of $0$s and $1$s. A $1$-bit marks the postion of each key character in the corresponding source data stream. A single stream is generated for each of the key markup characters.
     41
     42In Parabix1, basis bit streams are used to generate character-class streams for key markup characters. Basis bit streams are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other words, $M$ bit source characters are represented in transposed representation using $M$ basis bit streams. Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented by periods as to emphasize the $1$ bits.
    4643
    4744\begin{figure}[h]
     
    6562To transform byte-oriented character data to parallel bit stream representation, source data is first loaded into SIMD registered in sequential order. It is then converted to the transposed basis bit stream representation through a series of packs, shifts, and logical bitwise operations. Using the SIMD capabilities of current commodity processors, the transposition of source data to basis bit stream format incurs an amortized cost of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
    6663
    67 Throughout the XML parsing process we must identify significant XML characters. For example, the next opening angle bracket character `<'. For this purpose, we combine the basis bit streams using bitwise logic and compute character-class bit streams. Character-class streams mark the positions of characters as a single $1$-bit if present, $0$ otherwise. Each bit position in the computed bit stream is in one-to-one correspondence to the source data byte positions. For example, the $j$-th character is an open angle bracket `<' if and only if the $j$-th bit of $B_2, B_3, B_4, B_5 =$ 1 and the $j$-th bit of $B_0, B_1, B_6, B_7 =$ 0. Once generated, single cycle built-in {\em bit scan} operations are used to locate the positions significant XML character throughout parsing.
     64Throughout the XML parsing process we must identify key XML characters. For example, the opening angle bracket character `<'. For this purpose, we combine the basis bit streams using bitwise logic and compute character-class bit streams. For example, the $j$-th character is an open angle bracket `<' if and only if the $j$-th bit of $B_2, B_3, B_4, B_5 =$ 1 and the $j$-th bit of $B_0, B_1, B_6, B_7 =$ 0. Character-class streams mark the positions of source characters as a single $1$-bit. Each bit position in the computed bit stream is in one-to-one correspondence with its source byte position.  Once generated, single cycle built-in {\em bitscan} operations are used to locate the positions of key XML character throughout the parsing process. Utilizing $M$ SIMD registers of width $W$, it is possible to scan through $W$ characters in parallel. The register width $W$ varies between 64-bit for MMX, 128-bit for SSE, and 256-bit for AVX.
    6865
    69 A common operation in XML parsing is XML start tag validation. Starts tags begin with `<' and end with either ``/>'' or ``>'' (depending whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} demonstrates the concept of start tag validation as performed in Parabix1 using character-class streams together with the processor built-in bit scan operation. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket `<', starting at position 1. Since the source data begins with `<', $M_0$ is assigned a cursor position of 1. The $advance$ operation then then shifts the $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an `>'. It finds a `>' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues in sequence until until all start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the bit scan operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}.
     66A common operation in XML parsing is XML start tag validation. Starts tags begin with `<' and end with either ``/>'' or ``>'' (depending whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} demonstrates the concept of start tag validation as performed in Parabix1 using character-class streams together with the processor built-in bit scan operation. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket `<', starting at position 1. Since the source data begins with `<', $M_0$ is assigned a cursor position of 1. The $advance$ operation then shifts $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an `>'. It finds a `>' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues in sequence until until all start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the {\em bitscan} operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}.
    7067
    7168\begin{figure}[h]
     
    9289\subsection{Parabix2}
    9390
    94 In Parabix2, the sequential single-cursor parsing approach using bit scan instructions is replaced by a parallel parsing approach, using multiple cursors when possible, and bit stream addition operations to advance cursor positions.
     91In Parabix2, the sequential single-cursor parsing approach using {\em bitscan} instructions is replaced by a parallel parsing approach, that uses multiple cursors when possible, and bit stream addition operations to advance cursor positions.
    9592Unlike the single-cursor approach of Parabix1 (and conceptually of all sequential XML parsers),
    9693Parabix2 processes multiple cursors in parallel. For example, using the source data from
     
    118115processed in as a final post processing step. A further aspect of the parallel cursor method with bit stream addition is that the conditional branch statements used to identify
    119116syntax error at each each parsing position are eliminated. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple cursor
    120 parsing as well as further reducing branch misprediction penalties.
     117parsing and further reduces branch misprediction penalties.
    121118
    122119
Note: See TracChangeset for help on using the changeset viewer.