Changeset 1104 for docs


Ignore:
Timestamp:
Apr 10, 2011, 7:47:40 PM (8 years ago)
Author:
ksherdy
Message:

Add chart data. Minor edits.

Location:
docs/PACT2011
Files:
1 added
3 edited

Legend:

Unmodified
Added
Removed
  • docs/PACT2011/03-research.tex

    r1102 r1104  
    6464Throughout the XML parsing process we must identify key XML characters. For example, the opening angle bracket character `<'. For this purpose, we combine the basis bit streams using bitwise logic and compute character-class bit streams. For example, the $j$-th character is an open angle bracket `<' if and only if the $j$-th bit of $B_2, B_3, B_4, B_5 =$ 1 and the $j$-th bit of $B_0, B_1, B_6, B_7 =$ 0. Character-class streams mark the positions of source characters as a single $1$-bit. Each bit position in the computed bit stream is in one-to-one correspondence with its source byte position.  Once generated, single cycle built-in {\em bitscan} operations are used to locate the positions of key XML character throughout the parsing process. Utilizing $M$ SIMD registers of width $W$, it is possible to scan through $W$ characters in parallel. The register width $W$ varies between 64-bit for MMX, 128-bit for SSE, and 256-bit for AVX.
    6565
    66 A common operation in XML parsing is XML start tag validation. Starts tags begin with `<' and end with either ``/>'' or ``>'' (depending whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} demonstrates the concept of start tag validation as performed in Parabix1 using character-class streams together with the processor built-in bit scan operation. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket `<', starting at position 1. Since the source data begins with `<', $M_0$ is assigned a cursor position of 1. The $advance$ operation then shifts $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an `>'. It finds a `>' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues in sequence until until all start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the {\em bitscan} operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}.
     66A common operation in XML parsing is XML start tag validation. Starts tags begin with `<' and end with either ``/>'' or ``>'' (depending whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} conceptually demonstrates start tag validation as performed in Parabix1 using character-class streams together with the processor built-in bit scan operation. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket `<', starting at position 1. Since the source data begins with `<', $M_0$ is assigned a cursor position of 1. The $advance$ operation then shifts $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an `>'. It finds a `>' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues in sequence until until all start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the {\em bitscan} operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}.
    6767
    6868\begin{figure}[h]
     
    8383\end{tabular}
    8484\end{center}
    85 \caption{Parabix1 Start Tag Validation (Conceptual)}
     85\caption{Parabix1 Start Tag Validation}
    8686\label{fig:Parabix1StarttagExample}
    8787\end{figure}
     
    9292Unlike the single-cursor approach of Parabix1 (and conceptually of all sequential XML parsers),
    9393Parabix2 processes multiple cursors in parallel. For example, using the source data from
    94 Figure \ref{fig:Parabix1StarttagExample}, Figure \ref{fig:Parabix2StarttagExample} shows how Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2 begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream. $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by adding $M_1$ to $N$, and uses the bitwise AND operation of the negation of $N$ to find only the true end tags of $M_1$. Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional details, refer to the technical report \cite{Cameron2010}.
     94Figure \ref{fig:Parabix1StarttagExample}, Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2 begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream. $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by adding $M_1$ to $N$, and uses the bitwise AND operation of the negation of $N$ to find only the true end tags of $M_1$. Because an end tag may end on an `/' or '>', $scanto$ is called again to advance any cursor from `/' to `>'. For additional details, refer to the technical report \cite{Cameron2010}.
    9595
    9696
     
    106106\end{tabular}
    107107\end{center}
    108 \caption{Parabix2 Start Tag Validation (Conceptual)}
     108\caption{Parabix2 Start Tag Validation}
    109109\label{fig:Parabix2StarttagExample}
    110110\end{figure}
Note: See TracChangeset for help on using the changeset viewer.