Changeset 1104

Ignore:
Timestamp:
Apr 10, 2011, 7:47:40 PM (8 years ago)
Message:

Location:
docs/PACT2011
Files:
 r1102 Throughout the XML parsing process we must identify key XML characters. For example, the opening angle bracket character <'. For this purpose, we combine the basis bit streams using bitwise logic and compute character-class bit streams. For example, the $j$-th character is an open angle bracket <' if and only if the $j$-th bit of $B_2, B_3, B_4, B_5 =$ 1 and the $j$-th bit of $B_0, B_1, B_6, B_7 =$ 0. Character-class streams mark the positions of source characters as a single $1$-bit. Each bit position in the computed bit stream is in one-to-one correspondence with its source byte position.  Once generated, single cycle built-in {\em bitscan} operations are used to locate the positions of key XML character throughout the parsing process. Utilizing $M$ SIMD registers of width $W$, it is possible to scan through $W$ characters in parallel. The register width $W$ varies between 64-bit for MMX, 128-bit for SSE, and 256-bit for AVX. A common operation in XML parsing is XML start tag validation. Starts tags begin with <' and end with either />'' or >'' (depending whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} demonstrates the concept of start tag validation as performed in Parabix1 using character-class streams together with the processor built-in bit scan operation. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket <', starting at position 1. Since the source data begins with <', $M_0$ is assigned a cursor position of 1. The $advance$ operation then shifts $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an >'. It finds a >' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues in sequence until until all start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the {\em bitscan} operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}. A common operation in XML parsing is XML start tag validation. Starts tags begin with <' and end with either />'' or >'' (depending whether the element tag is an empty element tag or not, respectively). Figure \ref{fig:Parabix1StarttagExample} conceptually demonstrates start tag validation as performed in Parabix1 using character-class streams together with the processor built-in bit scan operation. The first bit stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket <', starting at position 1. Since the source data begins with <', $M_0$ is assigned a cursor position of 1. The $advance$ operation then shifts $M_0$'s cursor position by 1, resulting in the creation of a new bit stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an >'. It finds a >' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues in sequence until until all start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the {\em bitscan} operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}. \begin{figure}[h] \end{tabular} \end{center} \caption{Parabix1 Start Tag Validation (Conceptual)} \caption{Parabix1 Start Tag Validation} \label{fig:Parabix1StarttagExample} \end{figure} Unlike the single-cursor approach of Parabix1 (and conceptually of all sequential XML parsers), Parabix2 processes multiple cursors in parallel. For example, using the source data from Figure \ref{fig:Parabix1StarttagExample}, Figure \ref{fig:Parabix2StarttagExample} shows how Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2 begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream. $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by adding $M_1$ to $N$, and uses the bitwise AND operation of the negation of $N$ to find only the true end tags of $M_1$. Because an end tag may end on an /' or '>', $scanto$ is called again to advance any cursor from /' to >'. For additional details, refer to the technical report \cite{Cameron2010}. Figure \ref{fig:Parabix1StarttagExample}, Figure \ref{fig:Parabix2StarttagExample} conceptually demonstrates the manner in which Parabix2 identifies and advances each of the start tag bit streams. Unlike Parabix1, Parabix2 begins scanning by creating two character-class bit streams, $N$, denoting the position of every alpha numeric character within the basis stream, and $M_0$, marking the position of every potential start tag in the bit stream. $M_0$ is advanced to create $M_1$, which is fed into the first $scanto$ operation along with $N$.  To handle variable length tag names, the $scanto$ operation effectively locates the cursor positions of the end tags in parallel by adding $M_1$ to $N$, and uses the bitwise AND operation of the negation of $N$ to find only the true end tags of $M_1$. Because an end tag may end on an /' or '>', $scanto$ is called again to advance any cursor from /' to >'. For additional details, refer to the technical report \cite{Cameron2010}. \end{tabular} \end{center} \caption{Parabix2 Start Tag Validation (Conceptual)} \caption{Parabix2 Start Tag Validation} \label{fig:Parabix2StarttagExample} \end{figure}