# Changeset 1045 for docs

Ignore:
Timestamp:
Mar 25, 2011, 10:01:14 PM (8 years ago)
Message:

Update Table 4 and 5.

File:
1 edited

### Legend:

Unmodified
 r1044 \end{figure} In order represent the byte-oriented character data as parallel bitstreams, the source data is first loaded in sequential order and converted into its transposed representation through a series of packs, shifts, and bitwise operations. In order to represent the byte-oriented character data as parallel bitstreams, the source data is first loaded in sequential order and converted into its transposed representation through a series of packs, shifts, and bitwise operations. Using the SIMD capabilities of current commodity processors, this transposition of source data to bitstreams incurs an amortized overhead of about 1 CPU cycle per byte for transposition \cite{CameronHerdyLin2008}. When parsing, we need to consider multiple properties of characters at different stages during the process. Using the basis bitstreams, it is possible to combine them using bitwise logic in order to compute character-class bitstreams; that is, streams that identify the positions at which characters belonging to a specific character class occur. For example, the $j\th$ character is an open angle bracket <' if and only if the $j$-th bit of $B_2, B_3, B_4, B_5 =$ 1 and the $j\th$ bit of $B_0, B_1, B_6, B_7 =$ 0. Once these character-class bitstreams are created, a {\em bit-scan} operation, which is an 1-cycle intrinsic function for commodity processors, can be used for sequential markup scanning and data validation operations. A common operation in all XML parsers is identifying the start tags (<') and their accompanying end tags (either />'' or >'' depending whether the element tag is an empty element tag or not, respectively). \begin{center} \begin{tabular}{lr}\\ source data                                     & \verbabc\\ % $N =$ name chars                             & \verb.11.111.11...11..\\ $M_0 = bitscan(1, \texttt{<})$                  & \verb1..................\\ $M_1 = advance(M_0)$                            & \verb.1.................\\ $M_2 = bitscan(M_1, \texttt{/}, \texttt{>})$    & \verb...1...............\\ $M_3 = advance(M_2)$                            & \verb....1..............\\ $M_4 = bitscan(M_3, \texttt{<})$                & \verb.......1...........\\ $M_5 = bitscan(M_4, \texttt{/}, \texttt{>})$    & \verb..........1........\\ $M_6 = advance(M_5)$                            & \verb...........1.......\\ $M_7 = bitscan(M_6, \texttt{<})$                & \verb.............1.....\\ $M_8 = bitscan(M_7, \texttt{/}, \texttt{>})$    & \verb.................1.\\ $M_9 = advance(M_8)$                            & \verb..................1\\ source data                     & \verbabc\\ $M_0 = 1$                       & \verb1..................\\ $M_1 = advance(M_0)$            & \verb.1.................\\ $M_2 = bitscan('>')$            & \verb...1...............\\ $M_3 = advance(M_2)$            & \verb....1..............\\ $M_4 = bitscan('<')$            & \verb.......1...........\\ $M_5 = advance(M_4)$            & \verb........1..........\\ $M_6 = advance(M_5)$            & \verb.........1.........\\ $M_7 = bitscan('<')$            & \verb............1......\\ $M_{8} = advance(M_7)$ & \verb.............1.....\\ $M_{9} = bitscan('/')$  & \verb.................1.\\ $M_{10} = advance(M_{9})$       & \verb..................1\\ \end{tabular} \end{center} \caption{Parabix1 Start and End Tag Identification} \caption{Parabix1 Start Tag Validation} \label{fig:Parabix1StarttagExample} \end{figure} In Figure \ref{fig:Parabix1StarttagExample}, the first marker stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket <', starting at position 1. Since the source data begins with <', $M_0$ is assigned a cursor position of 1. The $advance$ operation then then shifts the $M_0$'s cursor position by 1, resulting in the creation of a new marker stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an /' or >'. It finds a >' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues until every open and closing angle bracket is located within the basis stream, resulting in a process that requires 3 logical operations to find each individual start and end tag pair. In Figure \ref{fig:Parabix1StarttagExample}, the first marker stream $M_0$ is created and the parser begins scanning the source data for an open angle bracket <', starting at position 1. Since the source data begins with <', $M_0$ is assigned a cursor position of 1. The $advance$ operation then then shifts the $M_0$'s cursor position by 1, resulting in the creation of a new marker stream, $M_1$, with the cursor position at 2. The following $bitscan$ operation takes the cursor position from $M_1$ and sequentially scans every position until it locates either an >'. It finds a >' at position 4 and returns that as the new cursor position for $M_2$. Calculating $M_3$ advances the cursor again, and the $bitscan$ used to create $M_4$ locates the new opening angle bracket. This process continues until in this manner as the start tags are validated. Unlike traditional parsers, these sequential operations are accelerated significantly since the bit-scan operation can skip up to $w$ positions, where $w$ is the processor word width in bits. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, Herdy2008, Cameron2009}. \begin{center} \begin{tabular}{lr}\\ source data                     & \verbabc\\ $N =$ alphanumerics            & \verb.11.111.11...1111..\\ $M_0 = \texttt{[<]}$            & \verb1......1....1......\\ $M_1 = advance(M_0)$            & \verb.1......1....1.....\\ $M_2 = scanto(M_1, N)$          & \verb...1......1......1.\\ $M_3 = scanto(M_2, N)$          & \verb...1.......1......1 source data                     & \verbabc\\ $N =$ alphanumerics            & \verb.11......11..1111..\\ $M_0 = \texttt{[<]}$            & \verb1...........1......\\ $M_1 = advance(M_0)$            & \verb.1...........1.....\\ $M_2 = scanto(M_1, N)$          & \verb...1.............1.\\ $M_3 = scanto(M_2, N)$          & \verb...1..............1` \end{tabular} \end{center} \caption{Parabix2 Start and End Tag Identification} \caption{Parabix2 Start Validation} \label{fig:Parabix2StarttagExample} \end{figure}