# Changeset 1774 for docs/HPCA2012/final_ieee/03-research.tex

Ignore:
Timestamp:
Dec 13, 2011, 4:50:42 PM (8 years ago)
Message:

minor changes

File:
1 edited

### Legend:

Unmodified
 r1751 traditional text processing models is in how Parabix represents the source data.  Given a traditional byte-oriented text stream, Parabix first transposes the text data to a transform domain consisting of 8 parallel bit streams, known as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream of $k$-th bit of first transposes the text data into a transform representation consisting of 8 parallel bit streams, known as {\em basis bit streams} wherein basis bit stream $b_{k}$ represents the stream of $k$-th bit of each byte in the source text.  That is, the $k$-th bit of $i$-th byte in the source text is in the $i$-th (bit) position of the $k$-th basis in the source text is in one-to-one correspondence with the $i$-th bit of the $k$-th basis bit stream, $b_{k}$.  For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string {\ttfamily b7\verb<A}'' is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been \ldots 7}$. The bits used to construct$\tt b_7$are highlighted in this example. use the 128-bit SIMD registers commonly found on commodity processors (e.g. SSE on Intel) to process 128 byte positions at a time using bitwise logic, shifting and other operations. bitwise logical, shift and arithmetic operations. Just as forward and inverse Fourier transforms are used to transform between the time and frequency domains in signal processing, bit stream transposition and inverse transposition provides byte space'' and bit space'' views of text. The goal of the Parabix framework is and bit space'' domains for text. The goal of the Parabix framework is to support efficient text processing using these two equivalent representations in the same way that efficient signal processing representations in the analogous way that efficient signal processing benefits from the use of the frequency domain in some cases and the time domain in others. metadata parsing. % For example, in a CSV file, any ,' or \textbackslash n' can indicate the start of a new column or row respectively. For example, in XML, any opening angle bracket character, \verb<', may indicate that we are starting a new markup tag. For example, in XML, any opening angle bracket character, \verb<', may indicate the start of a markup tag. Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set of known significant characters and branching appropriately when one is found, typically using an if or switch statement. % However, a <' is legal within an XML comment so not every <' necessarily means that we are opening a new tag. Character-class bit streams allow us to perform up to 128 comparisons'' in parallel with a single operation by using a series of boolean-logic operations \footnote{$\land$,$\lor$and$\lnot$denote the boolean AND, OR and NOT operations.} to merge multiple basis bit streams into a single character-class stream that marks the positions of key characters with a$1$. For example, a character is Character-class bit streams enable up to 128 comparisons'' in parallel through a series of boolean-logic operations \footnote{$\land$,$\lor$and$\lnot$denote the boolean AND, OR and NOT operations.} that merge multiple basis bit streams into a single character-class stream that marks the positions of key characters. For example, a character is an \verb<' if and only if$\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. Classes of characters can be found with similar formulas. For example, a b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Addition character classes can be determined with similar formulas.  For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using a mixture of both boolean logic and arithmetic operations. Lexical bit streams typically mark multiple current parsing positions. Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel. Unlike the single-cursor approach of traditional text parsers, the marking of multiple lexical items allows Parabix to process multiple items in parallel. Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness issues found during the parsing process. The presence of a $\tt 1$ in an error stream indicates that the lexical stream cannot be trusted to be completely accurate and it may be necessary to perform some sequential parsing on that section to determine the cause and severity of the error. %How errors are handled depends on the logical implications of the error and go beyond the scope of this paper. To form lexical bit streams, we have to introduce a few new operations: {\tt Advance} and {\tt ScanThru}. issues found during the parsing process. A $\tt 1$ bit in an error stream indicates the precense of a potential error that may require further processing to determine cause and severity. To form lexical bit streams we introduce two new operations: {\tt Advance} and {\tt ScanThru}. The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits, and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right. {\tt ScanThru} accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent $\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$. {\tt ScanThru} accepts two input parameters, $c$ and $m$, where $c$ denotes an initial set of cursor positions, and $m$ denotes a set of marked'' lexical item positions. The ScanThru operation determines the cursor positions immediately following any run of marker positions by calculating $(c + m) \land \lnot m$. For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb<[a-zA-Z]+> and wish to find all instances of it in the source text. token. By computing $E_{0}$, the parser notes that \verb<>'' does not match the expected pattern. To find the end positions of each token, the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each token ends with a \verb>' and discovers that \verb' and discovers that \verb`