# Changeset 1397 for docs

Ignore:
Timestamp:
Aug 30, 2011, 6:45:58 PM (8 years ago)
Message:

edits

File:
1 edited

### Legend:

Unmodified
 r1393 \subsection{Parallel Bit Streams} The fundamental difference between the Parabix framework and \caption{Example 7-bit ASCII Basis Bit Streams} \label{fig:BitstreamsExample} \label{fig:bit streamsExample} \end{figure} process and validate the source document. The remainder of this section will discuss each type of bit stream. {\bf Basis Bit Streams:} % are defined as the set of bit streams that represent the transposed data format of the source XML byte data. In other % words, $M$-bit source characters are represented in transposed representation using $M$ basis bit streams. % Figure \ref{fig:BitstreamsExample} presents an example of the basis bit stream representation of 8-bit % Figure \ref{fig:bit streamsExample} presents an example of the basis bit stream representation of 8-bit % ASCII characters. $B_0 \ldots B_7$ are the individual bit streams. The $0$ bits in the bit streams are represented % by periods as to emphasize the $1$ bits. metadata parsing. % For example, in a CSV file, any ,' or \textbackslash n' can indicate the start of a new column or row respectively. For example, in XML, any opening angle bracket character, \verb<', may indicate that we are starting a new markup tag.  Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set of known code points and branching appropriately when one is found, typically using an if or switch statement.  Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms to do so correctly. For example, in XML, any opening angle bracket character, \verb<', may indicate that we are starting a new markup tag. Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set of known significant characters and branching appropriately when one is found, typically using an if or switch statement. Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms to do so correctly. % However, a <' is legal within an XML comment so not every <' necessarily means that we are opening a new tag. {\bf Lexical and Error Bit Streams:} To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.  Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.  Error bit streams are derived from the lexical bit streams and can be used to identify any well-formedness issues found during the parsing process. The presense of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity of the error. To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.  The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits, and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.  $\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent $\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$.  For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb<[a-zA-Z]+> and wish to find all instances of it in the source text.  We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all \verb>'s, and $C_{2}$, which marks all \verb<'s. In $L_0$ the position of every \verb<' is advanced by one to locate the first character of each token. By computing $E_{0}$, the parser notes that \verb<>'' does not match the expected pattern. To find the end positions of each token, the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each token ends with a \verb>' and discovers that \verb and wish to find all instances of it in the source text. We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all \verb>'s, and $C_{2}$, which marks all \verb<'s. In $L_0$ the position of every \verb<' is advanced by one to locate the first character of each token. By computing $E_{0}$, the parser notes that \verb<>'' does not match the expected pattern. To find the end positions of each token, the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each token ends with a \verb>' and discovers that \verb`