# Changeset 4492 for docs/Working

Ignore:
Timestamp:
Feb 10, 2015, 8:14:11 PM (4 years ago)
Message:

Location:
docs/Working/icGrep
Files:
2 edited

### Legend:

Unmodified
 r4490 \section{Bitwise Methods for Unicode}\label{sec:Unicode} \subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8} \section{Bitwise Methods for UTF-8}\label{sec:Unicode} % The \icGrep{} parser produces a representation of an input regular expression over code points. % Parabix, however, operates on fixed size code \emph{units} instead of varying sized code \emph{points}. % We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units. % %The transformation accomplishes this by first determining the number of UTF-8 bytes % %required to represent each code point contained within each character class. % The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix. % %The UTF-8 encoded bytes are each assigned to % and assigns these code units to new character classes.% in the new AST. % % % Consider the multibyte Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:. % The parser produces a sequence starting with a character class for \verb:0x244: % followed by a character class for the range from \verb:0x2030: to \verb:0x2137:. % After toUTF-8, %this AST becomes more complex. % the first code point in the sequence becomes the two byte sequence \verb:\u{C9}\u{84}:, % and the character class for the range%, which is a range of three byte sequences, % expands into a series of sequences and alternations necessary to specify the byte encodings within the range. % The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes: % \newline The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points. To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units. \% Consider the Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:. % The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:. % After toUTF8, the first codepoint in the sequence becomes the two byte sequence \verb:\u{C9}\u{84}:'', and the range expands into the series of sequences and alternations shown below: As described in the following section, \icGrep{} is a reimplementation of the bitwise data parallel method implemented on top of LLVM infrastructure and adapted for Unicode regular expression search through data streams represented in UTF-8.  In this section, we present the techniques we have used to extend the bitwise matching techniques to the variable-length encodings of UTF-8. The first requirement in implementing a regular expression processor over UTF-8 data streams is to translate Unicode regular expressions over codepoints to corresponding regular expressions over sequences of UTF-8 bytes.   The {\tt toUTF8} transformation performs this as a regular expression transformation, transforming input expressions such as \verb:\u{244}[\u{2030}-\u{2137}]: to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below: \newline { \small } %The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size. \paragraph{Unicode Advance.}  As illustrated above,  a bitwise shift (Advance) operation is frequently used in shifting a marker stream from positions matched by one regular expression element to the next. However, characters are represented by variable-length byte sequences in UTF-8, so that a simple shift operation is inadequate to implement the operation of advancing bit stream positions from one Unicode character to the next. \subsection{Match Unicode Characters} After the transformation, we can then generate the character class bitstreams with our character class compiler. As shown in Figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes. In order to address the requirements of Unicode advance, we use the $\mathit{ScanThru}$ \cite{cameron2011parallel} operation to move a set of markers each through the nonfinal bytes of UTF-8 sequences to the final byte position.   Figure \ref{fig:multibytesequence} shows this technique in operation in the case of advancing through byte sequences (each 3 bytes in length) corresponding to Chinese characters. To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters. $CC_{\mbox{ni3}}$ is the bitstream that marks character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks character stream \emph{hao}. To match a two UTF-8 character sequence \emph{ni3hao}, we need to first create an \emph{Initial} steam that marks the first byte of all the valid characters. We also need to generate a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte. To match a two UTF-8 character sequence \emph{ni3hao}, we first create an \emph{Initial} steam that marks the first byte of all the valid characters. We also produce a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte. Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_2$, which marks the positions of the last byte of every character. An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character. As illustrated by \emph{Adv}, we find two matches for \emph{ni3} and from these two positions we can start the matching process for the next character \emph{hao}. The final result stream shows 1 match for Multibyte Sequence ni3hao. The final result stream shows 1 match for the multibyte sequence ni3hao. \begin{figure}[tbh] $\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb.....1.................... \end{tabular} \end{center} \caption{Processing of a Multibyte Sequence ni3hao} \end{figure} \subsection{Predefined Unicode classes} \paragraph{Unicode MatchStar.} The Unicode Consortium assigned every Unicode character to a general category, such as letter or punctuation. These categories seldom change, so we precomputed their parallel bitstream equations and statically compiled them into icGrep. Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category. This optimization assumes that most input documents only contain code points from a small number of writing systems. Including equations for code points outside of the used range is unnecessary and will only add to the total running time of the application. The $\mathit{MatchStar}(M, C)$ operation directly implements the operation of finding all positions reachable from a marker bit in $M$ through a character class repetition of an ASCII byte class $C$.   In UTF-8 matching, however, the character class byte streams are marked at their {\em final} positions.   Thus the one bits of a Unicode character class stream are not necessarily contiguous.  This in turn means that the addition operation within the MatchStar operation may terminate prematurely. In order to remedy this problem, \icGrep{} forms two helper bitstreams \emph{Initial} and \emph{NonFinal}.  The {\em Initial} bitstream marks the locations of all single byte characters and the first bytes of all multibyte characters.  Any full match to a multibyte sequence must reach the initial position of the next character. The {\em NonFinal} bitstream consists of all positions except those that are final positions of UTF-8 sequences. It is used to fill in the gaps'' in the CC bitstream so that the MatchStar addition can move through a contiguous sequence of one bits.  In this way, matching of an arbitrary Unicode character class $C$ (with a 1 bit set at final positions of any members of the class), can be implemented using ${\mathit{MatchStar}(M, C) | \mathit{NonFinal})$. \paragraph{Predefined Unicode Classes.} \icGrep{} employs a set of predefined bitstreams that are precompiled into the executable.  These include all bitstreams corresponding to Unicode property expressions such as \verb:\p{Grek}:. Each category potentially contains  many code points, so we further devised an \emph{If Hierarchy} optimization for these properties. This optimization applies whenever an input document contains codepoints from only a small number of writing systems -- the normal case. The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class equations for observed code point ranges.