Feb 10, 2015, 8:14:11 PM (4 years ago)


1 edited


  • docs/Working/icGrep/unicode-re.tex

    r4490 r4492  
    1 \section{Bitwise Methods for Unicode}\label{sec:Unicode}
    2 \subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}
     1\section{Bitwise Methods for UTF-8}\label{sec:Unicode}
    4 % The \icGrep{} parser produces a representation of an input regular expression over code points.
    5 % Parabix, however, operates on fixed size code \emph{units} instead of varying sized code \emph{points}.
    6 % We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
    7 % %The transformation accomplishes this by first determining the number of UTF-8 bytes
    8 % %required to represent each code point contained within each character class.
    9 % The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix.
    10 % %The UTF-8 encoded bytes are each assigned to
    11 % and assigns these code units to new character classes.% in the new AST.
    12 % %
    13 % Consider the multibyte Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
    14 % The parser produces a sequence starting with a character class for \verb:0x244:
    15 % followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
    16 % After toUTF-8, %this AST becomes more complex.
    17 % the first code point in the sequence becomes the two byte sequence `\verb:\u{C9}\u{84}:`,
    18 % and the character class for the range%, which is a range of three byte sequences,
    19 % expands into a series of sequences and alternations necessary to specify the byte encodings within the range.
    20 % The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
    21 % \newline
    22 The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points.
    23 To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units.   
    24 \%
    25 Consider the Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
    26 %
    27 The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
    28 %
    29 After toUTF8, the first codepoint in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
    30 and the range expands into the series of sequences and alternations shown below:
     3As described in the following section, \icGrep{} is a reimplementation
     4of the bitwise data parallel method implemented on top of LLVM
     5infrastructure and adapted for Unicode regular expression search
     6through data streams represented in UTF-8.  In this section,
     7we present the techniques we have used to extend the bitwise
     8matching techniques to the variable-length encodings of UTF-8.
     10The first requirement in implementing a regular expression processor
     11over UTF-8 data streams is to translate Unicode regular expressions
     12over codepoints to corresponding regular expressions over
     13sequences of UTF-8 bytes.   The {\tt toUTF8} transformation
     14performs this as a regular expression transformation, transforming
     15input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:`
     16to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below:
    39 %The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size.
     24\paragraph{Unicode Advance.}  As illustrated above,  a bitwise shift
     25(Advance) operation is frequently used in shifting a marker stream
     26from positions matched by one regular expression element to the next.
     27However, characters are represented by variable-length byte sequences
     28in UTF-8, so that a simple shift operation is inadequate to implement
     29the operation of advancing bit stream positions from one Unicode
     30character to the next. 
    42 \subsection{Match Unicode Characters}
    43 After the transformation, we can then generate the character class bitstreams with our character class compiler.
    44 As shown in Figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes.
     32In order to address the requirements of Unicode advance, we use
     33the $\mathit{ScanThru}$ \cite{cameron2011parallel} operation to move a set of markers
     34each through the nonfinal bytes of UTF-8 sequences to the final
     35byte position.   Figure \ref{fig:multibytesequence} shows this
     36technique in operation in the case of advancing through byte
     37sequences (each 3 bytes in length) corresponding to Chinese characters.   
    4538To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters.
    4639$CC_{\mbox{ni3}}$ is the bitstream that marks character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks character stream \emph{hao}.
    47 To match a two UTF-8 character sequence \emph{ni3hao}, we need to first create an \emph{Initial} steam that marks the first byte of all the valid characters.
    48 We also need to generate a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte.
     40To match a two UTF-8 character sequence \emph{ni3hao}, we first create an \emph{Initial} steam that marks the first byte of all the valid characters.
     41We also produce a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte.
    4942Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_2$, which marks the positions of the last byte of every character.
    5043An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character.
    5144As illustrated by \emph{Adv}, we find two matches for \emph{ni3} and from these two positions we can start the matching process for the next character \emph{hao}.
    52 The final result stream shows 1 match for Multibyte Sequence ni3hao.
     45The final result stream shows 1 match for the multibyte sequence ni3hao.
    6960$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb`.....1....................`
    7463\caption{Processing of a Multibyte Sequence ni3hao}
    78 \subsection{Predefined Unicode classes}
     67\paragraph{Unicode MatchStar.}
    80 The Unicode Consortium assigned every Unicode character to a general category, such as letter or punctuation.
    81 These categories seldom change, so we precomputed their parallel bitstream equations and statically compiled them into icGrep.
    82 Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category.
    83 This optimization assumes that most input documents only contain code points from a small number of writing systems.
    84 Including equations for code points outside of the used range is unnecessary and will only add to the total running time of the application.
     69The $\mathit{MatchStar}(M, C)$ operation directly implements
     70the operation of finding all positions reachable from a
     71marker bit in $M$ through a character class repetition of
     72an ASCII byte class $C$.   In UTF-8 matching, however,
     73the character class byte streams are marked at their
     74{\em final} positions.   Thus the one bits of a Unicode character
     75class stream are not necessarily contiguous.  This in turn
     76means that the addition operation within the MatchStar
     77operation may terminate prematurely.
     79In order to remedy this problem, \icGrep{} forms two helper bitstreams
     80\emph{Initial} and \emph{NonFinal}.  The {\em Initial} bitstream marks the
     81locations of all single byte characters and the first bytes of all
     82multibyte characters.  Any full match to a multibyte sequence must
     83reach the initial position of the next character. 
     84The {\em NonFinal} bitstream consists of all positions except those
     85that are final positions of UTF-8 sequences. 
     86 It is used to ``fill in the gaps'' in the CC bitstream so that the
     87 MatchStar addition can move through a contiguous sequence of one
     88 bits.  In this way, matching of an arbitrary Unicode character class
     89 $C$ (with a 1 bit set at final positions of any members of the
     90 class),
     91can be implemented using ${\mathit{MatchStar}(M, C) | \mathit{NonFinal})$.
     93\paragraph{Predefined Unicode Classes.}
     94\icGrep{} employs a set of predefined bitstreams that are precompiled
     95into the executable.  These include all bitstreams corresponding
     96to Unicode property expressions such as \verb:\p{Grek}:.
     97Each category potentially contains  many code points, so we further
     98devised an \emph{If Hierarchy} optimization for
     99these properties.
     100This optimization applies whenever an input document contains
     101codepoints from only a small number of writing systems -- the normal case.
    85102The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class
    86103equations for observed code point ranges.
Note: See TracChangeset for help on using the changeset viewer.