Changeset 4555

May 14, 2015, 7:26:46 PM (3 years ago)

Initial revision of UTF-8 matching

3 edited


  • docs/Working/icGrep/background.tex

    r4553 r4555  
    1515practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
    1616there are 1490 characters categorized as upper case and 1841 categorized as lower case.
    17 Rather than explicit listing of all characters of interest, then, it is more
     17Rather than explicitly listing the individual characters, then, it is more
    1818practical to use named character classes, such as \verb:Lu: for upper case letters and
    1919\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
    5050position corresponding to one code-unit position within input character stream.
    5151Each bit stream may be considered an $L$-bit integer, where $L$ is the length
    52 of the input stream.
    53 {\em Character class streams}, such as {\tt [d]} for the stream that marks the
    54 position of ``d'' characters and {\tt [a-z]} for the stream of lower case
    55 ASCII alphabetics are first computed in a fully data-parallel manner.
    56 Then the matching process proper begins taking advantage of bitwise logic
    57 and shifting operations as well as an operation for finding all matches to a
    58 character class repetition known as MatchStar.  At each step of the
    59 process a {\em marker} bit stream identifies the full set of positions
    60 within the input data stream that match the regular expression to this point.
     52of the input stream.  Matching is performed using bitwise logic and addition on these long integers.
    62 Consider the following simplified definition of regular expressions
     54Consider the following simplified definition of regular expressions.
    6355A regular expression may be a character class $C$ or formed by
    6456composition.  If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular
    6658one string from $R$ and one string from $S$, $R|S$ is the alternation
    6759standing for the union of the sets of strings from $R$ and $S$, and
    68 $R*$ is the Kleene-star closure consists of the set of strings
     60$R*$ is the Kleene-star closure denoting the set of strings
    6961formed from 0 or more occurrences of strings from $R$.
    71 Bitwise data parallel search for substring occurrences matching a regular expression
    72 $R$ proceeds as follows.  We start with a marker stream $m_0$ consisting of all
    73 ones, i.e., $m_0 = 2^{L} - 1$. Matches are then computed according to the following
    74 equations.
     63{\em Character class streams}, such as CharClass({\tt [d]}) for the stream that marks the
     64position of ``d'' characters and CharClass({\tt [a-z]}) for the stream of lower case
     65ASCII alphabetics are first computed in a fully data-parallel manner. 
     66Then the search process proceeds by computing {\em marker streams} that mark
     67the positions of matches so far.  Each bit in a marker stream indicates
     68that all characters prior to the given position have been matched so far.
     69The initial marker stream $m_0$ consists of all
     70ones, i.e., $m_0 = 2^{L} - 1$, indicating that all positions are in play.
     71Matching then proceeds in accord with the  following equations.
    7674\mbox{\rm Match}(m, C) & = &  \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\
    8078\mbox{\rm Match}(m, R*) & = &  m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R) \wedge (\neg m), R*)
     80Here, Advance is an operation that advances all markers by a single position. \[\mbox{\rm Advance}(m) = m+m\]
     81MatchStar finds all matches of character class repetitions using in a
     82surprisingly simple manner \cite{cameron2014bitwise}.
     83\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
     84Interestingly, the MatchStar operation also has application to the
     85parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
     86the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
     87For repetitions of other regular expressions, the final recursive equation
     88accounts for the repeated extension of current matches; this
     89equation is applied until no new match positions result.
    114121at which the entire regular expression is matched.
    116 The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}.
    117 \[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
    118 A single bit stream addition operation suffices to find all reachable
    119 positions from marker stream $M$ through character class stream $C$.
    120 Interestingly, the MatchStar operation also has application to the
    121 parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
    122 the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
    124123The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
    125124and run-time libraries that target the SIMD instructions of commodity
  • docs/Working/icGrep/unicode-re.tex

    r4507 r4555  
    24 \paragraph{Unicode Advance.}  As illustrated in Section \ref{sec:background},  a bitwise shift
    25 (Advance) operation is frequently used in shifting a marker stream
    26 from positions matched by one regular expression element to the next.
    27 However, characters are represented by variable-length byte sequences
    28 in UTF-8, so that a simple shift operation is inadequate to implement
    29 the operation of advancing bit stream positions from one Unicode
    30 character to the next. 
     24\paragraph{UTF-8 Byte Classification and Validation.}
     25In UTF-8, bytes are classified as individual ASCII bytes, or as
     26prefixes of two-, three-, or four-byte sequences, or as suffix bytes.
     27In addition, we say that the {\em scope} bytes of a prefix are the
     28immediately following byte positions at which a suffix byte is expected.
     29Two helper streams are also useful.
     30The Initial stream marks ASCII bytes and prefixes of multibyte sequences,
     31while the NonFinal stream marks any position that is not the final
     32position of a Unicode character.
     34\mbox{\rm ASCII} & = & \mbox{\rm CharClass}(\verb:[\x00-\x7F]:) \\
     35\mbox{\rm Prefix2} & = & \mbox{\rm CharClass}(\verb:[\xC2-\xDF]:) \\
     36\mbox{\rm Prefix3} & = & \mbox{\rm CharClass}(\verb:[\xE0-\xEF]:) \\
     37\mbox{\rm Prefix4} & = & \mbox{\rm CharClass}(\verb:[\xF0-\xF4]:) \\
     38\mbox{\rm Suffix} & = & \mbox{\rm CharClass}(\verb:[\x80-\xBF]:) \\
     39\mbox{\rm Scope2} & = & \mbox{\rm Advance}(\mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4}) \\
     40\mbox{\rm Scope3} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4})) \\
     41\mbox{\rm Scope4} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))) \\
     42\mbox{\rm Initial} & = & \mbox{\rm ASCII} \vee \mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4} \\
     43\mbox{\rm NonFinal} & = & \mbox{\rm Prefix} \vee \mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4}) \vee \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))
    32 To address the requirements of Unicode advance, we use
     46\paragraph{Unicode Character Classes.}  Whereas ASCII character classes
     47can be determined by simple bitwise logic at a single code unit position,
     48the UnicodeClass stream for a given class involves logic for up to four positions.
     49By convention, we define UnicodeClass($U$) for a given Unicode character class
     50$U$ to be the stream marking the {\em final} position of
     51Unicode character classes.
     53Using these definitions, it is then possible to extend the matching
     54equations to operate with Unicode character classes as follows.
     57\mbox{\rm Match}(m, U) & = &  \mbox{\rm Advance}(\mbox{\rm ScanThru}(m, \mbox{\rm NonFinal}) \wedge \mbox{\rm UnicodeClass}(U)) \\
     58\mbox{\rm Match}(m, U*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm UnicodeClass}(U) \vee \mbox{\rm NonFinal}) \wedge \mbox{\rm Initial}\\
     61Here, we use
    3362the ScanThru~\cite{cameron2011parallel} operation to move a set of markers
    3463each through the nonfinal bytes of UTF-8 sequences to the final
    3564byte position.
     65\[{\mbox ScanThru}(m, c) = (m + c)  \wedge \neg c\]
    3666Figure~\ref{fig:multibytesequence} shows this
    3767technique in operation in the case of advancing through byte
Note: See TracChangeset for help on using the changeset viewer.