# Changeset 4555

Ignore:
Timestamp:
May 14, 2015, 7:26:46 PM (4 years ago)
Message:

Initial revision of UTF-8 matching

Location:
docs/Working/icGrep
Files:
3 edited

### Legend:

Unmodified
 r4553 practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database, there are 1490 characters categorized as upper case and 1841 categorized as lower case. Rather than explicit listing of all characters of interest, then, it is more Rather than explicitly listing the individual characters, then, it is more practical to use named character classes, such as \verb:Lu: for upper case letters and \verb:Ll: for lower case letters.   Using these names, our search might be rewritten position corresponding to one code-unit position within input character stream. Each bit stream may be considered an $L$-bit integer, where $L$ is the length of the input stream. {\em Character class streams}, such as {\tt [d]} for the stream that marks the position of d'' characters and {\tt [a-z]} for the stream of lower case ASCII alphabetics are first computed in a fully data-parallel manner. Then the matching process proper begins taking advantage of bitwise logic and shifting operations as well as an operation for finding all matches to a character class repetition known as MatchStar.  At each step of the process a {\em marker} bit stream identifies the full set of positions within the input data stream that match the regular expression to this point. of the input stream.  Matching is performed using bitwise logic and addition on these long integers. Consider the following simplified definition of regular expressions Consider the following simplified definition of regular expressions. A regular expression may be a character class $C$ or formed by composition.  If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular one string from $R$ and one string from $S$, $R|S$ is the alternation standing for the union of the sets of strings from $R$ and $S$, and $R*$ is the Kleene-star closure consists of the set of strings $R*$ is the Kleene-star closure denoting the set of strings formed from 0 or more occurrences of strings from $R$. Bitwise data parallel search for substring occurrences matching a regular expression $R$ proceeds as follows.  We start with a marker stream $m_0$ consisting of all ones, i.e., $m_0 = 2^{L} - 1$. Matches are then computed according to the following equations. {\em Character class streams}, such as CharClass({\tt [d]}) for the stream that marks the position of d'' characters and CharClass({\tt [a-z]}) for the stream of lower case ASCII alphabetics are first computed in a fully data-parallel manner. Then the search process proceeds by computing {\em marker streams} that mark the positions of matches so far.  Each bit in a marker stream indicates that all characters prior to the given position have been matched so far. The initial marker stream $m_0$ consists of all ones, i.e., $m_0 = 2^{L} - 1$, indicating that all positions are in play. Matching then proceeds in accord with the  following equations. \begin{eqnarray*} \mbox{\rm Match}(m, C) & = &  \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\ \mbox{\rm Match}(m, R*) & = &  m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R) \wedge (\neg m), R*) \end{eqnarray*} Here, Advance is an operation that advances all markers by a single position. $\mbox{\rm Advance}(m) = m+m$ MatchStar finds all matches of character class repetitions using in a surprisingly simple manner \cite{cameron2014bitwise}. $\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$ Interestingly, the MatchStar operation also has application to the parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}. For repetitions of other regular expressions, the final recursive equation accounts for the repeated extension of current matches; this equation is applied until no new match positions result. \begin{figure} at which the entire regular expression is matched. The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}. $\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$ A single bit stream addition operation suffices to find all reachable positions from marker stream $M$ through character class stream $C$. Interestingly, the MatchStar operation also has application to the parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}. The Parabix toolchain \cite{lin2012parabix} provides a set of compilers and run-time libraries that target the SIMD instructions of commodity
 r4507 } \paragraph{Unicode Advance.}  As illustrated in Section \ref{sec:background},  a bitwise shift (Advance) operation is frequently used in shifting a marker stream from positions matched by one regular expression element to the next. However, characters are represented by variable-length byte sequences in UTF-8, so that a simple shift operation is inadequate to implement the operation of advancing bit stream positions from one Unicode character to the next. \paragraph{UTF-8 Byte Classification and Validation.} In UTF-8, bytes are classified as individual ASCII bytes, or as prefixes of two-, three-, or four-byte sequences, or as suffix bytes. In addition, we say that the {\em scope} bytes of a prefix are the immediately following byte positions at which a suffix byte is expected. Two helper streams are also useful. The Initial stream marks ASCII bytes and prefixes of multibyte sequences, while the NonFinal stream marks any position that is not the final position of a Unicode character. \begin{eqnarray*} \mbox{\rm ASCII} & = & \mbox{\rm CharClass}(\verb:[\x00-\x7F]:) \\ \mbox{\rm Prefix2} & = & \mbox{\rm CharClass}(\verb:[\xC2-\xDF]:) \\ \mbox{\rm Prefix3} & = & \mbox{\rm CharClass}(\verb:[\xE0-\xEF]:) \\ \mbox{\rm Prefix4} & = & \mbox{\rm CharClass}(\verb:[\xF0-\xF4]:) \\ \mbox{\rm Suffix} & = & \mbox{\rm CharClass}(\verb:[\x80-\xBF]:) \\ \mbox{\rm Scope2} & = & \mbox{\rm Advance}(\mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4}) \\ \mbox{\rm Scope3} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4})) \\ \mbox{\rm Scope4} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))) \\ \mbox{\rm Initial} & = & \mbox{\rm ASCII} \vee \mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4} \\ \mbox{\rm NonFinal} & = & \mbox{\rm Prefix} \vee \mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4}) \vee \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4})) \end{eqnarray*} To address the requirements of Unicode advance, we use \paragraph{Unicode Character Classes.}  Whereas ASCII character classes can be determined by simple bitwise logic at a single code unit position, the UnicodeClass stream for a given class involves logic for up to four positions. By convention, we define UnicodeClass($U$) for a given Unicode character class $U$ to be the stream marking the {\em final} position of Unicode character classes. Using these definitions, it is then possible to extend the matching equations to operate with Unicode character classes as follows. \begin{eqnarray*} \mbox{\rm Match}(m, U) & = &  \mbox{\rm Advance}(\mbox{\rm ScanThru}(m, \mbox{\rm NonFinal}) \wedge \mbox{\rm UnicodeClass}(U)) \\ \mbox{\rm Match}(m, U*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm UnicodeClass}(U) \vee \mbox{\rm NonFinal}) \wedge \mbox{\rm Initial}\\ \end{eqnarray*} Here, we use the ScanThru~\cite{cameron2011parallel} operation to move a set of markers each through the nonfinal bytes of UTF-8 sequences to the final byte position. ${\mbox ScanThru}(m, c) = (m + c) \wedge \neg c$ Figure~\ref{fig:multibytesequence} shows this technique in operation in the case of advancing through byte