# Changeset 4555 for docs/Working/icGrep/background.tex

Ignore:
Timestamp:
May 14, 2015, 7:26:46 PM (4 years ago)
Message:

Initial revision of UTF-8 matching

File:
1 edited

### Legend:

Unmodified
 r4553 practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database, there are 1490 characters categorized as upper case and 1841 categorized as lower case. Rather than explicit listing of all characters of interest, then, it is more Rather than explicitly listing the individual characters, then, it is more practical to use named character classes, such as \verb:Lu: for upper case letters and \verb:Ll: for lower case letters.   Using these names, our search might be rewritten position corresponding to one code-unit position within input character stream. Each bit stream may be considered an $L$-bit integer, where $L$ is the length of the input stream. {\em Character class streams}, such as {\tt [d]} for the stream that marks the position of d'' characters and {\tt [a-z]} for the stream of lower case ASCII alphabetics are first computed in a fully data-parallel manner. Then the matching process proper begins taking advantage of bitwise logic and shifting operations as well as an operation for finding all matches to a character class repetition known as MatchStar.  At each step of the process a {\em marker} bit stream identifies the full set of positions within the input data stream that match the regular expression to this point. of the input stream.  Matching is performed using bitwise logic and addition on these long integers. Consider the following simplified definition of regular expressions Consider the following simplified definition of regular expressions. A regular expression may be a character class $C$ or formed by composition.  If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular one string from $R$ and one string from $S$, $R|S$ is the alternation standing for the union of the sets of strings from $R$ and $S$, and $R*$ is the Kleene-star closure consists of the set of strings $R*$ is the Kleene-star closure denoting the set of strings formed from 0 or more occurrences of strings from $R$. Bitwise data parallel search for substring occurrences matching a regular expression $R$ proceeds as follows.  We start with a marker stream $m_0$ consisting of all ones, i.e., $m_0 = 2^{L} - 1$. Matches are then computed according to the following equations. {\em Character class streams}, such as CharClass({\tt [d]}) for the stream that marks the position of d'' characters and CharClass({\tt [a-z]}) for the stream of lower case ASCII alphabetics are first computed in a fully data-parallel manner. Then the search process proceeds by computing {\em marker streams} that mark the positions of matches so far.  Each bit in a marker stream indicates that all characters prior to the given position have been matched so far. The initial marker stream $m_0$ consists of all ones, i.e., $m_0 = 2^{L} - 1$, indicating that all positions are in play. Matching then proceeds in accord with the  following equations. \begin{eqnarray*} \mbox{\rm Match}(m, C) & = &  \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\ \mbox{\rm Match}(m, R*) & = &  m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R) \wedge (\neg m), R*) \end{eqnarray*} Here, Advance is an operation that advances all markers by a single position. $\mbox{\rm Advance}(m) = m+m$ MatchStar finds all matches of character class repetitions using in a surprisingly simple manner \cite{cameron2014bitwise}. $\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$ Interestingly, the MatchStar operation also has application to the parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}. For repetitions of other regular expressions, the final recursive equation accounts for the repeated extension of current matches; this equation is applied until no new match positions result. \begin{figure} at which the entire regular expression is matched. The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}. $\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$ A single bit stream addition operation suffices to find all reachable positions from marker stream $M$ through character class stream $C$. Interestingly, the MatchStar operation also has application to the parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}. The Parabix toolchain \cite{lin2012parabix} provides a set of compilers and run-time libraries that target the SIMD instructions of commodity