Changeset 4555 for docs/Working/icGrep/background.tex
 Timestamp:
 May 14, 2015, 7:26:46 PM (4 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/Working/icGrep/background.tex
r4553 r4555 15 15 practical with ASCII, it is less so with Unicode. In the Unicode 7.0 database, 16 16 there are 1490 characters categorized as upper case and 1841 categorized as lower case. 17 Rather than explicit listing of all characters of interest, then, it is more17 Rather than explicitly listing the individual characters, then, it is more 18 18 practical to use named character classes, such as \verb:Lu: for upper case letters and 19 19 \verb:Ll: for lower case letters. Using these names, our search might be rewritten … … 50 50 position corresponding to one codeunit position within input character stream. 51 51 Each bit stream may be considered an $L$bit integer, where $L$ is the length 52 of the input stream. 53 {\em Character class streams}, such as {\tt [d]} for the stream that marks the 54 position of ``d'' characters and {\tt [az]} for the stream of lower case 55 ASCII alphabetics are first computed in a fully dataparallel manner. 56 Then the matching process proper begins taking advantage of bitwise logic 57 and shifting operations as well as an operation for finding all matches to a 58 character class repetition known as MatchStar. At each step of the 59 process a {\em marker} bit stream identifies the full set of positions 60 within the input data stream that match the regular expression to this point. 52 of the input stream. Matching is performed using bitwise logic and addition on these long integers. 61 53 62 Consider the following simplified definition of regular expressions 54 Consider the following simplified definition of regular expressions. 63 55 A regular expression may be a character class $C$ or formed by 64 56 composition. If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular … … 66 58 one string from $R$ and one string from $S$, $RS$ is the alternation 67 59 standing for the union of the sets of strings from $R$ and $S$, and 68 $R*$ is the Kleenestar closure consists ofthe set of strings60 $R*$ is the Kleenestar closure denoting the set of strings 69 61 formed from 0 or more occurrences of strings from $R$. 70 62 71 Bitwise data parallel search for substring occurrences matching a regular expression 72 $R$ proceeds as follows. We start with a marker stream $m_0$ consisting of all 73 ones, i.e., $m_0 = 2^{L}  1$. Matches are then computed according to the following 74 equations. 63 {\em Character class streams}, such as CharClass({\tt [d]}) for the stream that marks the 64 position of ``d'' characters and CharClass({\tt [az]}) for the stream of lower case 65 ASCII alphabetics are first computed in a fully dataparallel manner. 66 Then the search process proceeds by computing {\em marker streams} that mark 67 the positions of matches so far. Each bit in a marker stream indicates 68 that all characters prior to the given position have been matched so far. 69 The initial marker stream $m_0$ consists of all 70 ones, i.e., $m_0 = 2^{L}  1$, indicating that all positions are in play. 71 Matching then proceeds in accord with the following equations. 72 75 73 \begin{eqnarray*} 76 74 \mbox{\rm Match}(m, C) & = & \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\ … … 80 78 \mbox{\rm Match}(m, R*) & = & m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R) \wedge (\neg m), R*) 81 79 \end{eqnarray*} 82 80 Here, Advance is an operation that advances all markers by a single position. \[\mbox{\rm Advance}(m) = m+m\] 81 MatchStar finds all matches of character class repetitions using in a 82 surprisingly simple manner \cite{cameron2014bitwise}. 83 \[\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M\] 84 Interestingly, the MatchStar operation also has application to the 85 parallelized longstream addition itself \cite{cameron2014bitwise}, as well as 86 the bitparallel edit distance algorithm of Myers\cite{myers1999fast}. 87 For repetitions of other regular expressions, the final recursive equation 88 accounts for the repeated extension of current matches; this 89 equation is applied until no new match positions result. 83 90 84 91 \begin{figure} … … 114 121 at which the entire regular expression is matched. 115 122 116 The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}.117 \[\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M\]118 A single bit stream addition operation suffices to find all reachable119 positions from marker stream $M$ through character class stream $C$.120 Interestingly, the MatchStar operation also has application to the121 parallelized longstream addition itself \cite{cameron2014bitwise}, as well as122 the bitparallel edit distance algorithm of Myers\cite{myers1999fast}.123 124 123 The Parabix toolchain \cite{lin2012parabix} provides a set of compilers 125 124 and runtime libraries that target the SIMD instructions of commodity
Note: See TracChangeset
for help on using the changeset viewer.