source: docs/Working/icGrep/background.tex @ 4553

Last change on this file since 4553 was 4553, checked in by cameron, 4 years ago

matching equations

File size: 8.6 KB
1\section{Background} \label{sec:background}
3%\subsection{Unicode Regular Expression Requirements}
4\paragraph{Unicode Regular Expressions.}
5Traditional regular expression syntax is oriented towards
6string search using regular expressions over ASCII or extended-ASCII byte sequences.
7A grep search for a line beginning with a capitalized word might use the
8pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
9matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
10that matches any single character in the contiguous range of characters from A through Z,
11while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
12case ASCII letters.   
14While explicit listing of characters of interest is
15practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
16there are 1490 characters categorized as upper case and 1841 categorized as lower case.
17Rather than explicit listing of all characters of interest, then, it is more
18practical to use named character classes, such as \verb:Lu: for upper case letters and
19\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
20to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or
21``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
22The Unicode consortium has defined an extensive list of named properties that can
23be used in regular expressions.
25Beyond named properties, Unicode Technical Standard \#18 defines additional
26requirements for Unicode regular expressions, at three levels of
27complexity~\cite{davis2012unicode}.   Level 1 generally relates to
28properties expressed in terms of individual Unicode codepoints, while level 2
29introduces complexities due to codepoint sequences that form grapheme clusters,
30and level 3 relates to tailored locale-specific support.
31We consider only Unicode level 1 requirements
32in this paper, as most grep implementations are incomplete with respect
33the requirements even at this level.   
34The additional level 1 regular expression requirements primarily relate to larger
35classes of characters that are used in identifying line breaks, word breaks
36and case-insensitive matching.   
37Beyond this, there is one important syntactic extension: the ability to refine
38character class specifications using set
39intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
40denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
41denotes the class of all non-ASCII lower case letters.
43%\subsection{Bitwise Data Parallel Regular Expression Matching}
44\paragraph{Bitwise Data Parallel Matching.}
45Regular expression search using bitwise data parallelism has been
46recently introduced and shown to considerably outperform methods
47based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the
48method is 100\% data parallel, considering all input positions in a file
49simultaneously.   A set of parallel bit streams is computed, with each bit
50position corresponding to one code-unit position within input character stream.
51Each bit stream may be considered an $L$-bit integer, where $L$ is the length
52of the input stream.
53{\em Character class streams}, such as {\tt [d]} for the stream that marks the
54position of ``d'' characters and {\tt [a-z]} for the stream of lower case
55ASCII alphabetics are first computed in a fully data-parallel manner.
56Then the matching process proper begins taking advantage of bitwise logic
57and shifting operations as well as an operation for finding all matches to a
58character class repetition known as MatchStar.  At each step of the
59process a {\em marker} bit stream identifies the full set of positions
60within the input data stream that match the regular expression to this point.
62Consider the following simplified definition of regular expressions
63A regular expression may be a character class $C$ or formed by
64composition.  If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular
65expression standing for the set of all strings formed by concatenating
66one string from $R$ and one string from $S$, $R|S$ is the alternation
67standing for the union of the sets of strings from $R$ and $S$, and
68$R*$ is the Kleene-star closure consists of the set of strings
69formed from 0 or more occurrences of strings from $R$.
71Bitwise data parallel search for substring occurrences matching a regular expression
72$R$ proceeds as follows.  We start with a marker stream $m_0$ consisting of all
73ones, i.e., $m_0 = 2^{L} - 1$. Matches are then computed according to the following
76\mbox{\rm Match}(m, C) & = &  \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\
77\mbox{\rm Match}(m, RS) & = &  \mbox{\rm Match}(\mbox{\rm Match}(m, R), S) \\
78\mbox{\rm Match}(m, R|S) & = &  \mbox{\rm Match}(m, R) \vee \mbox{\rm Match}(m, S)) \\
79\mbox{\rm Match}(m, C*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm CharClass}(C)) \\
80\mbox{\rm Match}(m, R*) & = &  m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R) \wedge (\neg m), R*)
88input data  & \verb`dead dreams defeated.`\\
89$M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb`.1..1.1......1......1`\\
90$M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb`.1111.111111.11111111`\\
91$M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb`..1.....1.....1.1..1.`\\
92$M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb`....................1`
96\caption{Matching {\tt d[a-z]*ed} Using Bitwise Data Parallelism}\label{fig:bitwisematch}
100For example,
101Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
102against some input text using bitwise methods.  In this diagram we use periods to denote
1030 bits so that the 1 bits stand out.
104In the first step the character class stream
105{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
106Five matches indicated by marker bits are now in play simultaneously. 
107The next step applies the  MatchStar operation to find all the matches that may then be
108reached with the Kleene-* repetition
109{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
110is no need to consider these matches one at a time using lazy or greedy matching strategies.
111Rather, the full marker stream $M_3$ of remaining possibilities after matching {\tt [e]} is easily
112computed using bitwise logic and shift.
113The final step produces marker stream $M_4$ indicating the single position
114at which the entire regular expression is matched.
116The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}.
117\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
118A single bit stream addition operation suffices to find all reachable
119positions from marker stream $M$ through character class stream $C$.
120Interestingly, the MatchStar operation also has application to the
121parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
122the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
124The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
125and run-time libraries that target the SIMD instructions of commodity
126processors (e.g., SSE or AVX instructions on x86-64 architecture).
127Input is processed in blocks of code units equal to the size in bits
128of the SIMD registers, for example, 128 bytes at a time using 128-bit registers.
129Using the Parabix facilities, the bitwise data parallel approach to
130regular expression search was shown to deliver substantial performance
131acceleration for traditional ASCII regular expression matching tasks,
132often 5$\times$ or better \cite{cameron2014bitwise}.
139The LLVM compiler infrastructure is a set of modular compiler components and tools
140organized around a powerful generic intermediate representation (LLVM IR) that is
141agnostic with respect to source language and code-generation targets.   
142Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
143LLVM is now an open-source codebase supported by a broad community of
144researchers, developers and commercial organizations.   
146LLVM features a flexible multi-stage compilation structure that can be organized in
147passes in many ways, including fully dynamic just-in-time compilation.   Of particular
148importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of
149arbitrary-width integers, well suited to code generation
150targetting the SIMD integer instructions of commodity processors.
154\subsection{Parabix Regular Expression Matching}
Note: See TracBrowser for help on using the repository browser.