source: docs/Working/icGrep/background.tex @ 4555

Last change on this file since 4555 was 4555, checked in by cameron, 4 years ago

Initial revision of UTF-8 matching

File size: 8.7 KB
1\section{Background} \label{sec:background}
3%\subsection{Unicode Regular Expression Requirements}
4\paragraph{Unicode Regular Expressions.}
5Traditional regular expression syntax is oriented towards
6string search using regular expressions over ASCII or extended-ASCII byte sequences.
7A grep search for a line beginning with a capitalized word might use the
8pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
9matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
10that matches any single character in the contiguous range of characters from A through Z,
11while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
12case ASCII letters.   
14While explicit listing of characters of interest is
15practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
16there are 1490 characters categorized as upper case and 1841 categorized as lower case.
17Rather than explicitly listing the individual characters, then, it is more
18practical to use named character classes, such as \verb:Lu: for upper case letters and
19\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
20to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or
21``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
22The Unicode consortium has defined an extensive list of named properties that can
23be used in regular expressions.
25Beyond named properties, Unicode Technical Standard \#18 defines additional
26requirements for Unicode regular expressions, at three levels of
27complexity~\cite{davis2012unicode}.   Level 1 generally relates to
28properties expressed in terms of individual Unicode codepoints, while level 2
29introduces complexities due to codepoint sequences that form grapheme clusters,
30and level 3 relates to tailored locale-specific support.
31We consider only Unicode level 1 requirements
32in this paper, as most grep implementations are incomplete with respect
33the requirements even at this level.   
34The additional level 1 regular expression requirements primarily relate to larger
35classes of characters that are used in identifying line breaks, word breaks
36and case-insensitive matching.   
37Beyond this, there is one important syntactic extension: the ability to refine
38character class specifications using set
39intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
40denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
41denotes the class of all non-ASCII lower case letters.
43%\subsection{Bitwise Data Parallel Regular Expression Matching}
44\paragraph{Bitwise Data Parallel Matching.}
45Regular expression search using bitwise data parallelism has been
46recently introduced and shown to considerably outperform methods
47based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the
48method is 100\% data parallel, considering all input positions in a file
49simultaneously.   A set of parallel bit streams is computed, with each bit
50position corresponding to one code-unit position within input character stream.
51Each bit stream may be considered an $L$-bit integer, where $L$ is the length
52of the input stream.  Matching is performed using bitwise logic and addition on these long integers.
54Consider the following simplified definition of regular expressions.
55A regular expression may be a character class $C$ or formed by
56composition.  If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular
57expression standing for the set of all strings formed by concatenating
58one string from $R$ and one string from $S$, $R|S$ is the alternation
59standing for the union of the sets of strings from $R$ and $S$, and
60$R*$ is the Kleene-star closure denoting the set of strings
61formed from 0 or more occurrences of strings from $R$.
63{\em Character class streams}, such as CharClass({\tt [d]}) for the stream that marks the
64position of ``d'' characters and CharClass({\tt [a-z]}) for the stream of lower case
65ASCII alphabetics are first computed in a fully data-parallel manner. 
66Then the search process proceeds by computing {\em marker streams} that mark
67the positions of matches so far.  Each bit in a marker stream indicates
68that all characters prior to the given position have been matched so far.
69The initial marker stream $m_0$ consists of all
70ones, i.e., $m_0 = 2^{L} - 1$, indicating that all positions are in play.
71Matching then proceeds in accord with the  following equations.
74\mbox{\rm Match}(m, C) & = &  \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\
75\mbox{\rm Match}(m, RS) & = &  \mbox{\rm Match}(\mbox{\rm Match}(m, R), S) \\
76\mbox{\rm Match}(m, R|S) & = &  \mbox{\rm Match}(m, R) \vee \mbox{\rm Match}(m, S)) \\
77\mbox{\rm Match}(m, C*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm CharClass}(C)) \\
78\mbox{\rm Match}(m, R*) & = &  m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R) \wedge (\neg m), R*)
80Here, Advance is an operation that advances all markers by a single position. \[\mbox{\rm Advance}(m) = m+m\]
81MatchStar finds all matches of character class repetitions using in a
82surprisingly simple manner \cite{cameron2014bitwise}.
83\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
84Interestingly, the MatchStar operation also has application to the
85parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
86the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
87For repetitions of other regular expressions, the final recursive equation
88accounts for the repeated extension of current matches; this
89equation is applied until no new match positions result.
95input data  & \verb`dead dreams defeated.`\\
96$M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb`.1..1.1......1......1`\\
97$M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb`.1111.111111.11111111`\\
98$M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb`..1.....1.....1.1..1.`\\
99$M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb`....................1`
103\caption{Matching {\tt d[a-z]*ed} Using Bitwise Data Parallelism}\label{fig:bitwisematch}
107For example,
108Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
109against some input text using bitwise methods.  In this diagram we use periods to denote
1100 bits so that the 1 bits stand out.
111In the first step the character class stream
112{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
113Five matches indicated by marker bits are now in play simultaneously. 
114The next step applies the  MatchStar operation to find all the matches that may then be
115reached with the Kleene-* repetition
116{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
117is no need to consider these matches one at a time using lazy or greedy matching strategies.
118Rather, the full marker stream $M_3$ of remaining possibilities after matching {\tt [e]} is easily
119computed using bitwise logic and shift.
120The final step produces marker stream $M_4$ indicating the single position
121at which the entire regular expression is matched.
123The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
124and run-time libraries that target the SIMD instructions of commodity
125processors (e.g., SSE or AVX instructions on x86-64 architecture).
126Input is processed in blocks of code units equal to the size in bits
127of the SIMD registers, for example, 128 bytes at a time using 128-bit registers.
128Using the Parabix facilities, the bitwise data parallel approach to
129regular expression search was shown to deliver substantial performance
130acceleration for traditional ASCII regular expression matching tasks,
131often 5$\times$ or better \cite{cameron2014bitwise}.
138The LLVM compiler infrastructure is a set of modular compiler components and tools
139organized around a powerful generic intermediate representation (LLVM IR) that is
140agnostic with respect to source language and code-generation targets.   
141Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
142LLVM is now an open-source codebase supported by a broad community of
143researchers, developers and commercial organizations.   
145LLVM features a flexible multi-stage compilation structure that can be organized in
146passes in many ways, including fully dynamic just-in-time compilation.   Of particular
147importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of
148arbitrary-width integers, well suited to code generation
149targetting the SIMD integer instructions of commodity processors.
153\subsection{Parabix Regular Expression Matching}
Note: See TracBrowser for help on using the repository browser.