source: docs/Working/icGrep/background.tex @ 4490

Last change on this file since 4490 was 4490, checked in by cameron, 4 years ago

Intro and Background, remove section 3

File size: 7.2 KB
Line 
1\section{Background} \label{sec:background}
2
3%\subsection{Unicode Regular Expression Requirements}
4\paragraph{Unicode Regular Expressions.}
5Traditional regular expression syntax is oriented towards
6string search using regular expressions over ASCII or extended-ASCII byte sequences.
7A grep search for a line beginning with a capitalized word might use the
8pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
9matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
10that matches any single character in the contiguous range of characters from A through Z,
11while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
12case ASCII letters.   
13
14While explicit listing of characters of interest is
15practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
16there are 1490 characters categorized as upper case and 1841 categorized as lower case.
17Rather than explicit listing of all characters of interest, then, it is more
18practical to use named character classes, such as \verb:Lu: for upper case letters and
19\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
20to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or
21``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
22The Unicode consortium has defined an extensive list of named properties that can
23be used in regular expressions.
24
25Beyond named properties, Unicode Technical Standard \#18 defines additional
26requirements for Unicode regular expressions, at three levels of
27complexity~\cite{davis2012unicode}.   Level 1 generally relates to
28properties expressed in terms of individual Unicode codepoints, while level 2
29introduces complexities due to codepoint sequences that form grapheme clusters,
30and level 3 relates to tailored locale-specific support.
31We consider only Unicode level 1 requirements
32in this paper, as most grep implementations are incomplete with respect
33the requirements even at this level.   
34The additional level 1 regular expression requirements primarily relate to larger
35classes of characters that are used in identifying line breaks, word breaks
36and case-insensitive matching.   
37Beyond this, there is one important syntactic extension: the ability to refine
38character class specifications using set
39intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
40denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
41denotes the class of all non-ASCII lower case letters.
42
43%\subsection{Bitwise Data Parallel Regular Expression Matching}
44\paragraph{Bitwise Data Parallel Matching.}
45Regular expression search using bitwise data parallelism has been
46recently introduced and shown to considerably outperform methods
47based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the
48method is 100\% data parallel, considering all input positions in a file
49simultaneously.   A set of parallel bit streams is computed, with each bit
50position corresponding to one code-unit position within input character stream.
51{\em Character class streams}, such as {\tt [d]} for the stream that marks the
52position of ``d'' characters and {\tt [a-z]} for the stream of lower case
53ASCII alphabetics are first computed in a fully data-parallel manner.
54Then the matching process proper begins taking advance of bitwise logic
55and shifting operations as well as an operation for finding all matches to a
56character class repetition known as MatchStar.  At each step of the
57process a {\em marker} bit stream identifying the full set of positions
58within the input data stream that match the regular expression to this point.
59
60
61\begin{figure}
62\begin{center}
63\begin{tabular}{cr}\\
64input data  & \verb`dead dreams defeated.`\\
65$M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb`.1..1.1......1......1`\\
66$M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb`.1111.111111.11111111`\\
67$M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb`..1.....1.....1.1..1.`\\
68$M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb`....................1`
69\end{tabular}
70\end{center}
71\caption{Matching with Bitwise Data Parallelism}\label{fig:bitwisematch}
72\end{figure}
73
74For example,
75Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
76against some input text using bitwise methods.  In this diagram we use periods to denote
770 bits so that the 1 bits stand out.
78In the first step the character class stream
79{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
80Four matches indicated by marker bits are now in play simultaneously. 
81The next step applies the  MatchStar operation to find all the matches that may then be
82reached with the Kleene-* repetition
83{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
84is no need to consider these matches one at a time using lazy or greedy matching strategies.
85Rather, the full marker stream $M_3$ of remaining possibilites after matching {\tt [e]} is easily
86computed using a shift and bitwise and.
87The final step produces marker stream $M_4$ indicating that single position
88at which the entire regular expression is matched.
89
90The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}.
91\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
92A single bit stream addition operation suffices to find all reachable
93positions from marker stream $M$ through character class stream $C$.
94Interestingly, the MatchStar operation also has application to the
95parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
96the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
97
98The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
99and run-time libraries that target the SIMD instructions of commodity
100processors (e.g., SSE or AVX instructions on x86-64 architecture).
101Input is processed in blocks of code units equal to the size in bits
102of the SIMD registers, for example, 128 bytes at a time using 128-bit registers.
103Using the Parabix facilities, the bitwise data parallel approach to
104regular expression search was shown to deliver substantial performance
105acceleration for traditional ASCII regular expression matching tasks,
106often 5X or better \cite{cameron2014bitwise}.
107
108
109\comment{
110
111\subsection{LLVM}
112
113The LLVM compiler infrastructure is a set of modular compiler components and tools
114organized around a powerful generic intermediate representation (LLVM IR) that is
115agnostic with respect to source language and code-generation targets.   
116Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
117LLVM is now an open-source codebase supported by a broad community of
118researchers, developers and commercial organizations.   
119
120LLVM features a flexible multi-stage compilation structure that can be organized in
121passes in many ways, including fully dynamic just-in-time compilation.   Of particular
122importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of
123arbitrary-width integers, well suited to code generation
124targetting the SIMD integer instructions of commodity processors.
125
126
127
128\subsection{Parabix Regular Expression Matching}
129}
Note: See TracBrowser for help on using the repository browser.