source: docs/Working/re/re-main.tex.backup @ 3149

Last change on this file since 3149 was 3149, checked in by ksherdy, 6 years ago

Added references to additional papers, Agrep, Shift-or. Text updates.

File size: 11.5 KB
Line 
1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3
4%opening
5\title{Fast Regular Expression Matching with Bit-parallel Data Streams}
6\author{
7{Robert D. Cameron} \\
8\and
9{Kenneth S. Herdy} \\
10\and
11{Ben Hull} \\
12\and
13{Thomas C. Shermer} \\
14\\School of Computing Science
15\\Simon Fraser University
16}
17\begin{document}
18
19\date{}
20\maketitle
21
22\begin{abstract}
23
24An parallel regular expression matching pattern method is
25introduced and compared with the state of the art in software tools designed for
26efficient on-line search. %such as the {\em grep} family pattern matching tools.
27The method is based on the concept of bit-parallel data streams,
28in which parallel streams of bits are formed such
29that each stream comprises bits in one-to-one correspondence with the
30character code units of a source data stream.
31
32An implementation of the method in the form of a regular expression
33compiler is discussed. The compiler accepts a regular expression and
34forms unbounded bit-parallel data stream operations. Bit-parallel operations
35are then transformed into a low-level C-based implementation for compilation into native
36pattern matching applications. These low-level C-based implementations take advantage of
37the SIMD (single-instruction multiple-data) capabilities of commodity
38processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
39On processors supporting W-bit addition operations, the method processes W source characters
40in parallel and performs up to W finite state transitions per clock cycle.
41
42We introduce a new bit-parallel scanning primitive, {\em Match Star}.
43which performs parallel Kleene closure over character classes
44and without eliminates backtracking. % expand and rephrase description of Match Star
45
46We evaluate the performance of our method in comparison with several widely known {\em grep}
47family implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep},
48and regular expression engines such as {\em Google's re2}.
49Performance results are analyzed using the performance monitoring counters
50of commodity hardware. Overall, our results demonstrate a dramatic speed-up
51over publically available alternatives.
52
53\end{abstract}
54
55\section{Introduction}
56\label{Introduction}
57%\input{introduction.tex}
58
59Regular expresssion matching is an extensively studied problem with application to
60text processing and bioinformatics and with numerous algorithms
61and software tools developed to the address the particular
62processing demands. % reword
63
64The pattern matching problem can be
65stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
66find all the text positions of T that start an occurrence of P. Alternatively,
67one may want all the final positions of occurrences. Some applications require
68slightly different output such as the line that matches the pattern.
69
70A pattern P can be a simple string, but it can also be a regular expression.
71A regular expression, is an expression that specifies a set of strings
72and is composed of (i) simple strings and (ii) the
73union, concatenation and Kleene closure of other regular expressions.
74To avoid parentheses it is assumed that the Kleene star has the highest priority,
75next concatenation and then alternation, however, most formalisms provides grouping
76operators to allow the definition of scope and operator precedence.
77Readers unfamiliar with the concept of regular expression matching are referred
78classical texts such as \cite{aho2007}.
79
80Regular expression matching is commonly performed using a wide variety of
81publically available software tools for on-line pattern matching. For instance,
82UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
83expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
84agrep, and nrgrep are widely known and considered as
85the fastest regular expression matching tools in practice \cite{navarro2000}.
86and are of particular interest to this study.
87
88% simple patterns, extended patterns, regular expressions
89
90% motivation / previous work
91Although tradi finite state machine methods used in the scanning and parsing of
92text streams is considered to be the hardest of the “13 dwarves” to parallelize
93[1], parallel bitstream technology shows considerable promise for these types of
94applications [3, 4]. In this approach, character streams are processed W positions
95at a time using the W-bit SIMD registers commonly found on commodity processors
96(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
97first slicing the byte streams into eight separate basis bitstreams, one for each bit
98position within the byte. These basis bitstreams are then combined with bitwise
99logic and shifting operations to compute further parallel bit streams of interest.
100
101We further increase the parallelism in our methods by introducing a new parallel
102scanning primitive which we have coined Match Star. Match Star returns all matches
103in a single operation and eliminates backtracking
104when a partially successful search path fails.
105
106The remainder of this paper is organized as follows.
107Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
108Section~\ref{Background} presents background material on classical and state-of-the-art approaches
109to high performance regular expression matching. In addition, this section provides insight into the
110efficiency of traditional on-line pattern matching tools. Next,
111Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
112Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
113Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
114presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
115Gnu grep, agrep, and nr-grep, and re2. Section~\ref{Conclusion} concludes the paper.
116
117The fundamental contribution of this paper is fully general approach to regular expression matching
118using bit-parallel data streams. The algorithmic aspects of this paper build upon
119the fundamental concepts of our previous work
120in some cases \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
121 Individual contributions include:
122\begin{itemize}
123\item compilation of regular expressions into unbounded bit-parallel data stream equations;
124\item documentation of character classes compilation into bit-parallel character class data streams;
125% \item methods to select optimal subpatterns;
126\item bit-parallel and backtrack-free scanning primitive termed Match Star;
127\item bit-parallel data stream support for unicode.
128\end{itemize}
129
130\section{Basic Concepts}
131\label{Basic Concepts}
132We define the notations and basic concepts used throughout this paper.
133
134\subsection{Notation}
135We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$
136be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
137be a text of length n defined over a finite alphabet $sigma$ of size $alpha$.
138The task of regular expression matching
139is to find all the text positions of T that follow an occurrence of P. We use
140C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$, $\ll{}$, and $\lgg{}$
141represent bitwise NOT, OR, AND, XOR, k-bit left shift and k-bit right shift, respectively.
142
143\subsection{Regular Expressions}
144
145% TODO
146
147\section{Background}
148\label{Background}
149%\input{background.tex}
150
151\subsection{Classical Methods}
152
153\subsection{Regular Expression and Finite Automata}
154
155The origins of regular expression matching date back to automata theory
156and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
157Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
158to nondeterministic finite automata (NFA) for regular expression matching.
159Following Thompson's approach, a regular expression of length m is first converted
160to an NFA with O(m) nodes. It is possible to search a text of length n using the
161NFA directly in O(mn) worst case time. Often, a more efficient choice
162is to convert the NFA into a DFA. A DFA has only a single active state and
163allows to search the text at O(n) worst-case optimal. It is well known that the
164conversion of the NFA to the DFA may result in the state explosion problem.
165That is the resultant DFA may have O($2^m$) states.
166
167Thompson's original work marked the beginning of a long line of
168regular expression implementations that
169process an input string, character-at-a-time, and that transition patterm matching state
170based on the current state and the next character read.
171
172The Boyer-Moore family of algorithms \cite{boyer1977fast} , Horspool, ... skip characters
173
174Suffix automata (BDM)
175
176The ideas presented up to now aim at a good implementation of the automa-
177ton, but they must inspect all the text characters. In many cases, however, the
178regular expression involves sets of relatively long substrings that must appear
179for the regular expression to match.
180
181\subsection{Bit-parallel Simulation of Automata}
182
183Define bit-parallelism \cite{navarro2002flexible}
184
185Shift-Or \cite{}
186
187Backward Dawg Matching (BDM) \cite{navarro1998bit}
188
189Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
190
191Skip characters pattern length (occurence frequency and length).
192
193
194\section{Bit-parallel Data Streams}
195\label{Bit-parallel Data Streams}
196
197The bit-parallel data streams use the wide
198SIMD registers commonly found on commodity processors
199to process  byte positions at a time using
200bitwise logic, shifting and other operations.
201
202A signficant advantage of the bit-parallel
203data stream method over other
204pattern matching methods that rely on
205bit-parallel automata simulation
206is the potential to skip full register width
207number of characters in low occurence frequency text. % reword
208
209
210Skip characters register width.
211
212\subsection{Software Tools}
213
214%Thompson created the first grep (UNIX grep) as a standalone adaptation
215%of the regular expression parser he had written for the UNIX ed utility.
216%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
217%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
218%time it took to build a complete finite automaton for the regular expression before it could even start searching.
219%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
220%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
221%It took zero set-up time and just one additional test in the inner loop.
222%http://pages.cs.wisc.edu/~mdant/cs520_4.html
223
224%Given a regular expression R and a test T the regular expression matching
225%problem finds all ending position of substrings in Q that matches a string in
226%the language denoted by R.
227%The behaviour of Gnu grep, agrep, and nr-grep are differ ...
228%Gnu grep
229%agrep
230%nr-grep
231%re2
232
233\subsection{Match Star}
234
235%Wikipedia
236Backtracking is a general algorithm for finding all solutions to some computational problem,
237that incrementally builds candidates to the solutions.
238
239\section{Compiler Technology}
240\label{Compiler Technology}
241
242\section{Methodology}
243\label{Methodology}
244%\input{methodology.tex}
245
246We compare the performance of our bit-parallel data stream techniques against
247Gnu grep, agrep, nr-grep, and re2.
248
249
250
251\section{Experimental Results}
252\label{Experimental Results}
253%\input{results.tex}
254
255\section{Conclusion}
256\label{Conclusion}
257%\input{conclusion.tex}
258
259{
260  \bibliographystyle{acm}
261  \bibliography{reference}
262}
263
264\end{document}
Note: See TracBrowser for help on using the repository browser.