# source:docs/Working/re/re-main.tex.backup@3485

Last change on this file since 3485 was 3154, checked in by ksherdy, 6 years ago

Added references. Cleaned up background and abstract.

File size: 12.2 KB
Line
1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3
4%opening
5\title{Fast Regular Expression Matching with Bit-parallel Data Streams}
6\author{
7{Robert D. Cameron} \\
8\and
9{Kenneth S. Herdy} \\
10\and
11{Ben Hull} \\
12\and
13{Thomas C. Shermer} \\
14\\School of Computing Science
15\\Simon Fraser University
16}
17\begin{document}
18
19\date{}
20\maketitle
21
22\begin{abstract}
23
24An parallel regular expression matching pattern method is
25introduced and compared with the state of the art in software tools designed for
26efficient on-line search. %such as the {\em grep} family pattern matching tools.
27The method is based on the concept of bit-parallel data streams,
28in which parallel streams of bits are formed such
29that each stream comprises bits in one-to-one correspondence with the
30character code units of a source data stream.
31
32An implementation of the method in the form of a regular expression
33compiler is discussed. The compiler accepts a regular expression and
34forms unbounded bit-parallel data stream operations. Bit-parallel operations
35are then transformed into a low-level C-based implementation for compilation into native
36pattern matching applications. These low-level C-based implementations take advantage of
37the SIMD (single-instruction multiple-data) capabilities of commodity
38processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
39On processors supporting W-bit addition operations, the method processes W source characters
40in parallel and performs up to W finite state transitions per clock cycle.
41
42We introduce a new bit-parallel scanning primitive, {\em Match Star}.
43which performs parallel Kleene closure over character classes
44and without backtracking. % expand and rephrase description of Match Star
45
46We evaluate the performance of our method in comparison with several widely known {\em grep}
47family implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep},
48and regular expression engines such as {\em Google's re2}.
49Performance results are analyzed using the performance monitoring counters
50of commodity hardware. Overall, our results demonstrate a dramatic speed-up
51over publically available alternatives.
52
53\end{abstract}
54
55\section{Introduction}
56\label{Introduction}
57%\input{introduction.tex}
58
59Regular expresssion matching is an extensively studied problem with application to
60text processing and bioinformatics and with numerous algorithms
61and software tools developed to the address the particular
62processing demands. % reword
63
64The pattern matching problem can be
65stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
66find all the text positions of T that start an occurrence of P. Alternatively,
67one may want all the final positions of occurrences. Some applications require
68slightly different output such as the line that matches the pattern.
69
70A pattern P can be a simple string, but it can also be a regular expression.
71A regular expression, is an expression that specifies a set of strings
72and is composed of (i) simple strings and (ii) the
73union, concatenation and Kleene closure of other regular expressions.
74To avoid parentheses it is assumed that the Kleene star has the highest priority,
75next concatenation and then alternation, however, most formalisms provides grouping
76operators to allow the definition of scope and operator precedence.
77Readers unfamiliar with the concept of regular expression matching are referred
78classical texts such as \cite{aho2007}.
79
80Regular expression matching is commonly performed using a wide variety of
81publically available software tools for on-line pattern matching. For instance,
82UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
83expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
84agrep, and nrgrep are widely known and considered as
85the fastest regular expression matching tools in practice \cite{navarro2000}.
86and are of particular interest to this study.
87
88% simple patterns, extended patterns, regular expressions
89
90% motivation / previous work
91Although tradi finite state machine methods used in the scanning and parsing of
92text streams is considered to be the hardest of the â13 dwarvesâ to parallelize
93[1], parallel bitstream technology shows considerable promise for these types of
94applications [3, 4]. In this approach, character streams are processed W positions
95at a time using the W-bit SIMD registers commonly found on commodity processors
96(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
97first slicing the byte streams into eight separate basis bitstreams, one for each bit
98position within the byte. These basis bitstreams are then combined with bitwise
99logic and shifting operations to compute further parallel bit streams of interest.
100
101We further increase the potential parallelism in our approach by introducing a new parallel
102scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star
103
104The remainder of this paper is organized as follows.
105Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
106Section~\ref{Background} presents background material on classical automata-based approaches
107to regular expression matching as well as describes the efficient {\em grep} family utilities
108and regular expression pattern matching software libraries.
109Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
110Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
111Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
112presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
113several software tools. Section~\ref{Conclusion} concludes the paper.
114
115The fundamental contribution of this paper is fully general approach to regular expression matching
116using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
117the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
118Specific contributions include:
119\begin{itemize}
120\item compilation of regular expressions into unbounded bit-parallel data stream equations;
121\item documentation of character classes compilation into bit-parallel character class data streams;
122\item the Match Star parallel scanning primitive;
123\item efficient support for unicode characters.
124\end{itemize}
125
126\section{Basic Concepts}
127\label{Basic Concepts}
128In this section, we define the notation and basic concepts used throughout this paper.
129
130\subsection{Notation}
131We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$
132be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
133be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$.
134The goal of simple pattern expression matching is to find all the text positions of T that follow
135an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
136
137We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
138represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent
139logical left shift, and logical right shift, respectively and are further modulated by
140the number of bit positions a given value shall be shifted by, for example shift left by n''.
141Vacant bit-positions are filled in with zeroes.
142
143\subsection{Regular Expressions}
144
145% Define regular expressions (a recursive def), character classes, bounded repetition
146
147\section{Background}
148\label{Background}
149%\input{background.tex}
150
151\subsection{Classical Methods}
152
153\subsection{Regular Expression and Finite Automata}
154
155The origins of regular expression matching date back to automata theory
156and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
157Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
158to nondeterministic finite automata (NFA) for regular expression matching.
159Following Thompson's approach, a regular expression of length m is first converted
160to an NFA with O(m) nodes. It is possible to search a text of length n using the
161NFA directly in O(mn) worst case time. Often, a more efficient choice
162is to convert the NFA into a DFA. A DFA has only a single active state and
163allows to search the text at O(n) worst-case optimal. It is well known that the
164conversion of the NFA to the DFA may result in the state explosion problem.
165That is the resultant DFA may have O($2^m$) states.
166
167Thompson's original work marked the beginning of a long line of
168regular expression pattern matching methods that
169process an input string, character-at-a-time,
170and that transition from state to state
171according to the current state and the next input character.
172
174O(n) worst-case optimal efficiency, simple string matching algorithms,
175such as the Boyer-Moore family of algorithms, skip input characters
176to achieve sublinear times in the average case \cite{boyer1977fast}.
177
178Boyer-Moore methods, begin comparison from the end of the pattern instead of the
179beginning and precompute skip information
180to determine how far ahead a pattern search can skip in the input whenever
181a non-matching character is encountered. Generally, Boyer-Moore family
182algorithms improve faster as the pattern being searched for becomes longer.
183In many cases, the techniques used to skip characters in simple string matching
184approaches can be extended to regular expression matching.
185Widely known techniques used to facilitate character skipping in regular
186expression matching include necessary strings and backward suffix matching
187inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
188
189
190\subsection{Bit-parallel Simulation of Automata}
191
192Define bit-parallelism \cite{Baeza-yates_anew}
193
194Shift-Or / Shift-And \cite{wu1992fast}
195
196Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
197
198\subsection{Software Tools}
199
200%Thompson created the first grep (UNIX grep) as a standalone adaptation
201%of the regular expression parser he had written for the UNIX ed utility.
202%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
203%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
204%time it took to build a complete finite automaton for the regular expression before it could even start searching.
205%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
206%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
207%It took zero set-up time and just one additional test in the inner loop.
208%http://pages.cs.wisc.edu/~mdant/cs520_4.html
209
210%Given a regular expression R and a test T the regular expression matching
211%problem finds all ending position of substrings in Q that matches a string in
212%the language denoted by R.
213%The behaviour of Gnu grep, agrep, and nr-grep are differ ...
214%Gnu grep
215%agrep
216%nr-grep
217%re2
218
219
220\section{Bit-parallel Data Streams}
221\label{Bit-parallel Data Streams}
222
223The bit-parallel data streams use the wide
224SIMD registers commonly found on commodity processors
225to process  byte positions at a time using
226bitwise logic, shifting and other operations.
227
228A signficant advantage of the bit-parallel
229data stream method over other
230pattern matching methods that rely on
231bit-parallel automata simulation
232is the potential to skip full register width
233number of characters in low occurence frequency text. % reword
234
235
236Skip characters register width.
237
238\subsection{Match Star}
239
240%Wikipedia
241Backtracking is a general algorithm for finding all solutions to some computational problem,
242that incrementally builds candidates to the solutions.
243
244\section{Compiler Technology}
245\label{Compiler Technology}
246
247\section{Methodology}
248\label{Methodology}
249%\input{methodology.tex}
250
251We compare the performance of our bit-parallel data stream techniques against
252Gnu grep, agrep, nr-grep, and re2.
253
254
255
256\section{Experimental Results}
257\label{Experimental Results}
258%\input{results.tex}
259
260\section{Conclusion}
261\label{Conclusion}
262%\input{conclusion.tex}
263
264{
265  \bibliographystyle{acm}
266  \bibliography{reference}
267}
268
269\end{document}
Note: See TracBrowser for help on using the repository browser.