source: docs/Working/re/re-main.tex @ 3458

Last change on this file since 3458 was 3458, checked in by lindanl, 6 years ago

Add GPU section(not finished)

File size: 14.4 KB
Line 
1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3\usepackage{graphicx}
4
5%opening
6\title{Fast Regular Expression Matching with Bit-parallel Data Streams}
7\author{
8{Robert D. Cameron} \\
9\and 
10{Kenneth S. Herdy} \\
11\and 
12{Ben Hull} \\
13\and 
14{Thomas C. Shermer} \\
15\\School of Computing Science
16\\Simon Fraser University
17}
18\begin{document}
19
20\date{}
21\maketitle
22
23\begin{abstract}
24
25A parallel regular expression matching pattern method is introduced and compared
26with existing approaches for efficient on-line search.
27The method is based on the concept of bit-parallel data streams,
28in which parallel streams of bits are formed such
29that each stream comprises bits in one-to-one correspondence with the
30character code units of a source data stream.
31
32An implementation of the techniques in the form of a regular expression
33compiler is discussed. The compiler accepts a regular expression and
34forms unbounded bit-parallel data stream statements. Bit-parallel operations
35are then transformed into a low-level C-based implementation for compilation into native
36pattern matching application code. These low-level C-based implementations take advantage of
37the SIMD (single-instruction multiple-data) capabilities of commodity
38processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
39On processors supporting W-bit addition operations, the method processes W source characters
40in parallel and performs up to W finite state transitions per clock cycle.
41We further improve our methods through the introduce a new bit-parallel scanning primitive, {\em Match Star}.
42which performs a parallel Kleene closure operation over character classes
43and without backtracking. % expand and rephrase description of Match Star
44
45We evaluate the performance of our method in comparison with several widely known {\em grep} 
46implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep}.
47Performance results are analyzed using the performance monitoring counters
48of commodity hardware. Overall, our results demonstrate dramatic speed-ups
49over other publically available software.
50
51\end{abstract}
52
53\section{Introduction}
54\label{Introduction}
55%\input{introduction.tex}
56
57Regular expresssion matching is an extensively studied problem with application to
58various problem domains, for example, text-processing and bioinformatics. A multitude of algorithms
59and software tools exist to address the specific demands of the various problem domains. % reword
60
61The pattern matching problem can be
62stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
63find all the text positions of T that start an occurrence of P. Alternatively,
64one may want all the final positions of occurrences. Some applications require
65slightly different output such as the line that matches the pattern.
66
67A pattern P can be a simple string, but it can also be a regular expression.
68A regular expression, is an expression that specifies a set of strings
69and is composed of (i) simple strings and (ii) the
70union, concatenation and Kleene closure of other regular expressions.
71To avoid parentheses it is assumed that the Kleene star has the highest priority,
72next concatenation and then alternation, however, most formalisms provides grouping
73operators to allow the definition of scope and operator precedence.
74Readers unfamiliar with the concept of regular expression matching are referred
75classical texts such as \cite{aho2007}.
76
77Regular expression matching is typically performed using a wide variety of
78publically available software tools designed for efficient on-line pattern matching.
79For example, UNIX grep, Gnu grep, agrep, cgrep, nr-grep, and Perl regular
80expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
81agrep, and nr-grep are widely considered the fastest regular expression
82matching tools in practice \cite{navarro2000}.
83
84% source: Parallel Scanning with BitStream Addition: An XML Case Study
85Although traditional finite state machine methods used in the scanning and parsing of
86text streams is considered to be the hardest of the “13 dwarves” to parallelize
87[1], bit-parallel data stream technology shows considerable promise as a general
88approach to regular expression matching. In this approach, character
89streams are processed W positions at a time using the W-bit SIMD registers
90commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips).
91This is achieved by first slicing the byte streams into eight separate basis bitstreams, one for each bit
92position within the byte. These basis bitstreams are then combined with bitwise
93logic and shifting operations to compute further parallel bit streams of interest.
94
95We further increase the performance of our approach with the introduction of a new parallel
96scanning primitive coined {\em Match Star}. % expand and reword desc. of Match Star
97
98The remainder of this paper is organized as follows.
99Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
100Section~\ref{Background} presents background material on classical automata-based approaches 
101to regular expression matching and describes the {\em grep} family utilities.
102Next, section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
103Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
104Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
105presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
106several software tools. Section~\ref{Conclusion} concludes the paper.
107
108The fundamental contribution of this paper is fully general approach to regular expression matching
109using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
110the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
111Specific contributions include:
112\begin{itemize}
113\item compilation of regular expressions into unbounded bit-parallel data stream equations;
114\item documentation of character classes compilation into bit-parallel character class data streams;
115\item Match Star parallel scanning primitive;
116\item efficient support for unicode characters.
117\end{itemize}
118
119\section{Basic Concepts}
120\label{Basic Concepts}
121In this section, we define the notation and basic concepts used throughout this paper.
122
123\subsection{Notation}
124We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$ 
125be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$ 
126be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$.
127The goal of simple pattern expression matching is to find all the text positions of T that follow
128an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
129
130We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
131represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent
132logical left shift, and logical right shift, respectively and are further modulated by
133the number of bit positions a given value shall be shifted by, for example ``shift left by n''.
134Vacant bit-positions are filled in with zeroes.
135
136\subsection{Regular Expressions}
137
138% Define regular expressions (a recursive def), character classes, bounded repetition
139
140\section{Background}
141\label{Background}
142%\input{background.tex}
143
144\subsection{Regular Expressions and Finite Automata}
145
146The origins of regular expression matching date back to automata theory
147and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
148Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
149to nondeterministic finite automata (NFA) for regular expression matching.
150Following Thompson's approach, a regular expression of length m is first converted
151to an NFA with O(m) nodes. It is possible to search a text of length n using the
152NFA directly in O(mn) worst case time. Often, a more efficient choice
153is to convert the NFA into a DFA. A DFA has only a single active state and
154allows to search the text at O(n) worst-case optimal. It is well known that the
155conversion of the NFA to the DFA may result in the state explosion problem.
156That is the resultant DFA may have O($2^m$) states.
157
158Thompson's original work marked the beginning of a long line of
159regular expression pattern matching methods that
160process an input string, character-at-a-time,
161and that transition from state to state
162according to the current state and the next input character.
163
164Whereas traditional automata techniques acheive
165O(n) worst-case optimal efficiency, simple string matching algorithms,
166such as the Boyer-Moore family of algorithms, skip input characters
167to achieve sublinear times in the average case \cite{boyer1977fast}.
168
169Boyer-Moore methods, begin comparison from the end of the pattern instead of the
170beginning and precompute skip information
171to determine how far ahead a pattern search can skip in the input whenever
172a non-matching character is encountered. Generally, Boyer-Moore family
173algorithms improve faster as the pattern being searched for becomes longer.
174In many cases, the techniques used to skip characters in simple string matching
175approaches can be extended to regular expression matching.
176Widely known techniques used to facilitate character skipping in regular
177expression matching include necessary strings and backward suffix matching
178inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
179
180\subsection{Bit-parallel Simulation of Automata}
181
182bit-parallelism \cite{Baeza-yates_anew}
183
184Shift-Or / Shift-And \cite{wu1992fast}
185
186Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
187
188i.e. bit parallel simulation of automata
189
190\subsection{Software Tools} % Methodology section?
191
192%This subsection relates the performance of well known tools (Gnu grep, agrep, nr-grep)
193%to the above pattern matching algorithms.
194
195%Source: http://pages.cs.wisc.edu/~mdant/cs520_4.html
196
197%Thompson created the first grep (UNIX grep) as a standalone adaptation
198%of the regular expression parser he had written for the UNIX ed utility.
199%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
200%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
201%time it took to build a complete finite automaton for the regular expression before it could even start searching.
202%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
203%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
204%It took zero set-up time and just one additional test in the inner loop.
205
206%Source: Chapter 7 Flexible Pattern Matching \cite{navarro2000}. \cite{navarro2000}
207
208%Gnu grep
209%agrep
210%nr-grep
211%re2
212
213\section{Bit-parallel Data Streams}
214\label{Bit-parallel Data Streams}
215
216% Contrast 'bit-parallel data streams' with the bit-parallel simulation of automata.
217
218The bit-parallel data streams use the wide
219SIMD registers commonly found on commodity processors
220to process  byte positions at a time using
221bitwise logic, shifting and other operations.
222
223A signficant advantage of the bit-parallel
224data stream methods over other
225pattern matching methods that rely on
226bit-parallel automata simulation
227is the potential to skip register width
228characters. % reword
229
230Skip SIMD register width.
231
232\subsection{Match Star}
233
234%Source: http://en.wikipedia.org/wiki/Backtracking
235%Backtracking is a general algorithm for finding all solutions to some computational problem,
236%that incrementally builds candidates to the solutions.
237
238Match Star takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
239
240Figure one illustrates the Match Star method.  The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data.
241
242In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.  $T_1$ has 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
243output marker stream is obtained by ORing $T_2$ with the initial marker stream.
244
245\begin{figure}[tbh]
246\begin{center}
247\begin{tabular}{cr}\\
248source data & \verb`--142578---125-231-----127--5394---94761205-`\\
249$M_0$ & \verb`.......1......1..1..1...1.............1..1..`\\
250$D = $\verb:[0-9]: & \verb`..111111...111.111.....111..1111...11111111.`\\
251$T_0 = M_0 \wedge D$ & \verb`.......1.........1......1.............1..1..`\\
252$T_1 = T_0 + D$ & \verb`.1.........1111.......1..1..1111..1...1...1.`\\
253$T_2 = T_1 \oplus D$ & \verb`.1111111......1111....111.........1111.111..`\\
254$M_1 = T_2 \, | \, M_0$ & \verb`.1111111......1111..1.111.........11111111..`
255\end{tabular}
256
257
258\end{center}
259\caption{Match Star}
260\label{fig:scan1}
261\end{figure}
262
263\section{Compiler Technology}
264\label{Compiler Technology}
265%\input{compilertechnology.tex}
266
267\section{Methodology}
268\label{Methodology}
269%\input{methodology.tex}
270
271We compare the performance of our bit-parallel data stream techniques against
272several fast regular expression matching implemenations,
273Gnu grep, agrep \cite{Wu92agrep-}, nr-grep, and re2.
274
275\section{Experimental Results}
276\label{Experimental Results}
277%\input{results.tex}
278
279
280\input{GPUImpl.tex}
281
282\section{Conclusion}
283\label{Conclusion}
284%\input{conclusion.tex}
285{
286  \bibliographystyle{acm}
287  \bibliography{reference}
288}
289
290\end{document}
Note: See TracBrowser for help on using the repository browser.