source: docs/Working/re/re-main.tex.backup @ 3138

Last change on this file since 3138 was 3138, checked in by ksherdy, 6 years ago

Rough notes.

File size: 8.4 KB
Line 
1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3
4%opening
5\title{Fast Regular Expression Matching using Parallel Bitstreams}
6\author{
7{Robert D. Cameron} \\
8\and
9{Kenneth S. Herdy} \\
10\and
11{Ben Hull} \\
12\and
13{Thomas C. Shermer} \\
14\\School of Computing Science
15\\Simon Fraser University
16}
17\begin{document}
18
19\date{}
20\maketitle
21
22\begin{abstract}
23
24A parallel regular expression matching method is introduced and studied in
25comparison with software tools designed for efficient on-line text searching such
26as the {\em grep} family.
27
28The method is based on the concept of bit-parallel data streams,
29in which parallel streams of bits are formed such
30that each stream comprises bits in one-to-one correspondence with the
31character code units of a source data stream.
32
33An implementation of the method in the form of a regular expression
34compiler is discussed. The compiler accepts a regular expression and produces
35sequences of arbitrary-length bit-parallel equations.
36Bit-parallel equations are then transformed
37into a low-level C-based implementation.
38
39The C-based program accepts the text to be searched and
40signals a match each time the text
41matches the compiled regular expression.
42
43
44These low-level implementations take advantage of
45the SIMD (single-instruction multiple-data) capabilities of commodity
46processors to yield a dramatic speed-up over
47traditional byte-at-a-time approaches.
48
49On processors supporting W-bit
50addition operations, the method processes W source characters 
51in parallel and performs up to W finite state transitions per clock cycle.
52
53Additional parallelism is achieved through the introduction a new parallel
54scanning primitive termed {\em Match Star}. % which eliminates backtracking in Kleene Closure.
55
56
57We demonstrate the features and efficiency of our bit-parallel pattern
58matching appoach by searching text data sets for lines matching a regular expression.
59
60We evaluate the performance of our method against the widely known grep implemenations,
61{\em Gnu grep}, {\em agrep}, {\em nr-grep}, as well as {\em Google's re2} regular expression engine.
62
63Performance results show a dramatic speed-up over the traditional and state-of-the-art alternatives.
64
65
66\end{abstract}
67
68\section{Introduction}
69\label{Introduction}
70%\input{introduction.tex}
71
72Regular expresssion matching is an extensively studied problem with application to
73numerous application domains. A multitude
74of algorithms and software tools have been developed to the address the particular
75demands of the various application domains.
76
77The pattern matching problem can be
78stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
79find all the text positions of T that start an occurrence of P. Alternatively,
80one may want all the final positions of occurrences. Some applications require
81slightly different output such as the line that matches the pattern.
82
83A pattern P can be a simple string, but it can also be, a regular expression.
84A regular expression, is an expression that specifies a set of strings.
85A regular expression is composed of (i) simple strings and (ii) the
86union, concatenation and Kleene closure of other regular expressions.
87To avoid parentheses it is assumed that the Kleene star has the highest priority,
88next concatenation and then alternation, however, most formalisms provides grouping
89operators to allow the definition of scope and operator precedence.
90Readers unfamiliar with the concept of regular expression matching are referred
91classical texts such as \cite{aho2007}.
92
93Regular expression matching is commonly performed using a wide variety of
94publically available software tools for on-line pattern matching. For instance,
95UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
96expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
97agrep, and nrgrep are widely known and considered as
98the fastest regular expression matching tools in practice \cite{navarro2000}.
99and are of particular interest to this study.
100
101% simple patterns, extended patterns, regular expressions
102
103% motivation / previous work
104Although tradi finite state machine methods used in the scanning and parsing of
105text streams is considered to be the hardest of the “13 dwarves” to parallelize
106[1], parallel bitstream technology shows considerable promise for these types of
107applications [3, 4]. In this approach, character streams are processed W positions
108at a time using the W-bit SIMD registers commonly found on commodity processors
109(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
110first slicing the byte streams into eight separate basis bitstreams, one for each bit
111position within the byte. These basis bitstreams are then combined with bitwise
112logic and shifting operations to compute further parallel bit streams of interest.
113
114We further increase the parallelism in our methods by introducing a new parallel
115scanning primitive which we have coined Match Star. Match Star returns all matches
116in a single operation and eliminates backtracking
117when a partially successful search path fails.
118
119The remainder of this paper is organized as follows.
120Section~\ref{Background} presents background material on classic
121regular expression pattern matching techniques and provides insight into the
122efficiency of traditional regular expression software tools. Next,
123Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
124Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
125Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
126presents a detailed performance analysis of our data parallel bitstream techniques against
127Gnu grep, agrep, and nr-grep.
128Section~\ref{Conclusion} concludes the paper.
129
130\section{Background}
131\label{Background}
132%\input{background.tex}
133
134The origins of regular expression matching date back to automata theory
135and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
136Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
137to nondeterministic finite automata (NFA) for regular expression matching.
138Following Thompson's approach, a regular expression of length m is first converted
139to an NFA with O(m) nodes. It is possible to search a text of length n using the
140NFA directly in O(mn) worst case time. Alternatively, a more efficient choice
141is to convert the NFA into a DFA. A DFA has only a single active state and
142allows to search the text at O(n) worst-case optimal. It is well known that the
143conversion of the NFA to the DFA may result in the state explosion problem.
144That is the resultant DFA may have O($2^m$) states.
145
146Thompson's original work marked the beginning of a long line of
147regular expression implementations that
148process an input string, character-by-character, and that change based on the current state and the
149character read. Thompson created the first grep utility as a standalone adaptation
150of the regular expression parser he had written for the UNIX ed utility. In 1976,
151Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
152
153Bit-parallel simulation of automata.
154
155Suffix automata (BDM/BNDM)
156
157Skip characters pattern length (occurence frequency and length).
158
159
160\section{Bit-parallel Data Streams}
161\label{Bit-parallel data streams}
162
163The advantage of the parallel bit stream representation is that we can
164use the 128-bit SIMD registers commonly found on commodity processors
165(e.g. SSE on Intel) to process 128 byte positions at a time using
166bitwise logic, shifting and other operations.
167
168Skip characters register width.
169
170\subsection{Mask Star}
171
172%Wikipedia
173Backtracking is a general algorithm for finding all solutions to some computational problem,
174that incrementally builds candidates to the solutions.
175
176\section{Compiler Technology}
177\label{Compiler Technology}
178
179\section{Methodology}
180\label{Methodology}
181%\input{methodology.tex}
182
183We compare the performance of our parallel \bitstream{} techniques against
184Gnu grep, agrep, and nr-grep.
185
186Given a regular expression R and a test T the regular expression matching
187problem finds all ending position of substrings in Q that matches a string in
188the language denoted by R.
189
190The behaviour of Gnu grep, agrep, and nr-grep are differ in that
191
192Gnu grep
193
194agrep
195
196nr-grep
197
198
199
200\section{Experimental Results}
201\label{Experimental Results}
202%\input{results.tex}
203
204\section{Conclusion}
205\label{Conclusion}
206%\input{conclusion.tex}
207
208{
209  \bibliographystyle{acm}
210  \bibliography{reference}
211}
212
213\end{document}
Note: See TracBrowser for help on using the repository browser.