source: docs/Working/re/re-main.tex.backup @ 3126

Last change on this file since 3126 was 3126, checked in by ksherdy, 6 years ago

Rough notes.

File size: 6.9 KB
Line 
1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3
4\def \Bitstream{Bit Stream}
5\def \bitstream{bit stream}
6
7%opening
8\title{Fast Regular Expression Matching using Parallel \Bitstream{}s}
9\author{
10{Robert D. Cameron} \\
11\and
12{Kenneth S. Herdy} \\
13\and
14{Ben Hull} \\
15\and
16{Thomas C. Shermer} \\
17\\School of Computing Science
18\\Simon Fraser University
19}
20\begin{document}
21
22\date{}
23\maketitle
24
25\begin{abstract}
26
27A parallel regular expression matching method is introduced and studied in
28application to the problem of online pattern matching.
29
30The method is based on the concept of parallel
31\bitstream{} technology, in which parallel streams of bits are formed such
32that each stream comprises bits in one-to-one correspondence with the
33character code units of a source data stream.
34
35On processors supporting W-bit addition operations, the method processes W source characters 
36in parallel and performs up to W finite state transitions per clock cycle.
37
38Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
39
40\end{abstract}
41
42\section{Introduction}
43\label{Introduction}
44%\input{introduction.tex}
45
46Regular expresssion matching is an extensively studied problem with application to
47numerous application domains. A multitude
48of algorithms and software tools have been developed to the address the particular
49demands of the various application domains.
50
51The pattern matching problem can be
52stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
53find all the text positions of T that start an occurrence of P. Alternatively,
54one may want all the final positions of occurrences. Some applications require
55slightly different output such as the line that matches the pattern.
56
57A pattern P can be a simple string, but it can also be, a regular expression.
58A regular expression, is an expression that specifies a set of strings.
59A regular expression is composed of (i) simple strings and (ii) the
60union, concatenation and Kleene closure of other regular expressions.
61To avoid parentheses it is assumed that the Kleene star has the highest priority,
62next concatenation and then alternation, however, most formalisms provides grouping
63operators to allow the definition of scope and operator precedence.
64Readers unfamiliar with the concept of regular expression matching are referred
65classical texts such as \cite{aho2007}.
66
67Regular expression matching is commonly performed using a wide variety of
68publically available software tools for on-line pattern matching. For instance,
69UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
70expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
71agrep, and nrgrep are widely known and considered as
72the fastest regular expression matching tools in practice \cite{navarro2000}.
73and are of particular interest to this study.
74
75% simple patterns, extended patterns, regular expressions
76
77% motivation / previous work
78Although tradi finite state machine methods used in the scanning and parsing of
79text streams is considered to be the hardest of the “13 dwarves” to parallelize
80[1], parallel bitstream technology shows considerable promise for these types of
81applications [3, 4]. In this approach, character streams are processed W positions
82at a time using the W-bit SIMD registers commonly found on commodity processors
83(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
84first slicing the byte streams into eight separate basis bitstreams, one for each bit
85position within the byte. These basis bitstreams are then combined with bitwise
86logic and shifting operations to compute further parallel bit streams of interest.
87
88We further increase the parallelism in our methods by introducing a new parallel
89scanning primitive which we have coined Match Star. Match Star returns all matches
90in a single operation and eliminates backtracking
91when a partially successful search path fails.
92
93       
94The remainder of this paper is organized as follows.
95
96Section~\ref{Background} presents background material on classic
97regular expression pattern matching techniques and provides insight into the
98efficiency of traditional regular expression software tools.
99
100Section~\ref{Bitwise Parallel Data Streams} describes out data parallel
101regular expression matching techniques.
102
103Section~\ref{Compiler Technology}
104
105Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
106presents a detailed performance analysis of our data parallel \bitstream{} techniques against
107Gnu grep, agrep, and nr-grep.
108
109Section~\ref{conclusion} concludes the paper.
110
111\section{Background}
112\label{Background}
113%\input{background.tex}
114
115% Background
116
117% History
118
119Historically, the origins of regular expression matching date back to automata theory
120and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
121
122In 1959, Dana and Scott demonstrated that
123NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA
124state corresponds to a set of NFA states.
125
126Thompson, in 1968, is credited with the first construction to convert regular expressions
127to nondeterministic finite automata (NFA) for regular expression matching.
128
129Thompson’s publication \cite{thompson1968} marked the beginning of a long line of
130regular expression implementations that construct automata for pattern matching.
131
132The traditional technique [16] to search a regular expression of length m in
133a text of length n is to first convert the expression into a non-deterministic
134automaton (NFA) with O(m) nodes. It is possible to search the text using the
135NFA directly in O(mn) worst case time. The cost comes from the fact that
136more than one state of the NFA may be active at each step, and therefore all
137may need to be updated. A more effcient choice is to convert the NFA into a
138DFA. A DFA has only a single active state and allows to search the text at
139O(n) worst-case optimal. The problem with this approach is that the DFA
140may have O(2^m) states.
141
142In general, the general process is first to build a
143NFA from the regular expression and simulate the NFA on text input,
144or alternatively to convert the NFA into a
145DFA, optionally minimize the number of states in the DFA,
146and finally simulate the DFA on text input.
147
148
149
150\section{Parallel Bitwise Data Streams}
151\label{Parallel Bitwise Data Streams}
152
153
154\section{Compiler Technology}
155\label{Compiler Technology}
156
157\section{Methodology}
158\label{Methodology}
159%\input{methodology.tex}
160
161We compare the performance of our parallel \bitstream{} techniques against
162Gnu grep, agrep, and nr-grep.
163
164Given a regular expression R and a test T the regular expression matching
165problem finds all ending position of substrings in Q that matches a string in
166the language denoted by R.
167
168The behaviour of Gnu grep, agrep, and nr-grep are differ in that
169
170Gnu grep
171
172agrep
173
174nr-grep
175
176
177
178\section{Experimental Results}
179\label{results}
180%\input{results.tex}
181
182\section{Conclusion and Future Work}
183\label{conclusion}
184%\input{conclusion.tex}
185
186{
187  \bibliographystyle{acm}
188  \bibliography{reference}
189}
190
191\end{document}
Note: See TracBrowser for help on using the repository browser.