source: docs/Working/re/re-main.tex @ 3126

Last change on this file since 3126 was 3126, checked in by ksherdy, 6 years ago

Rough notes.

File size: 6.9 KB
Line 
1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3
4%opening
5\title{Fast Regular Expression Matching using Parallel Bitstreams}
6\author{
7{Robert D. Cameron} \\
8\and 
9{Kenneth S. Herdy} \\
10\and 
11{Ben Hull} \\
12\and 
13{Thomas C. Shermer} \\
14\\School of Computing Science
15\\Simon Fraser University
16}
17\begin{document}
18
19\date{}
20\maketitle
21
22\begin{abstract}
23
24A parallel regular expression matching method is introduced and studied in
25application to the problem of online pattern matching.
26
27The method is based on the concept of parallel
28bitstream technology, in which parallel streams of bits are formed such
29that each stream comprises bits in one-to-one correspondence with the
30character code units of a source data stream.
31
32On processors supporting W-bit addition operations, the method processes W source characters 
33in parallel and performs up to W finite state transitions per clock cycle.
34
35Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
36
37\end{abstract}
38
39\section{Introduction}
40\label{Introduction}
41%\input{introduction.tex}
42
43Regular expresssion matching is an extensively studied problem with application to
44numerous application domains. A multitude
45of algorithms and software tools have been developed to the address the particular
46demands of the various application domains.
47
48The pattern matching problem can be
49stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
50find all the text positions of T that start an occurrence of P. Alternatively,
51one may want all the final positions of occurrences. Some applications require
52slightly different output such as the line that matches the pattern.
53
54A pattern P can be a simple string, but it can also be, a regular expression.
55A regular expression, is an expression that specifies a set of strings.
56A regular expression is composed of (i) simple strings and (ii) the
57union, concatenation and Kleene closure of other regular expressions.
58To avoid parentheses it is assumed that the Kleene star has the highest priority,
59next concatenation and then alternation, however, most formalisms provides grouping
60operators to allow the definition of scope and operator precedence.
61Readers unfamiliar with the concept of regular expression matching are referred
62classical texts such as \cite{aho2007}.
63
64Regular expression matching is commonly performed using a wide variety of
65publically available software tools for on-line pattern matching. For instance,
66UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
67expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
68agrep, and nrgrep are widely known and considered as
69the fastest regular expression matching tools in practice \cite{navarro2000}.
70and are of particular interest to this study.
71
72% simple patterns, extended patterns, regular expressions
73
74% motivation / previous work
75Although tradi finite state machine methods used in the scanning and parsing of
76text streams is considered to be the hardest of the “13 dwarves” to parallelize
77[1], parallel bitstream technology shows considerable promise for these types of
78applications [3, 4]. In this approach, character streams are processed W positions
79at a time using the W-bit SIMD registers commonly found on commodity processors
80(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
81first slicing the byte streams into eight separate basis bitstreams, one for each bit
82position within the byte. These basis bitstreams are then combined with bitwise
83logic and shifting operations to compute further parallel bit streams of interest.
84
85We further increase the parallelism in our methods by introducing a new parallel
86scanning primitive which we have coined Match Star. Match Star returns all matches
87in a single operation and eliminates backtracking
88when a partially successful search path fails.
89
90       
91The remainder of this paper is organized as follows.
92
93Section~\ref{Background} presents background material on classic
94regular expression pattern matching techniques and provides insight into the
95efficiency of traditional regular expression software tools.
96
97Section~\ref{Parallel Bitwise Data Streams} describes out data parallel
98regular expression matching techniques.
99
100Section~\ref{Compiler Technology}
101
102Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
103presents a detailed performance analysis of our data parallel bitstream techniques against
104Gnu grep, agrep, and nr-grep.
105
106Section~\ref{Conclusion} concludes the paper.
107
108\section{Background}
109\label{Background}
110%\input{background.tex}
111
112% Background
113
114% History
115
116Historically, the origins of regular expression matching date back to automata theory
117and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
118
119In 1959, Dana and Scott demonstrated that
120NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA
121state corresponds to a set of NFA states.
122
123Thompson, in 1968, is credited with the first construction to convert regular expressions
124to nondeterministic finite automata (NFA) for regular expression matching.
125
126Thompson’s publication \cite{thompson1968} marked the beginning of a long line of
127regular expression implementations that construct automata for pattern matching.
128
129The traditional technique [16] to search a regular expression of length m in
130a text of length n is to first convert the expression into a non-deterministic
131automaton (NFA) with O(m) nodes. It is possible to search the text using the
132NFA directly in O(mn) worst case time. The cost comes from the fact that
133more than one state of the NFA may be active at each step, and therefore all
134may need to be updated. A more effcient choice is to convert the NFA into a
135DFA. A DFA has only a single active state and allows to search the text at
136O(n) worst-case optimal. The problem with this approach is that the DFA
137may have O($2^m$) states.
138
139In general, the general process is first to build a
140NFA from the regular expression and simulate the NFA on text input,
141or alternatively to convert the NFA into a
142DFA, optionally minimize the number of states in the DFA,
143and finally simulate the DFA on text input.
144
145\section{Parallel Bitwise Data Streams}
146\label{Parallel Bitwise Data Streams}
147
148\section{Compiler Technology}
149\label{Compiler Technology}
150
151\section{Methodology}
152\label{Methodology}
153%\input{methodology.tex}
154
155We compare the performance of our parallel \bitstream{} techniques against
156Gnu grep, agrep, and nr-grep.
157
158Given a regular expression R and a test T the regular expression matching
159problem finds all ending position of substrings in Q that matches a string in
160the language denoted by R.
161
162The behaviour of Gnu grep, agrep, and nr-grep are differ in that
163
164Gnu grep
165
166agrep
167
168nr-grep
169
170
171
172\section{Experimental Results}
173\label{Experimental Results}
174%\input{results.tex}
175
176\section{Conclusion}
177\label{Conclusion}
178%\input{conclusion.tex}
179
180{
181  \bibliographystyle{acm}
182  \bibliography{reference}
183}
184
185\end{document}
Note: See TracBrowser for help on using the repository browser.