source: docs/Working/re/re-main.tex @ 3124

Last change on this file since 3124 was 3124, checked in by ksherdy, 6 years ago

Rough notes.

File size: 5.9 KB
RevLine 
[3121]1\documentclass[a4paper,10pt]{article}
2\usepackage[utf8]{inputenc}
3
[3123]4\def \Bitstream{Bit Stream}
5\def \bitstream{bit stream}
6
[3121]7%opening
[3123]8\title{Fast Regular Expression Matching using Parallel \Bitstream{}s}
[3121]9\author{
10{Robert D. Cameron} \\
11\and 
12{Kenneth S. Herdy} \\
13\and 
14{Ben Hull} \\
15\and 
16{Thomas C. Shermer} \\
17\\School of Computing Science
18\\Simon Fraser University
19}
20\begin{document}
21
22\date{}
23\maketitle
24
25\begin{abstract}
26
[3123]27A data parallel regular expression matching method using the concept of bitstream technology
28is introduced and studied in application to the problem of fast regular expression matching.
29
30The method is based on the concept of parallel
31\bitstream{} technology, in which parallel streams of bits are formed such
32that each stream comprises bits in one-to-one correspondence with the
33character code units of a source data stream.
34
35On processors supporting W-bit addition operations, the method processes W source characters 
36in parallel and performs up to W finite state transitions per clock cycle.
37
38Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
39
[3121]40\end{abstract}
41
42\section{Introduction}
43\label{Introduction}
44%\input{introduction.tex}
45
[3124]46Regular expresssion matching is an extensively studied problem with a multitude
47of algorithms and software tools developed to the demands of particular problem contexts.
48
[3123]49Given a text T$_{1..n}$ of n characters and a pattern P, the pattern matching problem can be
50stated as follows. Find all the text positions of T that start an occurrence of P. 
51Alternatively, one may want all the final positions of occurrences. Some
52applications require slightly different output such as the line that matches the pattern.
53
54The pattern P can be just a simple string,
55but it can also be, for example, a regular expression.
56
57A regular expression, or pattern, is an expression that specifies a set of strings.
[3124]58A regular expression is composed of (i) simple strings (ii) the empty or (ii)
[3123]59union, concatenation and Kleene closure of other regular expressions.
60To avoid parentheses it is assumed that the Kleene star has the highest priority,
61next concatenation and then alternation, however, most formalisms provides grouping
62operators to allow the definition of scope and operator precedence.
63Readers unfamiliar with the concept of regular expression matching are referred
64classical texts such as \cite{aho2007}.
65
66Regular expression matching is commonly performed using a variety of
[3124]67publically available software tools. The most prominent, UNIX grep, 
68Gnu grep, agrep, cgrep, nrgrep, and Perl regular
[3123]69expressions \cite{Abou-assaleh04surveyof}.
70
[3124]71Amongst these Gnu grep, agrep, and nrgrep are widely known and considered as
72the fastest regular expression matching tools in practice \cite{}.
[3123]73
[3124]74Of particular interest to this study are the performance oriented Gnu grep, agrep, and nrgrep.
[3123]75
[3124]76% motivation / previous work
77Although the finite state machine methods used in the scanning and parsing of
78text streams is considered to be the hardest of the “13 dwarves” to parallelize
79[1], parallel bitstream technology shows considerable promise for these types of
80applications [3, 4]. In this approach, character streams are processed W positions
81at a time using the W-bit SIMD registers commonly found on commodity processors
82(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
83first slicing the byte streams into eight separate basis bitstreams, one for each bit
84position within the byte. These basis bitstreams are then combined with bitwise
85logic and shifting operations to compute further parallel bit streams of interest.
86
87We further increase the parallelism in our methods by introducing a new parallel
88scanning primitive which we have coined 'Match Star' that returns all matches in a single
89operation and eliminates the need for back tracking ... (ELABORATE)
90
91--- STATE the content of the next sections. ---
92
93
94We compare the performance of our parallel \bitstream{} techniques against
[3123]95various grep concentrate on the simpler case of
96reporting initial or final occurrence positions.
97
[3124]98
99\section{Background}
100\label{Background}
101%\input{background.tex}
102
[3123]103% Background
104
105% History
106
107Historically, the origins of regular expression matching date back to automata theory
108and formal language theory developed by Kleene in the 1950s \cite{kleene1951representation}.
109
110In 1959, Dana and Scott demonstrated that
111NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA
112state corresponds to a set of NFA states.
113
114Thompson, in 1968, is credited with the first construction to convert regular expressions
115to nondeterministic finite automata (NFA) for regular expression matching.
116
117Thompson’s publication \cite{thompson1968} marked the beginning of a long line of
118regular expression implementations that construct automata for pattern matching.
119
120The traditional technique [16] to search a regular expression of length m in
121a text of length n is to first convert the expression into a non-deterministic
122automaton (NFA) with O(m) nodes. It is possible to search the text using the
123NFA directly in O(mn) worst case time. The cost comes from the fact that
124more than one state of the NFA may be active at each step, and therefore all
125may need to be updated. A more effcient choice is to convert the NFA into a
126DFA. A DFA has only a single active state and allows to search the text at
127O(n) worst-case optimal. The problem with this approach is that the DFA
128may have O(2^m) states.
129
[3124]130In general, the general process is first to build a
131NFA from the regular expression and simulate the NFA on text input,
132or alternatively to convert the NFA into a
133DFA, optionally minimize the number of states in the DFA,
134and finally simulate the DFA on text input.
[3123]135
[3121]136
137\section{Methodology}
138\label{Methodology}
139%\input{methodology.tex}
140
141\section{Experimental Results}
142\label{results}
143%\input{results.tex}
144
145\section{Conclusion and Future Work}
146\label{conclusion}
147%\input{conclusion.tex}
148
[3123]149{
150  \bibliographystyle{acm}
151  \bibliography{reference}
152}
153
[3121]154\end{document}
Note: See TracBrowser for help on using the repository browser.