1 | \documentclass[a4paper,10pt]{article} |
---|
2 | \usepackage[utf8]{inputenc} |
---|
3 | |
---|
4 | \def \Bitstream{Bit Stream} |
---|
5 | \def \bitstream{bit stream} |
---|
6 | |
---|
7 | %opening |
---|
8 | \title{Fast Regular Expression Matching using Parallel \Bitstream{}s} |
---|
9 | \author{ |
---|
10 | {Robert D. Cameron} \\ |
---|
11 | \and |
---|
12 | {Kenneth S. Herdy} \\ |
---|
13 | \and |
---|
14 | {Ben Hull} \\ |
---|
15 | \and |
---|
16 | {Thomas C. Shermer} \\ |
---|
17 | \\School of Computing Science |
---|
18 | \\Simon Fraser University |
---|
19 | } |
---|
20 | \begin{document} |
---|
21 | |
---|
22 | \date{} |
---|
23 | \maketitle |
---|
24 | |
---|
25 | \begin{abstract} |
---|
26 | |
---|
27 | A data parallel regular expression matching method using the concept of bitstream technology |
---|
28 | is introduced and studied in application to the problem of fast regular expression matching. |
---|
29 | |
---|
30 | The method is based on the concept of parallel |
---|
31 | \bitstream{} technology, in which parallel streams of bits are formed such |
---|
32 | that each stream comprises bits in one-to-one correspondence with the |
---|
33 | character code units of a source data stream. |
---|
34 | |
---|
35 | On processors supporting W-bit addition operations, the method processes W source characters |
---|
36 | in parallel and performs up to W finite state transitions per clock cycle. |
---|
37 | |
---|
38 | Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives. |
---|
39 | |
---|
40 | |
---|
41 | \end{abstract} |
---|
42 | |
---|
43 | \section{Introduction} |
---|
44 | \label{Introduction} |
---|
45 | %\input{introduction.tex} |
---|
46 | |
---|
47 | Given a text T$_{1..n}$ of n characters and a pattern P, the pattern matching problem can be |
---|
48 | stated as follows. Find all the text positions of T that start an occurrence of P. |
---|
49 | Alternatively, one may want all the final positions of occurrences. Some |
---|
50 | applications require slightly different output such as the line that matches the pattern. |
---|
51 | |
---|
52 | The pattern P can be just a simple string, |
---|
53 | but it can also be, for example, a regular expression. |
---|
54 | In this paper our focus is regular expression matching. |
---|
55 | |
---|
56 | A regular expression, or pattern, is an expression that specifies a set of strings. |
---|
57 | A regular expression is composed of (i) basic strings and (ii) |
---|
58 | union, concatenation and Kleene closure of other regular expressions. |
---|
59 | To avoid parentheses it is assumed that the Kleene star has the highest priority, |
---|
60 | next concatenation and then alternation, however, most formalisms provides grouping |
---|
61 | operators to allow the definition of scope and operator precedence. |
---|
62 | |
---|
63 | Readers unfamiliar with the concept of regular expression matching are referred |
---|
64 | classical texts such as \cite{aho2007}. |
---|
65 | |
---|
66 | \subsection{Grep} |
---|
67 | |
---|
68 | Regular expression matching is commonly performed using a variety of |
---|
69 | publically available software tools. Namely, the UNIX grep family, |
---|
70 | the GNU grep family, agrep, cgrep, sgrep, nrgrep, and Perl regular |
---|
71 | expressions \cite{Abou-assaleh04surveyof}. |
---|
72 | |
---|
73 | % Grep |
---|
74 | % Unix grep |
---|
75 | % Gnu grep |
---|
76 | % agrep |
---|
77 | % nrgrep |
---|
78 | |
---|
79 | Of particular interest are well-studied and performance oriented |
---|
80 | Gnu grep, agrep, and nrgrep. |
---|
81 | |
---|
82 | As such, we compare the performance of our parallel \bitstream{} techniques against |
---|
83 | various grep concentrate on the simpler case of |
---|
84 | reporting initial or final occurrence positions. |
---|
85 | |
---|
86 | % Background |
---|
87 | |
---|
88 | % History |
---|
89 | |
---|
90 | Regular expresssion matching is an extensively studied problem with a multitude |
---|
91 | of algorithms and software tools developed to the demands of particular problem contexts. |
---|
92 | |
---|
93 | Historically, the origins of regular expression matching date back to automata theory |
---|
94 | and formal language theory developed by Kleene in the 1950s \cite{kleene1951representation}. |
---|
95 | |
---|
96 | In 1959, Dana and Scott demonstrated that |
---|
97 | NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA |
---|
98 | state corresponds to a set of NFA states. |
---|
99 | |
---|
100 | Thompson, in 1968, is credited with the first construction to convert regular expressions |
---|
101 | to nondeterministic finite automata (NFA) for regular expression matching. |
---|
102 | |
---|
103 | Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of |
---|
104 | regular expression implementations that construct automata for pattern matching. |
---|
105 | |
---|
106 | The traditional technique [16] to search a regular expression of length m in |
---|
107 | a text of length n is to first convert the expression into a non-deterministic |
---|
108 | automaton (NFA) with O(m) nodes. It is possible to search the text using the |
---|
109 | NFA directly in O(mn) worst case time. The cost comes from the fact that |
---|
110 | more than one state of the NFA may be active at each step, and therefore all |
---|
111 | may need to be updated. A more effcient choice is to convert the NFA into a |
---|
112 | DFA. A DFA has only a single active state and allows to search the text at |
---|
113 | O(n) worst-case optimal. The problem with this approach is that the DFA |
---|
114 | may have O(2^m) states. |
---|
115 | |
---|
116 | Overall, the general process is first to build a |
---|
117 | NFA from the regular expression, convert the NFA into a |
---|
118 | DFA, minimize the number of states in the DFA, |
---|
119 | and finally use the DFA to scan the text. |
---|
120 | |
---|
121 | \section{Background} |
---|
122 | \label{Background} |
---|
123 | %\input{background.tex} |
---|
124 | |
---|
125 | \section{Methodology} |
---|
126 | \label{Methodology} |
---|
127 | %\input{methodology.tex} |
---|
128 | |
---|
129 | \section{Experimental Results} |
---|
130 | \label{results} |
---|
131 | %\input{results.tex} |
---|
132 | |
---|
133 | \section{Conclusion and Future Work} |
---|
134 | \label{conclusion} |
---|
135 | %\input{conclusion.tex} |
---|
136 | |
---|
137 | { |
---|
138 | \bibliographystyle{acm} |
---|
139 | \bibliography{reference} |
---|
140 | } |
---|
141 | |
---|
142 | \end{document} |
---|