1 | \documentclass[a4paper,10pt]{article} |
---|
2 | \usepackage[utf8]{inputenc} |
---|
3 | |
---|
4 | \def \Bitstream{Bit Stream} |
---|
5 | \def \bitstream{bit stream} |
---|
6 | |
---|
7 | %opening |
---|
8 | \title{Fast Regular Expression Matching using Parallel \Bitstream{}s} |
---|
9 | \author{ |
---|
10 | {Robert D. Cameron} \\ |
---|
11 | \and |
---|
12 | {Kenneth S. Herdy} \\ |
---|
13 | \and |
---|
14 | {Ben Hull} \\ |
---|
15 | \and |
---|
16 | {Thomas C. Shermer} \\ |
---|
17 | \\School of Computing Science |
---|
18 | \\Simon Fraser University |
---|
19 | } |
---|
20 | \begin{document} |
---|
21 | |
---|
22 | \date{} |
---|
23 | \maketitle |
---|
24 | |
---|
25 | \begin{abstract} |
---|
26 | |
---|
27 | A data parallel regular expression matching method using the concept of bitstream technology |
---|
28 | is introduced and studied in application to the problem of fast regular expression matching. |
---|
29 | |
---|
30 | The method is based on the concept of parallel |
---|
31 | \bitstream{} technology, in which parallel streams of bits are formed such |
---|
32 | that each stream comprises bits in one-to-one correspondence with the |
---|
33 | character code units of a source data stream. |
---|
34 | |
---|
35 | On processors supporting W-bit addition operations, the method processes W source characters |
---|
36 | in parallel and performs up to W finite state transitions per clock cycle. |
---|
37 | |
---|
38 | Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives. |
---|
39 | |
---|
40 | \end{abstract} |
---|
41 | |
---|
42 | \section{Introduction} |
---|
43 | \label{Introduction} |
---|
44 | %\input{introduction.tex} |
---|
45 | |
---|
46 | % parallel bitstream technology, parallelization, regular expressions |
---|
47 | |
---|
48 | % motivation / previous work |
---|
49 | Although the finite state machine methods used in the scanning and parsing of |
---|
50 | text streams is considered to be the hardest of the â13 dwarvesâ to parallelize |
---|
51 | [1], parallel bitstream technology shows considerable promise for these types of |
---|
52 | applications [3, 4]. In this approach, character streams are processed W positions |
---|
53 | at a time using the W-bit SIMD registers commonly found on commodity processors |
---|
54 | (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by |
---|
55 | first slicing the byte streams into eight separate basis bitstreams, one for each bit |
---|
56 | position within the byte. These basis bitstreams are then combined with bitwise |
---|
57 | logic and shifting operations to compute further parallel bit streams of interest. |
---|
58 | |
---|
59 | We further increase the parallelism in our methods by introducing a new parallel |
---|
60 | scanning primitive which we have coined 'Match Star' that returns all matches in a single |
---|
61 | operation and eliminates the need for back tracking ... (ELABORATE) |
---|
62 | |
---|
63 | |
---|
64 | Given a text T$_{1..n}$ of n characters and a pattern P, the pattern matching problem can be |
---|
65 | stated as follows. Find all the text positions of T that start an occurrence of P. |
---|
66 | Alternatively, one may want all the final positions of occurrences. Some |
---|
67 | applications require slightly different output such as the line that matches the pattern. |
---|
68 | |
---|
69 | The pattern P can be just a simple string, |
---|
70 | but it can also be, for example, a regular expression. |
---|
71 | |
---|
72 | A regular expression, or pattern, is an expression that specifies a set of strings. |
---|
73 | A regular expression is composed of (i) simple strings (ii) the empty or (ii) |
---|
74 | union, concatenation and Kleene closure of other regular expressions. |
---|
75 | To avoid parentheses it is assumed that the Kleene star has the highest priority, |
---|
76 | next concatenation and then alternation, however, most formalisms provides grouping |
---|
77 | operators to allow the definition of scope and operator precedence. |
---|
78 | Readers unfamiliar with the concept of regular expression matching are referred |
---|
79 | classical texts such as \cite{aho2007}. |
---|
80 | |
---|
81 | Regular expression matching is commonly performed using a variety of |
---|
82 | publically available software tools. The most prominent, UNIX grep, |
---|
83 | Gnu grep, agrep, cgrep, nrgrep, and Perl regular |
---|
84 | expressions \cite{Abou-assaleh04surveyof}. |
---|
85 | |
---|
86 | Amongst these Gnu grep, agrep, and nrgrep are widely known and considered as |
---|
87 | the fastest regular expression matching tools in practice \cite{}. |
---|
88 | |
---|
89 | Of particular interest to this study are the performance oriented Gnu grep, agrep, and nrgrep. |
---|
90 | |
---|
91 | % Unix grep |
---|
92 | % Gnu grep |
---|
93 | % agrep |
---|
94 | % nrgrep |
---|
95 | |
---|
96 | Regular expresssion matching is an extensively studied problem with a multitude |
---|
97 | of algorithms and software tools developed to the demands of particular problem contexts. |
---|
98 | |
---|
99 | As such, we compare the performance of our parallel \bitstream{} techniques against |
---|
100 | various grep concentrate on the simpler case of |
---|
101 | reporting initial or final occurrence positions. |
---|
102 | |
---|
103 | |
---|
104 | \section{Background} |
---|
105 | \label{Background} |
---|
106 | %\input{background.tex} |
---|
107 | |
---|
108 | % Background |
---|
109 | |
---|
110 | % History |
---|
111 | |
---|
112 | Historically, the origins of regular expression matching date back to automata theory |
---|
113 | and formal language theory developed by Kleene in the 1950s \cite{kleene1951representation}. |
---|
114 | |
---|
115 | In 1959, Dana and Scott demonstrated that |
---|
116 | NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA |
---|
117 | state corresponds to a set of NFA states. |
---|
118 | |
---|
119 | Thompson, in 1968, is credited with the first construction to convert regular expressions |
---|
120 | to nondeterministic finite automata (NFA) for regular expression matching. |
---|
121 | |
---|
122 | Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of |
---|
123 | regular expression implementations that construct automata for pattern matching. |
---|
124 | |
---|
125 | The traditional technique [16] to search a regular expression of length m in |
---|
126 | a text of length n is to first convert the expression into a non-deterministic |
---|
127 | automaton (NFA) with O(m) nodes. It is possible to search the text using the |
---|
128 | NFA directly in O(mn) worst case time. The cost comes from the fact that |
---|
129 | more than one state of the NFA may be active at each step, and therefore all |
---|
130 | may need to be updated. A more effcient choice is to convert the NFA into a |
---|
131 | DFA. A DFA has only a single active state and allows to search the text at |
---|
132 | O(n) worst-case optimal. The problem with this approach is that the DFA |
---|
133 | may have O(2^m) states. |
---|
134 | |
---|
135 | In general, the general process is first to build a |
---|
136 | NFA from the regular expression and simulate the NFA on text input, |
---|
137 | or alternatively to convert the NFA into a |
---|
138 | DFA, optionally minimize the number of states in the DFA, |
---|
139 | and finally simulate the DFA on text input. |
---|
140 | |
---|
141 | |
---|
142 | \section{Methodology} |
---|
143 | \label{Methodology} |
---|
144 | %\input{methodology.tex} |
---|
145 | |
---|
146 | \section{Experimental Results} |
---|
147 | \label{results} |
---|
148 | %\input{results.tex} |
---|
149 | |
---|
150 | \section{Conclusion and Future Work} |
---|
151 | \label{conclusion} |
---|
152 | %\input{conclusion.tex} |
---|
153 | |
---|
154 | { |
---|
155 | \bibliographystyle{acm} |
---|
156 | \bibliography{reference} |
---|
157 | } |
---|
158 | |
---|
159 | \end{document} |
---|