1 | \documentclass[a4paper,10pt]{article} |
---|
2 | \usepackage[utf8]{inputenc} |
---|
3 | |
---|
4 | %opening |
---|
5 | \title{Fast Regular Expression Matching using Parallel Bitstreams} |
---|
6 | \author{ |
---|
7 | {Robert D. Cameron} \\ |
---|
8 | \and |
---|
9 | {Kenneth S. Herdy} \\ |
---|
10 | \and |
---|
11 | {Ben Hull} \\ |
---|
12 | \and |
---|
13 | {Thomas C. Shermer} \\ |
---|
14 | \\School of Computing Science |
---|
15 | \\Simon Fraser University |
---|
16 | } |
---|
17 | \begin{document} |
---|
18 | |
---|
19 | \date{} |
---|
20 | \maketitle |
---|
21 | |
---|
22 | \begin{abstract} |
---|
23 | |
---|
24 | A parallel regular expression matching method is introduced and studied in |
---|
25 | application to the problem of online pattern matching. |
---|
26 | |
---|
27 | The method is based on the concept of parallel |
---|
28 | bitstream technology, in which parallel streams of bits are formed such |
---|
29 | that each stream comprises bits in one-to-one correspondence with the |
---|
30 | character code units of a source data stream. |
---|
31 | |
---|
32 | On processors supporting W-bit addition operations, the method processes W source characters |
---|
33 | in parallel and performs up to W finite state transitions per clock cycle. |
---|
34 | |
---|
35 | Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives. |
---|
36 | |
---|
37 | \end{abstract} |
---|
38 | |
---|
39 | \section{Introduction} |
---|
40 | \label{Introduction} |
---|
41 | %\input{introduction.tex} |
---|
42 | |
---|
43 | Regular expresssion matching is an extensively studied problem with application to |
---|
44 | numerous application domains. A multitude |
---|
45 | of algorithms and software tools have been developed to the address the particular |
---|
46 | demands of the various application domains. |
---|
47 | |
---|
48 | The pattern matching problem can be |
---|
49 | stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P, |
---|
50 | find all the text positions of T that start an occurrence of P. Alternatively, |
---|
51 | one may want all the final positions of occurrences. Some applications require |
---|
52 | slightly different output such as the line that matches the pattern. |
---|
53 | |
---|
54 | A pattern P can be a simple string, but it can also be, a regular expression. |
---|
55 | A regular expression, is an expression that specifies a set of strings. |
---|
56 | A regular expression is composed of (i) simple strings and (ii) the |
---|
57 | union, concatenation and Kleene closure of other regular expressions. |
---|
58 | To avoid parentheses it is assumed that the Kleene star has the highest priority, |
---|
59 | next concatenation and then alternation, however, most formalisms provides grouping |
---|
60 | operators to allow the definition of scope and operator precedence. |
---|
61 | Readers unfamiliar with the concept of regular expression matching are referred |
---|
62 | classical texts such as \cite{aho2007}. |
---|
63 | |
---|
64 | Regular expression matching is commonly performed using a wide variety of |
---|
65 | publically available software tools for on-line pattern matching. For instance, |
---|
66 | UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular |
---|
67 | expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep), |
---|
68 | agrep, and nrgrep are widely known and considered as |
---|
69 | the fastest regular expression matching tools in practice \cite{navarro2000}. |
---|
70 | and are of particular interest to this study. |
---|
71 | |
---|
72 | % simple patterns, extended patterns, regular expressions |
---|
73 | |
---|
74 | % motivation / previous work |
---|
75 | Although tradi finite state machine methods used in the scanning and parsing of |
---|
76 | text streams is considered to be the hardest of the â13 dwarvesâ to parallelize |
---|
77 | [1], parallel bitstream technology shows considerable promise for these types of |
---|
78 | applications [3, 4]. In this approach, character streams are processed W positions |
---|
79 | at a time using the W-bit SIMD registers commonly found on commodity processors |
---|
80 | (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by |
---|
81 | first slicing the byte streams into eight separate basis bitstreams, one for each bit |
---|
82 | position within the byte. These basis bitstreams are then combined with bitwise |
---|
83 | logic and shifting operations to compute further parallel bit streams of interest. |
---|
84 | |
---|
85 | We further increase the parallelism in our methods by introducing a new parallel |
---|
86 | scanning primitive which we have coined Match Star. Match Star returns all matches |
---|
87 | in a single operation and eliminates backtracking |
---|
88 | when a partially successful search path fails. |
---|
89 | |
---|
90 | |
---|
91 | The remainder of this paper is organized as follows. |
---|
92 | |
---|
93 | Section~\ref{Background} presents background material on classic |
---|
94 | regular expression pattern matching techniques and provides insight into the |
---|
95 | efficiency of traditional regular expression software tools. |
---|
96 | |
---|
97 | Section~\ref{Parallel Bitwise Data Streams} describes out data parallel |
---|
98 | regular expression matching techniques. |
---|
99 | |
---|
100 | Section~\ref{Compiler Technology} |
---|
101 | |
---|
102 | Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} |
---|
103 | presents a detailed performance analysis of our data parallel bitstream techniques against |
---|
104 | Gnu grep, agrep, and nr-grep. |
---|
105 | |
---|
106 | Section~\ref{Conclusion} concludes the paper. |
---|
107 | |
---|
108 | \section{Background} |
---|
109 | \label{Background} |
---|
110 | %\input{background.tex} |
---|
111 | |
---|
112 | % Background |
---|
113 | |
---|
114 | % History |
---|
115 | |
---|
116 | Historically, the origins of regular expression matching date back to automata theory |
---|
117 | and formal language theory developed by Kleene in the 1950s \cite{kleene1951}. |
---|
118 | |
---|
119 | In 1959, Dana and Scott demonstrated that |
---|
120 | NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA |
---|
121 | state corresponds to a set of NFA states. |
---|
122 | |
---|
123 | Thompson, in 1968, is credited with the first construction to convert regular expressions |
---|
124 | to nondeterministic finite automata (NFA) for regular expression matching. |
---|
125 | |
---|
126 | Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of |
---|
127 | regular expression implementations that construct automata for pattern matching. |
---|
128 | |
---|
129 | The traditional technique [16] to search a regular expression of length m in |
---|
130 | a text of length n is to first convert the expression into a non-deterministic |
---|
131 | automaton (NFA) with O(m) nodes. It is possible to search the text using the |
---|
132 | NFA directly in O(mn) worst case time. The cost comes from the fact that |
---|
133 | more than one state of the NFA may be active at each step, and therefore all |
---|
134 | may need to be updated. A more effcient choice is to convert the NFA into a |
---|
135 | DFA. A DFA has only a single active state and allows to search the text at |
---|
136 | O(n) worst-case optimal. The problem with this approach is that the DFA |
---|
137 | may have O($2^m$) states. |
---|
138 | |
---|
139 | In general, the general process is first to build a |
---|
140 | NFA from the regular expression and simulate the NFA on text input, |
---|
141 | or alternatively to convert the NFA into a |
---|
142 | DFA, optionally minimize the number of states in the DFA, |
---|
143 | and finally simulate the DFA on text input. |
---|
144 | |
---|
145 | \section{Parallel Bitwise Data Streams} |
---|
146 | \label{Parallel Bitwise Data Streams} |
---|
147 | |
---|
148 | \section{Compiler Technology} |
---|
149 | \label{Compiler Technology} |
---|
150 | |
---|
151 | \section{Methodology} |
---|
152 | \label{Methodology} |
---|
153 | %\input{methodology.tex} |
---|
154 | |
---|
155 | We compare the performance of our parallel \bitstream{} techniques against |
---|
156 | Gnu grep, agrep, and nr-grep. |
---|
157 | |
---|
158 | Given a regular expression R and a test T the regular expression matching |
---|
159 | problem finds all ending position of substrings in Q that matches a string in |
---|
160 | the language denoted by R. |
---|
161 | |
---|
162 | The behaviour of Gnu grep, agrep, and nr-grep are differ in that |
---|
163 | |
---|
164 | Gnu grep |
---|
165 | |
---|
166 | agrep |
---|
167 | |
---|
168 | nr-grep |
---|
169 | |
---|
170 | |
---|
171 | |
---|
172 | \section{Experimental Results} |
---|
173 | \label{Experimental Results} |
---|
174 | %\input{results.tex} |
---|
175 | |
---|
176 | \section{Conclusion} |
---|
177 | \label{Conclusion} |
---|
178 | %\input{conclusion.tex} |
---|
179 | |
---|
180 | { |
---|
181 | \bibliographystyle{acm} |
---|
182 | \bibliography{reference} |
---|
183 | } |
---|
184 | |
---|
185 | \end{document} |
---|