1 | \documentclass[a4paper,10pt]{article} |
---|
2 | \usepackage[utf8]{inputenc} |
---|
3 | |
---|
4 | %opening |
---|
5 | \title{Fast Regular Expression Matching with Bit-parallel Data Streams} |
---|
6 | \author{ |
---|
7 | {Robert D. Cameron} \\ |
---|
8 | \and |
---|
9 | {Kenneth S. Herdy} \\ |
---|
10 | \and |
---|
11 | {Ben Hull} \\ |
---|
12 | \and |
---|
13 | {Thomas C. Shermer} \\ |
---|
14 | \\School of Computing Science |
---|
15 | \\Simon Fraser University |
---|
16 | } |
---|
17 | \begin{document} |
---|
18 | |
---|
19 | \date{} |
---|
20 | \maketitle |
---|
21 | |
---|
22 | \begin{abstract} |
---|
23 | |
---|
24 | An parallel regular expression matching pattern method is |
---|
25 | introduced and compared with the state of the art in software tools designed for |
---|
26 | efficient on-line search. %such as the {\em grep} family pattern matching tools. |
---|
27 | The method is based on the concept of bit-parallel data streams, |
---|
28 | in which parallel streams of bits are formed such |
---|
29 | that each stream comprises bits in one-to-one correspondence with the |
---|
30 | character code units of a source data stream. |
---|
31 | |
---|
32 | An implementation of the method in the form of a regular expression |
---|
33 | compiler is discussed. The compiler accepts a regular expression and |
---|
34 | forms unbounded bit-parallel data stream operations. Bit-parallel operations |
---|
35 | are then transformed into a low-level C-based implementation for compilation into native |
---|
36 | pattern matching applications. These low-level C-based implementations take advantage of |
---|
37 | the SIMD (single-instruction multiple-data) capabilities of commodity |
---|
38 | processors to yield a dramatic speed-up over traditional byte-at-a-time approaches. |
---|
39 | On processors supporting W-bit addition operations, the method processes W source characters |
---|
40 | in parallel and performs up to W finite state transitions per clock cycle. |
---|
41 | |
---|
42 | We introduce a new bit-parallel scanning primitive, {\em Match Star}. |
---|
43 | which performs parallel Kleene closure over character classes |
---|
44 | and without backtracking. % expand and rephrase description of Match Star |
---|
45 | |
---|
46 | We evaluate the performance of our method in comparison with several widely known {\em grep} |
---|
47 | family implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep}, |
---|
48 | and regular expression engines such as {\em Google's re2}. |
---|
49 | Performance results are analyzed using the performance monitoring counters |
---|
50 | of commodity hardware. Overall, our results demonstrate a dramatic speed-up |
---|
51 | over publically available alternatives. |
---|
52 | |
---|
53 | \end{abstract} |
---|
54 | |
---|
55 | \section{Introduction} |
---|
56 | \label{Introduction} |
---|
57 | %\input{introduction.tex} |
---|
58 | |
---|
59 | Regular expresssion matching is an extensively studied problem with application to |
---|
60 | text processing and bioinformatics and with numerous algorithms |
---|
61 | and software tools developed to the address the particular |
---|
62 | processing demands. % reword |
---|
63 | |
---|
64 | The pattern matching problem can be |
---|
65 | stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P, |
---|
66 | find all the text positions of T that start an occurrence of P. Alternatively, |
---|
67 | one may want all the final positions of occurrences. Some applications require |
---|
68 | slightly different output such as the line that matches the pattern. |
---|
69 | |
---|
70 | A pattern P can be a simple string, but it can also be a regular expression. |
---|
71 | A regular expression, is an expression that specifies a set of strings |
---|
72 | and is composed of (i) simple strings and (ii) the |
---|
73 | union, concatenation and Kleene closure of other regular expressions. |
---|
74 | To avoid parentheses it is assumed that the Kleene star has the highest priority, |
---|
75 | next concatenation and then alternation, however, most formalisms provides grouping |
---|
76 | operators to allow the definition of scope and operator precedence. |
---|
77 | Readers unfamiliar with the concept of regular expression matching are referred |
---|
78 | classical texts such as \cite{aho2007}. |
---|
79 | |
---|
80 | Regular expression matching is commonly performed using a wide variety of |
---|
81 | publically available software tools for on-line pattern matching. For instance, |
---|
82 | UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular |
---|
83 | expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep), |
---|
84 | agrep, and nrgrep are widely known and considered as |
---|
85 | the fastest regular expression matching tools in practice \cite{navarro2000}. |
---|
86 | and are of particular interest to this study. |
---|
87 | |
---|
88 | % simple patterns, extended patterns, regular expressions |
---|
89 | |
---|
90 | % motivation / previous work |
---|
91 | Although tradi finite state machine methods used in the scanning and parsing of |
---|
92 | text streams is considered to be the hardest of the â13 dwarvesâ to parallelize |
---|
93 | [1], parallel bitstream technology shows considerable promise for these types of |
---|
94 | applications [3, 4]. In this approach, character streams are processed W positions |
---|
95 | at a time using the W-bit SIMD registers commonly found on commodity processors |
---|
96 | (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by |
---|
97 | first slicing the byte streams into eight separate basis bitstreams, one for each bit |
---|
98 | position within the byte. These basis bitstreams are then combined with bitwise |
---|
99 | logic and shifting operations to compute further parallel bit streams of interest. |
---|
100 | |
---|
101 | We further increase the potential parallelism in our approach by introducing a new parallel |
---|
102 | scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star |
---|
103 | |
---|
104 | The remainder of this paper is organized as follows. |
---|
105 | Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper. |
---|
106 | Section~\ref{Background} presents background material on classical automata-based approaches |
---|
107 | to regular expression matching as well as describes the efficient {\em grep} family utilities |
---|
108 | and regular expression pattern matching software libraries. |
---|
109 | Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques. |
---|
110 | Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. |
---|
111 | Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} |
---|
112 | presents a detailed performance analysis of our data parallel bitstream techniques in comparison to |
---|
113 | several software tools. Section~\ref{Conclusion} concludes the paper. |
---|
114 | |
---|
115 | The fundamental contribution of this paper is fully general approach to regular expression matching |
---|
116 | using bit-parallel data stream operations. The algorithmic aspects of this paper build upon |
---|
117 | the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. |
---|
118 | Specific contributions include: |
---|
119 | \begin{itemize} |
---|
120 | \item compilation of regular expressions into unbounded bit-parallel data stream equations; |
---|
121 | \item documentation of character classes compilation into bit-parallel character class data streams; |
---|
122 | \item the Match Star parallel scanning primitive; |
---|
123 | \item efficient support for unicode characters. |
---|
124 | \end{itemize} |
---|
125 | |
---|
126 | \section{Basic Concepts} |
---|
127 | \label{Basic Concepts} |
---|
128 | In this section, we define the notation and basic concepts used throughout this paper. |
---|
129 | |
---|
130 | \subsection{Notation} |
---|
131 | We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$ |
---|
132 | be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$ |
---|
133 | be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$. |
---|
134 | The goal of simple pattern expression matching is to find all the text positions of T that follow |
---|
135 | an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression. |
---|
136 | |
---|
137 | We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$ |
---|
138 | represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent |
---|
139 | logical left shift, and logical right shift, respectively and are further modulated by |
---|
140 | the number of bit positions a given value shall be shifted by, for example ``shift left by n''. |
---|
141 | Vacant bit-positions are filled in with zeroes. |
---|
142 | |
---|
143 | \subsection{Regular Expressions} |
---|
144 | |
---|
145 | % Define regular expressions (a recursive def), character classes, bounded repetition |
---|
146 | |
---|
147 | \section{Background} |
---|
148 | \label{Background} |
---|
149 | %\input{background.tex} |
---|
150 | |
---|
151 | \subsection{Classical Methods} |
---|
152 | |
---|
153 | \subsection{Regular Expression and Finite Automata} |
---|
154 | |
---|
155 | The origins of regular expression matching date back to automata theory |
---|
156 | and formal language theory developed by Kleene in the 1950s \cite{kleene1951}. |
---|
157 | Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions |
---|
158 | to nondeterministic finite automata (NFA) for regular expression matching. |
---|
159 | Following Thompson's approach, a regular expression of length m is first converted |
---|
160 | to an NFA with O(m) nodes. It is possible to search a text of length n using the |
---|
161 | NFA directly in O(mn) worst case time. Often, a more efficient choice |
---|
162 | is to convert the NFA into a DFA. A DFA has only a single active state and |
---|
163 | allows to search the text at O(n) worst-case optimal. It is well known that the |
---|
164 | conversion of the NFA to the DFA may result in the state explosion problem. |
---|
165 | That is the resultant DFA may have O($2^m$) states. |
---|
166 | |
---|
167 | Thompson's original work marked the beginning of a long line of |
---|
168 | regular expression pattern matching methods that |
---|
169 | process an input string, character-at-a-time, |
---|
170 | and that transition from state to state |
---|
171 | according to the current state and the next input character. |
---|
172 | |
---|
173 | Whereas traditional automata techniques acheive |
---|
174 | O(n) worst-case optimal efficiency, simple string matching algorithms, |
---|
175 | such as the Boyer-Moore family of algorithms, skip input characters |
---|
176 | to achieve sublinear times in the average case \cite{boyer1977fast}. |
---|
177 | |
---|
178 | Boyer-Moore methods, begin comparison from the end of the pattern instead of the |
---|
179 | beginning and precompute skip information |
---|
180 | to determine how far ahead a pattern search can skip in the input whenever |
---|
181 | a non-matching character is encountered. Generally, Boyer-Moore family |
---|
182 | algorithms improve faster as the pattern being searched for becomes longer. |
---|
183 | In many cases, the techniques used to skip characters in simple string matching |
---|
184 | approaches can be extended to regular expression matching. |
---|
185 | Widely known techniques used to facilitate character skipping in regular |
---|
186 | expression matching include necessary strings and backward suffix matching |
---|
187 | inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}. |
---|
188 | |
---|
189 | |
---|
190 | \subsection{Bit-parallel Simulation of Automata} |
---|
191 | |
---|
192 | Define bit-parallelism \cite{Baeza-yates_anew} |
---|
193 | |
---|
194 | Shift-Or / Shift-And \cite{wu1992fast} |
---|
195 | |
---|
196 | Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm) |
---|
197 | |
---|
198 | \subsection{Software Tools} |
---|
199 | |
---|
200 | %Thompson created the first grep (UNIX grep) as a standalone adaptation |
---|
201 | %of the regular expression parser he had written for the UNIX ed utility. |
---|
202 | %In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep. |
---|
203 | %Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the |
---|
204 | %time it took to build a complete finite automaton for the regular expression before it could even start searching. |
---|
205 | %Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time |
---|
206 | %to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep. |
---|
207 | %It took zero set-up time and just one additional test in the inner loop. |
---|
208 | %http://pages.cs.wisc.edu/~mdant/cs520_4.html |
---|
209 | |
---|
210 | %Given a regular expression R and a test T the regular expression matching |
---|
211 | %problem finds all ending position of substrings in Q that matches a string in |
---|
212 | %the language denoted by R. |
---|
213 | %The behaviour of Gnu grep, agrep, and nr-grep are differ ... |
---|
214 | %Gnu grep |
---|
215 | %agrep |
---|
216 | %nr-grep |
---|
217 | %re2 |
---|
218 | |
---|
219 | |
---|
220 | \section{Bit-parallel Data Streams} |
---|
221 | \label{Bit-parallel Data Streams} |
---|
222 | |
---|
223 | The bit-parallel data streams use the wide |
---|
224 | SIMD registers commonly found on commodity processors |
---|
225 | to process byte positions at a time using |
---|
226 | bitwise logic, shifting and other operations. |
---|
227 | |
---|
228 | A signficant advantage of the bit-parallel |
---|
229 | data stream method over other |
---|
230 | pattern matching methods that rely on |
---|
231 | bit-parallel automata simulation |
---|
232 | is the potential to skip full register width |
---|
233 | number of characters in low occurence frequency text. % reword |
---|
234 | |
---|
235 | |
---|
236 | Skip characters register width. |
---|
237 | |
---|
238 | \subsection{Match Star} |
---|
239 | |
---|
240 | %Wikipedia |
---|
241 | Backtracking is a general algorithm for finding all solutions to some computational problem, |
---|
242 | that incrementally builds candidates to the solutions. |
---|
243 | |
---|
244 | \section{Compiler Technology} |
---|
245 | \label{Compiler Technology} |
---|
246 | |
---|
247 | \section{Methodology} |
---|
248 | \label{Methodology} |
---|
249 | %\input{methodology.tex} |
---|
250 | |
---|
251 | We compare the performance of our bit-parallel data stream techniques against |
---|
252 | Gnu grep, agrep, nr-grep, and re2. |
---|
253 | |
---|
254 | |
---|
255 | |
---|
256 | \section{Experimental Results} |
---|
257 | \label{Experimental Results} |
---|
258 | %\input{results.tex} |
---|
259 | |
---|
260 | \section{Conclusion} |
---|
261 | \label{Conclusion} |
---|
262 | %\input{conclusion.tex} |
---|
263 | |
---|
264 | { |
---|
265 | \bibliographystyle{acm} |
---|
266 | \bibliography{reference} |
---|
267 | } |
---|
268 | |
---|
269 | \end{document} |
---|