\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}

%opening
\title{Fast Regular Expression Matching with Bit-parallel Data Streams}
\author{
{Robert D. Cameron} \\
\and
{Kenneth S. Herdy} \\
\and
{Ben Hull} \\
\and
{Thomas C. Shermer} \\
\\School of Computing Science
\\Simon Fraser University
}
\begin{document}

\date{}
\maketitle

\begin{abstract}

A parallel regular expression matching method is introduced and compared
with existing approaches for efficient on-line search.
The method is based on the concept of bit-parallel data streams,
in which parallel streams of bits are formed such
that each stream comprises bits in one-to-one correspondence with the
character code units of a source data stream.

An implementation of the techniques in the form of a regular expression
compiler is discussed. The compiler accepts a regular expression and
forms unbounded bit-parallel data stream statements. Bit-parallel operations
are then transformed into a low-level C-based implementation for compilation into native
pattern matching application code. These low-level C-based implementations take advantage of
the SIMD (single-instruction multiple-data) capabilities of commodity
processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
On processors supporting $W$-bit addition operations, the method processes $W$ source characters
in parallel and performs up to $W$ finite state transitions per clock cycle.
We further improve our method through the introduction of a new bit-parallel scanning primitive,
{\em Match Star}, which performs a parallel Kleene closure operation over character classes
without backtracking.

We evaluate the performance of our method in comparison with several widely known {\em grep}
implementations: {\em GNU grep}, {\em agrep}, and {\em nr-grep}.
Performance results are analyzed using the performance monitoring counters
of commodity hardware. Overall, our results demonstrate dramatic speed-ups
over other publicly available software.

\end{abstract}

\section{Introduction}
\label{Introduction}
%\input{introduction.tex}

Regular expression matching is an extensively studied problem with applications in
various domains, such as text processing and bioinformatics. A multitude of algorithms
and software tools exist to address the specific demands of these domains.

The pattern matching problem can be
stated as follows. Given a text $T_{1..n}$ of $n$ characters and a pattern $P$,
find all the text positions of $T$ that start an occurrence of $P$. Alternatively,
one may want all the final positions of occurrences. Some applications require
slightly different output, such as the lines that match the pattern.

A pattern $P$ can be a simple string, but it can also be a regular expression.
A regular expression is an expression that specifies a set of strings
and is composed of (i) simple strings and (ii) the
union, concatenation and Kleene closure of other regular expressions.
To avoid parentheses, it is assumed that the Kleene star has the highest precedence,
followed by concatenation and then alternation; however, most formalisms provide grouping
operators that allow the scope and precedence of operators to be defined explicitly.
Readers unfamiliar with the concept of regular expression matching are referred
to classical texts such as \cite{aho2007}.

Regular expression matching is typically performed using one of a wide variety of
publicly available software tools designed for efficient on-line pattern matching,
for example, UNIX grep, GNU grep, agrep, cgrep, nr-grep, and Perl regular
expressions \cite{abou-assaleh2004}. Among these tools, GNU grep (egrep),
agrep, and nr-grep are widely considered the fastest regular expression
matching tools in practice \cite{navarro2000}.

% source: Parallel Scanning with BitStream Addition: An XML Case Study
Although the traditional finite state machine methods used in the scanning and parsing of
text streams are considered to be the hardest of the ``13 dwarves'' to parallelize
[1], bit-parallel data stream technology shows considerable promise as a general
approach to regular expression matching. In this approach, character
streams are processed $W$ positions at a time using the $W$-bit SIMD registers
commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips).
This is achieved by first slicing the byte stream into eight separate basis bitstreams, one for each bit
position within the byte. These basis bitstreams are then combined with bitwise
logic and shifting operations to compute further parallel bit streams of interest.
We further increase the performance of our approach with the introduction of a new parallel
scanning primitive termed {\em Match Star}.

The remainder of this paper is organized as follows.
Section~\ref{Basic Concepts} introduces the notation and basic concepts used throughout this paper.
Section~\ref{Background} presents background material on classical automata-based approaches
to regular expression matching and describes the {\em grep} family of utilities.
Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
Section~\ref{Methodology} describes the evaluation framework, and Section~\ref{Experimental Results}
presents a detailed performance analysis of our data parallel bitstream techniques in comparison with
several software tools. Section~\ref{Conclusion} concludes the paper.

The fundamental contribution of this paper is a fully general approach to regular expression matching
using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
Specific contributions include:
\begin{itemize}
\item the compilation of regular expressions into unbounded bit-parallel data stream equations;
\item documentation of the compilation of character classes into bit-parallel character class data streams;
\item the {\em Match Star} parallel scanning primitive;
\item efficient support for Unicode characters.
\end{itemize}

\section{Basic Concepts}
\label{Basic Concepts}
In this section, we define the notation and basic concepts used throughout this paper.

\subsection{Notation}
We use the following notation. Let $P=p_{1}p_{2}\ldots{}p_{m}$
be a pattern of length $m$ and $T=t_{1}t_{2}\ldots{}t_{n}$
be a text of length $n$, both defined over a finite alphabet $\Sigma$ of size $\alpha$.
The goal of simple pattern matching is to find all the text positions of $T$ that follow
an occurrence of $P$. $P$ may be a simple pattern, an extended pattern, or a regular expression.

We use C-like notation for the bitwise operations $\lnot{}$, $\lor{}$, $\land{}$ and $\oplus{}$,
which represent bitwise NOT, OR, AND and XOR, respectively. The operators $\ll{}k$ and $\gg{}k$
represent logical left shift and logical right shift by $k$ bit positions, respectively
(for example, ``shift left by $n$'' is written $\ll{}n$).
Vacant bit positions are filled in with zeroes.
\subsection{Regular Expressions}

% Define regular expressions (a recursive def), character classes, bounded repetition

\section{Background}
\label{Background}
%\input{background.tex}

\subsection{Regular Expressions and Finite Automata}

The origins of regular expression matching date back to the automata theory
and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFAs) for regular expression matching.
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is possible to search a text of length $n$ using the
NFA directly in $O(mn)$ worst-case time. Often, a more efficient choice
is to convert the NFA into a deterministic finite automaton (DFA). A DFA has only a single
active state, which allows the text to be searched in $O(n)$ worst-case optimal time.
It is well known, however, that the conversion of an NFA to a DFA may cause state explosion;
that is, the resultant DFA may have $O(2^m)$ states.

Thompson's original work marked the beginning of a long line of
regular expression pattern matching methods that
process an input string character at a time,
transitioning from state to state
according to the current state and the next input character.

Whereas traditional automata techniques achieve
$O(n)$ worst-case optimal efficiency, simple string matching algorithms,
such as the Boyer-Moore family of algorithms, skip input characters
to achieve sublinear times in the average case \cite{boyer1977fast}.

Boyer-Moore methods begin comparison from the end of the pattern instead of the
beginning and precompute skip information
to determine how far ahead a pattern search can skip in the input whenever
a non-matching character is encountered. Generally, algorithms in the Boyer-Moore family
perform better as the pattern being searched for becomes longer.
In many cases, the techniques used to skip characters in simple string matching
approaches can be extended to regular expression matching.
Widely known techniques used to facilitate character skipping in regular
expression matching include necessary strings and backward suffix matching
inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.

\subsection{Bit-parallel Simulation of Automata}

Bit-parallelism \cite{Baeza-yates_anew} exploits the fact that a single machine word
can represent many automaton states at once, so that one bitwise operation
updates all of those states simultaneously.
The Shift-Or and Shift-And algorithms \cite{wu1992fast} apply this idea to simulate
an NFA for the pattern directly with shifts and bitwise logic,
processing one text character at a time.
Bit-parallel suffix automata extend the approach to support character skipping,
as in the Backward Non-Deterministic Dawg Matching (BNDM) algorithm \cite{navarro1998bit}.
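To make the idea concrete, a minimal Shift-And matcher for a plain string pattern of length at most 64 can be sketched in C as follows; this is a textbook illustration (function name ours), not the implementation used by agrep or the other tools discussed here:

```c
#include <stdint.h>
#include <string.h>

/* Count occurrences of a simple string pattern in text using Shift-And.
   B[c] has bit i set iff pattern[i] == c; state D has bit i set iff
   the prefix pattern[0..i] matches a suffix of the text read so far. */
static int shift_and_count(const char *pattern, const char *text) {
    size_t m = strlen(pattern);
    uint64_t B[256] = {0};
    for (size_t i = 0; i < m; i++)
        B[(unsigned char)pattern[i]] |= (uint64_t)1 << i;
    uint64_t D = 0, accept = (uint64_t)1 << (m - 1);
    int count = 0;
    for (const char *t = text; *t; t++) {
        D = ((D << 1) | 1) & B[(unsigned char)*t];  /* one transition step */
        if (D & accept) count++;                    /* match ends here     */
    }
    return count;
}
```

For example, `shift_and_count("aba", "ababa")` finds the two (overlapping) occurrences. Each text character costs a constant number of word operations regardless of the pattern length, which is the essence of the bit-parallel simulation of automata.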
\subsection{Software Tools} % Methodology section?

%This subsection relates the performance of well known tools (Gnu grep, agrep, nr-grep)
%to the above pattern matching algorithms.

%Source: http://pages.cs.wisc.edu/~mdant/cs520_4.html

%Thompson created the first grep (UNIX grep) as a standalone adaptation
%of the regular expression parser he had written for the UNIX ed utility.
%In 1976, Aho improved upon Thompson's implementation with a DFA-based implementation called egrep.
%Egrep was faster than grep for simple patterns, but for more complex searches it lagged behind because of the
%time it took to build a complete finite automaton for the regular expression before it could even start searching.
%Since grep used a nondeterministic finite automaton, it took less time to build the state machine but more time
%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
%It took zero set-up time and just one additional test in the inner loop.

%Source: Chapter 7 Flexible Pattern Matching \cite{navarro2000}.

%Gnu grep
%agrep
%nr-grep
%re2

\section{Bit-parallel Data Streams}
\label{Bit-parallel Data Streams}

% Contrast 'bit-parallel data streams' with the bit-parallel simulation of automata.

The bit-parallel data stream methods use the wide
SIMD registers commonly found on commodity processors
to process $W$ byte positions at a time using
bitwise logic, shifting and other operations.

A significant advantage of the bit-parallel
data stream methods over other
pattern matching methods that rely on
bit-parallel automata simulation
is the potential to skip ahead a full SIMD register width
of characters at a time.

\subsection{Match Star}

%Source: http://en.wikipedia.org/wiki/Backtracking
%Backtracking is a general algorithm for finding all solutions to some computational problem,
%that incrementally builds candidates to the solutions.

Match Star takes a marker bitstream and a character class bitstream as input. It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

Figure~\ref{fig:scan1} illustrates the Match Star method. The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data.

In the first operation, marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions, giving $T_0$. Next, this temporary marker bitstream is added to the character class bitstream, giving $T_1$. $T_1$ has 1s in three kinds of positions: immediately following a block of character class positions that spanned one or more marker positions; at any character class position that was not affected by the addition (and is not part of the desired output); and at any marker position that was not the first in its block of character class positions. Any character class position that has a 0 in $T_1$ was affected by the addition and is part of the desired output. These positions are obtained, and the undesired 1 bits are removed, by XORing with the character class stream to give $T_2$. $T_2$ is now only missing the marker positions that were removed in the first step as well as the marker positions that were 1s in $T_1$. The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
source data & \verb`--142578---125-231-----127--5394---94761205-`\\
$M_0$ & \verb`.......1......1..1..1...1.............1..1..`\\
$D = $\verb:[0-9]: & \verb`..111111...111.111.....111..1111...11111111.`\\
$T_0 = M_0 \wedge D$ & \verb`.......1.........1......1.............1..1..`\\
$T_1 = T_0 + D$ & \verb`.1.........1111.......1..1..1111..1...1...1.`\\
$T_2 = T_1 \oplus D$ & \verb`.1111111......1111....111.........1111.111..`\\
$M_1 = T_2 \, | \, M_0$ & \verb`.1111111......1111..1.111.........11111111..`
\end{tabular}
\end{center}
\caption{Match Star}
\label{fig:scan1}
\end{figure}
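On a single machine word (ignoring carry propagation across word boundaries), the four steps above reduce to one addition and three bitwise operations, as the following C sketch of the operation shows; the arguments correspond to $M_0$ and $D$ in Figure~\ref{fig:scan1}:

```c
#include <stdint.h>

/* MatchStar on one 64-bit block: M is the marker stream,
   D the character class stream. Carry propagation across
   block boundaries is ignored in this sketch. */
static uint64_t match_star(uint64_t M, uint64_t D) {
    uint64_t T0 = M & D;   /* markers that sit on class positions      */
    uint64_t T1 = T0 + D;  /* addition carries through runs of 1s in D */
    uint64_t T2 = T1 ^ D;  /* keep exactly the positions reached       */
    return T2 | M;         /* restore the original markers             */
}
```

For example, with a class run in bit positions 0 through 2 ($D = \mathtt{0x7}$) and a marker at position 0 ($M = \mathtt{0x1}$), the result is $\mathtt{0xF}$: the marker may advance through the run and come to rest at the position just past it.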
\section{Compiler Technology}
\label{Compiler Technology}
%\input{compilertechnology.tex}

\section{Methodology}
\label{Methodology}
%\input{methodology.tex}

We compare the performance of our bit-parallel data stream techniques against
several fast regular expression matching implementations:
GNU grep, agrep \cite{Wu92agrep-}, nr-grep, and re2.

\section{Experimental Results}
\label{Experimental Results}
%\input{results.tex}

\input{GPUImpl.tex}

\section{Conclusion}
\label{Conclusion}
%\input{conclusion.tex}
{
\bibliographystyle{acm}
\bibliography{reference}
}

\end{document}