[3123] | 1 | \documentclass[a4paper,10pt]{article} |
---|
| 2 | \usepackage[utf8]{inputenc} |
---|
| 3 | |
---|
| 4 | %opening |
---|
[3149] | 5 | \title{Fast Regular Expression Matching with Bit-parallel Data Streams} |
---|
[3123] | 6 | \author{ |
---|
| 7 | {Robert D. Cameron} \\ |
---|
| 8 | \and |
---|
| 9 | {Kenneth S. Herdy} \\ |
---|
| 10 | \and |
---|
| 11 | {Ben Hull} \\ |
---|
| 12 | \and |
---|
| 13 | {Thomas C. Shermer} \\ |
---|
| 14 | \\School of Computing Science |
---|
| 15 | \\Simon Fraser University |
---|
| 16 | } |
---|
| 17 | \begin{document} |
---|
| 18 | |
---|
| 19 | \date{} |
---|
| 20 | \maketitle |
---|
| 21 | |
---|
| 22 | \begin{abstract} |
---|
| 23 | |
---|
[3149] | 24 | An parallel regular expression matching pattern method is |
---|
| 25 | introduced and compared with the state of the art in software tools designed for |
---|
| 26 | efficient on-line search. %such as the {\em grep} family pattern matching tools. |
---|
[3138] | 27 | The method is based on the concept of bit-parallel data streams, |
---|
| 28 | in which parallel streams of bits are formed such |
---|
[3123] | 29 | that each stream comprises bits in one-to-one correspondence with the |
---|
[3138] | 30 | character code units of a source data stream. |
---|
[3123] | 31 | |
---|
[3138] | 32 | An implementation of the method in the form of a regular expression |
---|
[3149] | 33 | compiler is discussed. The compiler accepts a regular expression and |
---|
| 34 | forms unbounded bit-parallel data stream operations. Bit-parallel operations |
---|
| 35 | are then transformed into a low-level C-based implementation for compilation into native |
---|
| 36 | pattern matching applications. These low-level C-based implementations take advantage of |
---|
[3138] | 37 | the SIMD (single-instruction multiple-data) capabilities of commodity |
---|
[3149] | 38 | processors to yield a dramatic speed-up over traditional byte-at-a-time approaches. |
---|
| 39 | On processors supporting W-bit addition operations, the method processes W source characters |
---|
[3123] | 40 | in parallel and performs up to W finite state transitions per clock cycle. |
---|
| 41 | |
---|
[3149] | 42 | We introduce a new bit-parallel scanning primitive, {\em Match Star}. |
---|
| 43 | which performs parallel Kleene closure over character classes |
---|
[3154] | 44 | and without backtracking. % expand and rephrase description of Match Star |
---|
[3123] | 45 | |
---|
[3149] | 46 | We evaluate the performance of our method in comparison with several widely known {\em grep} |
---|
| 47 | family implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep}, |
---|
| 48 | and regular expression engines such as {\em Google's re2}. |
---|
| 49 | Performance results are analyzed using the performance monitoring counters |
---|
| 50 | of commodity hardware. Overall, our results demonstrate a dramatic speed-up |
---|
| 51 | over publically available alternatives. |
---|
[3138] | 52 | |
---|
[3123] | 53 | \end{abstract} |
---|
| 54 | |
---|
| 55 | \section{Introduction} |
---|
| 56 | \label{Introduction} |
---|
| 57 | %\input{introduction.tex} |
---|
| 58 | |
---|
[3126] | 59 | Regular expresssion matching is an extensively studied problem with application to |
---|
[3149] | 60 | text processing and bioinformatics and with numerous algorithms |
---|
| 61 | and software tools developed to the address the particular |
---|
| 62 | processing demands. % reword |
---|
[3124] | 63 | |
---|
[3126] | 64 | The pattern matching problem can be |
---|
| 65 | stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P, |
---|
| 66 | find all the text positions of T that start an occurrence of P. Alternatively, |
---|
| 67 | one may want all the final positions of occurrences. Some applications require |
---|
| 68 | slightly different output such as the line that matches the pattern. |
---|
| 69 | |
---|
[3149] | 70 | A pattern P can be a simple string, but it can also be a regular expression. |
---|
| 71 | A regular expression, is an expression that specifies a set of strings |
---|
| 72 | and is composed of (i) simple strings and (ii) the |
---|
[3126] | 73 | union, concatenation and Kleene closure of other regular expressions. |
---|
| 74 | To avoid parentheses it is assumed that the Kleene star has the highest priority, |
---|
| 75 | next concatenation and then alternation, however, most formalisms provides grouping |
---|
| 76 | operators to allow the definition of scope and operator precedence. |
---|
| 77 | Readers unfamiliar with the concept of regular expression matching are referred |
---|
| 78 | classical texts such as \cite{aho2007}. |
---|
| 79 | |
---|
| 80 | Regular expression matching is commonly performed using a wide variety of |
---|
| 81 | publically available software tools for on-line pattern matching. For instance, |
---|
| 82 | UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular |
---|
| 83 | expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep), |
---|
| 84 | agrep, and nrgrep are widely known and considered as |
---|
| 85 | the fastest regular expression matching tools in practice \cite{navarro2000}. |
---|
| 86 | and are of particular interest to this study. |
---|
| 87 | |
---|
| 88 | % simple patterns, extended patterns, regular expressions |
---|
| 89 | |
---|
[3124] | 90 | % motivation / previous work |
---|
[3126] | 91 | Although tradi finite state machine methods used in the scanning and parsing of |
---|
[3124] | 92 | text streams is considered to be the hardest of the â13 dwarvesâ to parallelize |
---|
| 93 | [1], parallel bitstream technology shows considerable promise for these types of |
---|
| 94 | applications [3, 4]. In this approach, character streams are processed W positions |
---|
| 95 | at a time using the W-bit SIMD registers commonly found on commodity processors |
---|
| 96 | (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by |
---|
| 97 | first slicing the byte streams into eight separate basis bitstreams, one for each bit |
---|
| 98 | position within the byte. These basis bitstreams are then combined with bitwise |
---|
| 99 | logic and shifting operations to compute further parallel bit streams of interest. |
---|
| 100 | |
---|
[3154] | 101 | We further increase the potential parallelism in our approach by introducing a new parallel |
---|
| 102 | scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star |
---|
[3124] | 103 | |
---|
[3126] | 104 | The remainder of this paper is organized as follows. |
---|
[3149] | 105 | Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper. |
---|
[3154] | 106 | Section~\ref{Background} presents background material on classical automata-based approaches |
---|
| 107 | to regular expression matching as well as describes the efficient {\em grep} family utilities |
---|
| 108 | and regular expression pattern matching software libraries. |
---|
| 109 | Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques. |
---|
[3138] | 110 | Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. |
---|
[3126] | 111 | Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} |
---|
[3149] | 112 | presents a detailed performance analysis of our data parallel bitstream techniques in comparison to |
---|
[3154] | 113 | several software tools. Section~\ref{Conclusion} concludes the paper. |
---|
[3123] | 114 | |
---|
[3149] | 115 | The fundamental contribution of this paper is fully general approach to regular expression matching |
---|
[3154] | 116 | using bit-parallel data stream operations. The algorithmic aspects of this paper build upon |
---|
| 117 | the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. |
---|
| 118 | Specific contributions include: |
---|
[3149] | 119 | \begin{itemize} |
---|
| 120 | \item compilation of regular expressions into unbounded bit-parallel data stream equations; |
---|
| 121 | \item documentation of character classes compilation into bit-parallel character class data streams; |
---|
[3154] | 122 | \item the Match Star parallel scanning primitive; |
---|
| 123 | \item efficient support for unicode characters. |
---|
[3149] | 124 | \end{itemize} |
---|
| 125 | |
---|
| 126 | \section{Basic Concepts} |
---|
| 127 | \label{Basic Concepts} |
---|
[3154] | 128 | In this section, we define the notation and basic concepts used throughout this paper. |
---|
[3149] | 129 | |
---|
| 130 | \subsection{Notation} |
---|
| 131 | We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$ |
---|
| 132 | be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$ |
---|
[3154] | 133 | be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$. |
---|
| 134 | The goal of simple pattern expression matching is to find all the text positions of T that follow |
---|
| 135 | an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression. |
---|
[3149] | 136 | |
---|
[3154] | 137 | We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$ |
---|
| 138 | represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent |
---|
| 139 | logical left shift, and logical right shift, respectively and are further modulated by |
---|
| 140 | the number of bit positions a given value shall be shifted by, for example ``shift left by n''. |
---|
| 141 | Vacant bit-positions are filled in with zeroes. |
---|
| 142 | |
---|
[3149] | 143 | \subsection{Regular Expressions} |
---|
| 144 | |
---|
[3154] | 145 | % Define regular expressions (a recursive def), character classes, bounded repetition |
---|
[3149] | 146 | |
---|
[3124] | 147 | \section{Background} |
---|
| 148 | \label{Background} |
---|
| 149 | %\input{background.tex} |
---|
| 150 | |
---|
[3149] | 151 | \subsection{Classical Methods} |
---|
| 152 | |
---|
| 153 | \subsection{Regular Expression and Finite Automata} |
---|
| 154 | |
---|
[3138] | 155 | The origins of regular expression matching date back to automata theory |
---|
| 156 | and formal language theory developed by Kleene in the 1950s \cite{kleene1951}. |
---|
| 157 | Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions |
---|
| 158 | to nondeterministic finite automata (NFA) for regular expression matching. |
---|
| 159 | Following Thompson's approach, a regular expression of length m is first converted |
---|
| 160 | to an NFA with O(m) nodes. It is possible to search a text of length n using the |
---|
[3149] | 161 | NFA directly in O(mn) worst case time. Often, a more efficient choice |
---|
[3138] | 162 | is to convert the NFA into a DFA. A DFA has only a single active state and |
---|
| 163 | allows to search the text at O(n) worst-case optimal. It is well known that the |
---|
| 164 | conversion of the NFA to the DFA may result in the state explosion problem. |
---|
| 165 | That is the resultant DFA may have O($2^m$) states. |
---|
[3123] | 166 | |
---|
[3138] | 167 | Thompson's original work marked the beginning of a long line of |
---|
[3154] | 168 | regular expression pattern matching methods that |
---|
| 169 | process an input string, character-at-a-time, |
---|
| 170 | and that transition from state to state |
---|
| 171 | according to the current state and the next input character. |
---|
[3123] | 172 | |
---|
[3154] | 173 | Whereas traditional automata techniques acheive |
---|
| 174 | O(n) worst-case optimal efficiency, simple string matching algorithms, |
---|
| 175 | such as the Boyer-Moore family of algorithms, skip input characters |
---|
| 176 | to achieve sublinear times in the average case \cite{boyer1977fast}. |
---|
[3123] | 177 | |
---|
[3154] | 178 | Boyer-Moore methods, begin comparison from the end of the pattern instead of the |
---|
| 179 | beginning and precompute skip information |
---|
| 180 | to determine how far ahead a pattern search can skip in the input whenever |
---|
| 181 | a non-matching character is encountered. Generally, Boyer-Moore family |
---|
| 182 | algorithms improve faster as the pattern being searched for becomes longer. |
---|
| 183 | In many cases, the techniques used to skip characters in simple string matching |
---|
| 184 | approaches can be extended to regular expression matching. |
---|
| 185 | Widely known techniques used to facilitate character skipping in regular |
---|
| 186 | expression matching include necessary strings and backward suffix matching |
---|
| 187 | inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}. |
---|
[3123] | 188 | |
---|
[3149] | 189 | |
---|
| 190 | \subsection{Bit-parallel Simulation of Automata} |
---|
| 191 | |
---|
[3154] | 192 | Define bit-parallelism \cite{Baeza-yates_anew} |
---|
[3149] | 193 | |
---|
[3154] | 194 | Shift-Or / Shift-And \cite{wu1992fast} |
---|
[3149] | 195 | |
---|
| 196 | Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm) |
---|
| 197 | |
---|
| 198 | \subsection{Software Tools} |
---|
[3138] | 199 | |
---|
[3149] | 200 | %Thompson created the first grep (UNIX grep) as a standalone adaptation |
---|
| 201 | %of the regular expression parser he had written for the UNIX ed utility. |
---|
| 202 | %In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep. |
---|
| 203 | %Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the |
---|
| 204 | %time it took to build a complete finite automaton for the regular expression before it could even start searching. |
---|
| 205 | %Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time |
---|
| 206 | %to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep. |
---|
| 207 | %It took zero set-up time and just one additional test in the inner loop. |
---|
| 208 | %http://pages.cs.wisc.edu/~mdant/cs520_4.html |
---|
| 209 | |
---|
| 210 | %Given a regular expression R and a test T the regular expression matching |
---|
| 211 | %problem finds all ending position of substrings in Q that matches a string in |
---|
| 212 | %the language denoted by R. |
---|
| 213 | %The behaviour of Gnu grep, agrep, and nr-grep are differ ... |
---|
| 214 | %Gnu grep |
---|
| 215 | %agrep |
---|
| 216 | %nr-grep |
---|
| 217 | %re2 |
---|
| 218 | |
---|
[3154] | 219 | |
---|
| 220 | \section{Bit-parallel Data Streams} |
---|
| 221 | \label{Bit-parallel Data Streams} |
---|
| 222 | |
---|
| 223 | The bit-parallel data streams use the wide |
---|
| 224 | SIMD registers commonly found on commodity processors |
---|
| 225 | to process byte positions at a time using |
---|
| 226 | bitwise logic, shifting and other operations. |
---|
| 227 | |
---|
| 228 | A signficant advantage of the bit-parallel |
---|
| 229 | data stream method over other |
---|
| 230 | pattern matching methods that rely on |
---|
| 231 | bit-parallel automata simulation |
---|
| 232 | is the potential to skip full register width |
---|
| 233 | number of characters in low occurence frequency text. % reword |
---|
| 234 | |
---|
| 235 | |
---|
| 236 | Skip characters register width. |
---|
| 237 | |
---|
[3149] | 238 | \subsection{Match Star} |
---|
| 239 | |
---|
[3138] | 240 | %Wikipedia |
---|
| 241 | Backtracking is a general algorithm for finding all solutions to some computational problem, |
---|
| 242 | that incrementally builds candidates to the solutions. |
---|
| 243 | |
---|
[3126] | 244 | \section{Compiler Technology} |
---|
| 245 | \label{Compiler Technology} |
---|
| 246 | |
---|
[3123] | 247 | \section{Methodology} |
---|
| 248 | \label{Methodology} |
---|
| 249 | %\input{methodology.tex} |
---|
| 250 | |
---|
[3149] | 251 | We compare the performance of our bit-parallel data stream techniques against |
---|
| 252 | Gnu grep, agrep, nr-grep, and re2. |
---|
[3126] | 253 | |
---|
| 254 | |
---|
| 255 | |
---|
[3123] | 256 | \section{Experimental Results} |
---|
[3135] | 257 | \label{Experimental Results} |
---|
[3123] | 258 | %\input{results.tex} |
---|
| 259 | |
---|
[3135] | 260 | \section{Conclusion} |
---|
| 261 | \label{Conclusion} |
---|
[3123] | 262 | %\input{conclusion.tex} |
---|
| 263 | |
---|
| 264 | { |
---|
| 265 | \bibliographystyle{acm} |
---|
| 266 | \bibliography{reference} |
---|
| 267 | } |
---|
| 268 | |
---|
| 269 | \end{document} |
---|