[3121] | 1 | \documentclass[a4paper,10pt]{article} |
---|
| 2 | \usepackage[utf8]{inputenc} |
---|
| 3 | |
---|
[3123] | 4 | \def \Bitstream{Bit Stream} |
---|
| 5 | \def \bitstream{bit stream} |
---|
| 6 | |
---|
[3121] | 7 | %opening |
---|
[3123] | 8 | \title{Fast Regular Expression Matching using Parallel \Bitstream{}s} |
---|
[3121] | 9 | \author{ |
---|
| 10 | {Robert D. Cameron} \\ |
---|
| 11 | \and |
---|
| 12 | {Kenneth S. Herdy} \\ |
---|
| 13 | \and |
---|
| 14 | {Ben Hull} \\ |
---|
| 15 | \and |
---|
| 16 | {Thomas C. Shermer} \\ |
---|
| 17 | \\School of Computing Science |
---|
| 18 | \\Simon Fraser University |
---|
| 19 | } |
---|
| 20 | \begin{document} |
---|
| 21 | |
---|
| 22 | \date{} |
---|
| 23 | \maketitle |
---|
| 24 | |
---|
| 25 | \begin{abstract} |
---|
| 26 | |
---|
[3123] | 27 | A data parallel regular expression matching method using the concept of bitstream technology |
---|
| 28 | is introduced and studied in application to the problem of fast regular expression matching. |
---|
| 29 | |
---|
| 30 | The method is based on the concept of parallel |
---|
| 31 | \bitstream{} technology, in which parallel streams of bits are formed such |
---|
| 32 | that each stream comprises bits in one-to-one correspondence with the |
---|
| 33 | character code units of a source data stream. |
---|
| 34 | |
---|
| 35 | On processors supporting W-bit addition operations, the method processes W source characters |
---|
| 36 | in parallel and performs up to W finite state transitions per clock cycle. |
---|
| 37 | |
---|
| 38 | Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives. |
---|
| 39 | |
---|
[3121] | 40 | \end{abstract} |
---|
| 41 | |
---|
| 42 | \section{Introduction} |
---|
| 43 | \label{Introduction} |
---|
| 44 | %\input{introduction.tex} |
---|
| 45 | |
---|
[3124] | 46 | Regular expresssion matching is an extensively studied problem with a multitude |
---|
| 47 | of algorithms and software tools developed to the demands of particular problem contexts. |
---|
| 48 | |
---|
[3123] | 49 | Given a text T$_{1..n}$ of n characters and a pattern P, the pattern matching problem can be |
---|
| 50 | stated as follows. Find all the text positions of T that start an occurrence of P. |
---|
| 51 | Alternatively, one may want all the final positions of occurrences. Some |
---|
| 52 | applications require slightly different output such as the line that matches the pattern. |
---|
| 53 | |
---|
| 54 | The pattern P can be just a simple string, |
---|
| 55 | but it can also be, for example, a regular expression. |
---|
| 56 | |
---|
| 57 | A regular expression, or pattern, is an expression that specifies a set of strings. |
---|
[3124] | 58 | A regular expression is composed of (i) simple strings (ii) the empty or (ii) |
---|
[3123] | 59 | union, concatenation and Kleene closure of other regular expressions. |
---|
| 60 | To avoid parentheses it is assumed that the Kleene star has the highest priority, |
---|
| 61 | next concatenation and then alternation, however, most formalisms provides grouping |
---|
| 62 | operators to allow the definition of scope and operator precedence. |
---|
| 63 | Readers unfamiliar with the concept of regular expression matching are referred |
---|
| 64 | classical texts such as \cite{aho2007}. |
---|
| 65 | |
---|
| 66 | Regular expression matching is commonly performed using a variety of |
---|
[3124] | 67 | publically available software tools. The most prominent, UNIX grep, |
---|
| 68 | Gnu grep, agrep, cgrep, nrgrep, and Perl regular |
---|
[3123] | 69 | expressions \cite{Abou-assaleh04surveyof}. |
---|
| 70 | |
---|
[3124] | 71 | Amongst these Gnu grep, agrep, and nrgrep are widely known and considered as |
---|
| 72 | the fastest regular expression matching tools in practice \cite{}. |
---|
[3123] | 73 | |
---|
[3124] | 74 | Of particular interest to this study are the performance oriented Gnu grep, agrep, and nrgrep. |
---|
[3123] | 75 | |
---|
[3124] | 76 | % motivation / previous work |
---|
| 77 | Although the finite state machine methods used in the scanning and parsing of |
---|
| 78 | text streams is considered to be the hardest of the â13 dwarvesâ to parallelize |
---|
| 79 | [1], parallel bitstream technology shows considerable promise for these types of |
---|
| 80 | applications [3, 4]. In this approach, character streams are processed W positions |
---|
| 81 | at a time using the W-bit SIMD registers commonly found on commodity processors |
---|
| 82 | (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by |
---|
| 83 | first slicing the byte streams into eight separate basis bitstreams, one for each bit |
---|
| 84 | position within the byte. These basis bitstreams are then combined with bitwise |
---|
| 85 | logic and shifting operations to compute further parallel bit streams of interest. |
---|
| 86 | |
---|
| 87 | We further increase the parallelism in our methods by introducing a new parallel |
---|
| 88 | scanning primitive which we have coined 'Match Star' that returns all matches in a single |
---|
| 89 | operation and eliminates the need for back tracking ... (ELABORATE) |
---|
| 90 | |
---|
| 91 | --- STATE the content of the next sections. --- |
---|
| 92 | |
---|
| 93 | |
---|
| 94 | We compare the performance of our parallel \bitstream{} techniques against |
---|
[3123] | 95 | various grep concentrate on the simpler case of |
---|
| 96 | reporting initial or final occurrence positions. |
---|
| 97 | |
---|
[3124] | 98 | |
---|
| 99 | \section{Background} |
---|
| 100 | \label{Background} |
---|
| 101 | %\input{background.tex} |
---|
| 102 | |
---|
[3123] | 103 | % Background |
---|
| 104 | |
---|
| 105 | % History |
---|
| 106 | |
---|
| 107 | Historically, the origins of regular expression matching date back to automata theory |
---|
| 108 | and formal language theory developed by Kleene in the 1950s \cite{kleene1951representation}. |
---|
| 109 | |
---|
| 110 | In 1959, Dana and Scott demonstrated that |
---|
| 111 | NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA |
---|
| 112 | state corresponds to a set of NFA states. |
---|
| 113 | |
---|
| 114 | Thompson, in 1968, is credited with the first construction to convert regular expressions |
---|
| 115 | to nondeterministic finite automata (NFA) for regular expression matching. |
---|
| 116 | |
---|
| 117 | Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of |
---|
| 118 | regular expression implementations that construct automata for pattern matching. |
---|
| 119 | |
---|
| 120 | The traditional technique [16] to search a regular expression of length m in |
---|
| 121 | a text of length n is to first convert the expression into a non-deterministic |
---|
| 122 | automaton (NFA) with O(m) nodes. It is possible to search the text using the |
---|
| 123 | NFA directly in O(mn) worst case time. The cost comes from the fact that |
---|
| 124 | more than one state of the NFA may be active at each step, and therefore all |
---|
| 125 | may need to be updated. A more effcient choice is to convert the NFA into a |
---|
| 126 | DFA. A DFA has only a single active state and allows to search the text at |
---|
| 127 | O(n) worst-case optimal. The problem with this approach is that the DFA |
---|
| 128 | may have O(2^m) states. |
---|
| 129 | |
---|
[3124] | 130 | In general, the general process is first to build a |
---|
| 131 | NFA from the regular expression and simulate the NFA on text input, |
---|
| 132 | or alternatively to convert the NFA into a |
---|
| 133 | DFA, optionally minimize the number of states in the DFA, |
---|
| 134 | and finally simulate the DFA on text input. |
---|
[3123] | 135 | |
---|
[3121] | 136 | |
---|
| 137 | \section{Methodology} |
---|
| 138 | \label{Methodology} |
---|
| 139 | %\input{methodology.tex} |
---|
| 140 | |
---|
| 141 | \section{Experimental Results} |
---|
| 142 | \label{results} |
---|
| 143 | %\input{results.tex} |
---|
| 144 | |
---|
| 145 | \section{Conclusion and Future Work} |
---|
| 146 | \label{conclusion} |
---|
| 147 | %\input{conclusion.tex} |
---|
| 148 | |
---|
[3123] | 149 | { |
---|
| 150 | \bibliographystyle{acm} |
---|
| 151 | \bibliography{reference} |
---|
| 152 | } |
---|
| 153 | |
---|
[3121] | 154 | \end{document} |
---|