\section{Running-time Comparison with DFA and NFA Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- and DFA-based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation
of the regular expression using the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as Parabix is an experimental research compiler
that is not optimized.
This leads to a bias in our results, as our timings for nrgrep and grep
include the time taken for preprocessing.
We have attempted to minimize the bias by performing our tests with large
inputs, so that the text-scanning costs dominate the preprocessing costs.
We furthermore believe that, if a special-purpose optimized compiler for
regular expressions were built, its inclusion in bitstream grep would not
substantially increase the running time, particularly for large input
texts--the compilation involved is straightforward.

For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters (so that, for example,
\verb!(a|b)*c! qualifies, while \verb!(ab)*c! does not).
This is a broad class of regular expressions, covering the majority of
common uses of grep.

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m)$ statements long (with one operation per
statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
This gives $O(\frac{nm}{w})$ work for an input text of length $n$.
We can further break this down by noting that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression.

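To illustrate the style of code involved, the following is a minimal
sketch (with assumed helper names, not actual Parabix output) of
bitstream matching for the regular expression \verb!a*b!, using Python's
arbitrary-precision integers as bitstreams; bit $i$ of a stream
corresponds to text position $i$, and the single carry-propagating
addition implements the Kleene star.

```python
# A minimal sketch (assumed names, not Parabix output) of bitstream
# matching for the regular expression a*b, using Python integers as
# arbitrary-width bitstreams: bit i is set iff the property holds at
# text position i.

def char_class(text, chars):
    # Character-class stream: bit i set iff text[i] is in chars.
    bits = 0
    for i, ch in enumerate(text):
        if ch in chars:
            bits |= 1 << i
    return bits

def advance(markers):
    # Shift every match marker one position forward in the text.
    return markers << 1

def match_star(markers, cc):
    # MatchStar: extend each marker through a run of cc characters.
    # The addition is the only carry-propagating operation.
    return (((markers & cc) + cc) ^ cc) | markers

def find_a_star_b(text):
    cc_a = char_class(text, "a")
    cc_b = char_class(text, "b")
    markers = (1 << len(text)) - 1       # a match may begin anywhere
    markers = match_star(markers, cc_a)  # consume a*
    markers = advance(markers & cc_b)    # consume the final b
    return markers  # bit i+1 set iff a match ends at position i
```

For example, \verb!find_a_star_b("caab")! yields a stream with bit 4 set,
marking the match \verb!aab! that ends at position 3.  On a real machine,
each of these operations processes $w$ text positions at once.
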
Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation--and limited live ranges for bitstream
variables--keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

Compare this with NFA methods.
In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table can have
$\sigma 2^{\Theta(m)}$ entries, where $\sigma$ is the size of the input
alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$ a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.

To improve the table size, several authors have separated the transition
table into several tables, each indexed by a subset of the bits in the bit
set representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $2^{k-3} m \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.

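A sketch of this table-splitting scheme follows (the helper names are
hypothetical; \verb!delta! is assumed to map a (state, character) pair to
a successor-state bitset, with state sets represented as Python integers).

```python
def build_tables(delta, m, sigma, k):
    # One table per k-bit chunk of the state set.  Entry [c][chunk]
    # holds the union of successors, on character c, of every state
    # present in that chunk, stored as an integer bitset.
    n_tables = (m + k - 1) // k
    tables = []
    for j in range(n_tables):
        tbl = [[0] * (1 << k) for _ in range(sigma)]
        for c in range(sigma):
            for chunk in range(1 << k):
                succ = 0
                for b in range(k):
                    state = j * k + b
                    if state < m and (chunk >> b) & 1:
                        succ |= delta.get((state, c), 0)
                tbl[c][chunk] = succ
        tables.append(tbl)
    return tables

def step(tables, k, state_set, c):
    # One input character: roughly m/k table lookups, with the
    # results combined by boolean ORs.
    result = 0
    for j, tbl in enumerate(tables):
        chunk = (state_set >> (j * k)) & ((1 << k) - 1)
        result |= tbl[c][chunk]
    return result
```

With $k=1$ this degenerates to a scheme of $m$ single-bit-indexed tables,
and with $k=\frac{m}{2}$ to a two-table scheme.
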
The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with $m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.

Navarro and Raffinot use $k = \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and $1$ boolean OR operation per character to
combine the lookups.

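The trade-off between these two extremes can be checked numerically
against the size formula above; the following sanity check (not part of
any measured implementation) evaluates
$\frac{m}{k} \cdot \sigma 2^k \cdot \frac{m}{8}$ for sample parameters.

```python
def total_table_bytes(m, sigma, k):
    # m/k tables, each with sigma * 2**k entries of m/8 bytes:
    # (m/k) * sigma * 2**k * (m/8) = m^2 * 2**(k-3) * sigma / k bytes.
    return (m // k) * sigma * (1 << k) * m // 8

# For m = 16 states over a byte alphabet (sigma = 256):
# k = 1 (Thompson) needs 16 KiB of tables, while k = 8 (two tables)
# needs 256 KiB -- fewer lookups per character, but far more memory.
```
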
111 | |
---|
112 | |
---|
113 | |
---|
114 | |
---|