\section{Running-time Comparison with DFA and NFA Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- and DFA-based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation of the
regular expression using the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as Parabix is an experimental research compiler that
is not optimized.
This introduces a bias in our results, as our timings for nrgrep and grep
include the time taken for preprocessing.
We have attempted to minimize the bias by performing our tests with large
inputs, so that the text-scanning costs dominate the preprocessing costs.
We furthermore believe that, if a special-purpose optimized compiler for
regular expressions were built, its inclusion in bitstream grep would not
substantially increase the running time, particularly for large input
texts, since the compilation involved is straightforward.

For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
This is a broad class of regular expressions, covering the majority of
common uses of grep.

Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma = 1114112$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log w)$ work per loop iteration.

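To make the transposition step concrete, the following is a minimal Python sketch of the data layout it produces (a naive loop for exposition only; Parabix computes the same transform in $O(\log w)$ SIMD steps per $w$ characters):

```python
def transpose(block):
    """Transpose a block of 8-bit characters into 8 parallel bitstreams.

    Bit j of stream i is bit i (most significant first) of byte j, so
    each stream collects one bit position across all characters.
    """
    streams = [0] * 8
    for j, byte in enumerate(block):
        for i in range(8):
            if byte & (0x80 >> i):
                streams[i] |= 1 << j
    return streams
```

For ASCII input, stream 0 is identically zero and only seven streams carry information, which is one way to see that $\log \sigma = 7$ bits per character suffice.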
In total, this is $O(m \log \sigma + \log w)$ work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures),
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$ with current and foreseeable
technology; we do not expect to see $\log w$ skyrocket.
So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n m \log \sigma}{w})$ work.

We further note that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.

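The role of these carry-propagating additions can be illustrated with the MatchStar operation from the Parabix literature, which advances match markers through a run of class characters with a single addition. A sketch over Python's unbounded integers (a real implementation uses $w$-bit words):

```python
def match_star(markers, char_class):
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M.

    markers: bitstream of current match positions (bit i = position i).
    char_class: bitstream marking characters of the Kleene-starred class.
    The addition's carry ripples through each contiguous run of class
    characters, moving every marker to all positions reachable by
    consuming zero or more characters of the class.
    """
    matches = (markers & char_class) + char_class
    return (matches ^ char_class) | markers
```

For example, with text ``aab'' and the class stream for `a` equal to \texttt{0b011}, a marker at position 0 (\texttt{0b001}) yields \texttt{0b111}: the marker flows through the run of a's and stops just past it.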
Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation, together with the limited live ranges of bitstream
variables, keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

Compare this with NFA methods.
In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
%%, where $\sigma$ is the size of the input alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$, a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.

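A sketch of this base method in Python, assuming a hypothetical encoding \texttt{delta[c][q]} of the NFA that gives the successor bit set of state $q$ on character $c$ (all names here are illustrative, not any particular implementation's API):

```python
def build_table(delta, m, sigma):
    """Precompute table[c][S]: the successor state set for each of the
    sigma * 2^m (character, state-set) pairs -- the 2^Theta(m) blowup."""
    table = [[0] * (1 << m) for _ in range(sigma)]
    for c in range(sigma):
        for S in range(1 << m):
            nxt = 0
            for q in range(m):
                if (S >> q) & 1:
                    nxt |= delta[c][q]
            table[c][S] = nxt
    return table

def run_nfa(table, start, text):
    """The scanning loop: one table lookup per input character."""
    S = start
    for c in text:
        S = table[c][S]
    return S
```

The scanning loop itself is trivially cheap; the cost is dominated by whether each `table[c][S]` access hits cache.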
To reduce the table size, several authors have separated the transition
table into several tables, each indexed by a subset of the bits in the bit
set representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables, each with
$\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.

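The split-table scheme can be sketched the same way, again over a hypothetical \texttt{delta[c][q]} mapping state $q$ to its successor bit set on character $c$ (illustrative names only):

```python
def build_split_tables(delta, m, sigma, k):
    """One table per k-bit chunk of the state set: m/k tables, each
    with sigma * 2^k entries."""
    tables = []
    for t in range(0, m, k):
        tab = [[0] * (1 << k) for _ in range(sigma)]
        for c in range(sigma):
            for chunk in range(1 << k):
                nxt = 0
                for b in range(k):
                    if t + b < m and (chunk >> b) & 1:
                        nxt |= delta[c][t + b]
                tab[c][chunk] = nxt
        tables.append(tab)
    return tables

def step(tables, S, c, m, k):
    """One input character: m/k lookups combined with m/k - 1 ORs."""
    nxt = 0
    for i, t in enumerate(range(0, m, k)):
        nxt |= tables[i][c][(S >> t) & ((1 << k) - 1)]
    return nxt
```

The choice of $k$ trades lookups per character against total table size, which is the trade-off the following paragraphs and Table~\ref{tab:ascii} quantify.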
The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with $m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.

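For concreteness, these figures follow from the general expressions above by substituting $k = 1$:
\[
\frac{m}{k} = m \text{ tables}, \qquad
m\,2^{k-3}\sigma = m\,2^{-2}\sigma = \frac{m\sigma}{4} \text{ bytes each.}
\]
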
Navarro and Raffinot use $k = \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and one boolean OR operation per character to
combine the lookups.

In Table~\ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the
size of the tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.

\begin{table}
{\small
\begin{tabular}{rrrrrrr}
%% & & Thompson & \ & NavRaff & NFA \\
$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\
\hline
& $m$ & & & & & \\
\multirow{5}{1.35cm}{memory (KiB)}
& 5 & 0.8 & 1.6 & 12.5 & 1.3 & 2.5\\
& 10 & 3.1 & 6.2 & 50.0 & 10.0 & 160.0\\
& 15 & 7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}

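As a spot check of the table entries, take $k = 1$ and $m = 10$ in the general size formula:
\[
\frac{m^2\, 2^{k-3}\sigma}{k} = \frac{10^2 \cdot 2^{-2} \cdot 128}{1}
= 3200 \text{ bytes} \approx 3.1 \text{ KiB},
\]
matching the corresponding entry of Table~\ref{tab:ascii}.
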
Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.

If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
This will stall the processor, and these stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.

Using 256KiB as an estimate of the size of a current standard data cache,
we can consider those entries of Table~\ref{tab:ascii} above 256 to be
relatively slow.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$
the method is limited to a small number of states.

We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k \leq 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} \geq \frac{m}{4}$ lookups.
The small constant, which we will underapproximate with 4 cycles, covers the
table addressing computation, combining the lookups with boolean OR, and
final state detection and handling.
Thus, the running time per input character may be lower-bounded by
$4 \cdot \frac{m}{4}$, or simply $m$, cycles.

Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per
input character, where the constant inside the big-Oh is approximately 2.
Furthermore, we expect no cache misses, due to the regular stride of our
memory accesses.
For ASCII, this time becomes at most $2 \cdot \frac{7m}{w} = \frac{14m}{w}$
cycles per character.
This is faster than the small-$k$ NFA methods by a factor of $w/14$.
For processors with a 128-bit word, this is about one order of magnitude.
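
To make this final comparison concrete, take $m = 20$ states and a $w = 128$-bit word. The small-$k$ NFA lower bound is about $m = 20$ cycles per character, while the bitstream estimate is
\[
\frac{14m}{w} = \frac{14 \cdot 20}{128} \approx 2.2 \text{ cycles per character},
\]
a speedup of $w/14 = 128/14 \approx 9$.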