\section{Running-time Comparison with Base NFA
Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- (and DFA-) based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation
of the regular expression using the algorithm presented above as
well as the compilers of the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as these are research tools that have
neither been optimized nor integrated.
This leads to a bias in our results, as our timings for nrgrep and gre2p
include the time taken for preprocessing.
We minimize the bias by performing our tests with
reasonably large inputs, so that the text-scanning costs dominate the
preprocessing costs.
We assume that the length $m$ of regular expressions is typically less than
100 bytes and that data files are typically over 10 MB.
Provided that a well-engineered implementation of our regular
expression compilation algorithm together with the compilers of
the Parabix toolchain requires no more than $10000m$ cycles,
the overhead of compilation will not substantially increase the
running time.
As our regular expression algorithm is $O(m)$,
and the other steps of the Parabix toolchain require $O(m \log m)$
time, we expect that such performance is well within reason.
It is important to note that our algorithms construct neither
NFAs nor DFAs and so are not subject to the exponential-time
behaviour of NFA-to-DFA transformation.

For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
This is a broad class of regular expressions, covering the majority of
common uses of grep.

Let $\Sigma$ be our input alphabet and $\sigma = |\Sigma|$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma = 1114112$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches
buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log \sigma \log \log \sigma)$ work per loop
iteration.

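To make the transposition step concrete, the following Python sketch (hypothetical and unoptimized; the actual implementation uses SIMD instructions to achieve the stated bound, and the helper names here are ours) converts a block of 8-bit characters into eight parallel basis bitstreams, and forms a character-class stream as a boolean combination of them:

```python
def transpose(block: bytes):
    """Transpose a block of 8-bit characters into 8 parallel bit streams.
    Bit k of stream j is 1 iff bit j (counting from the MSB) of character
    k is set.  Python ints stand in for arbitrary-width machine words."""
    streams = [0] * 8
    for k, ch in enumerate(block):
        for j in range(8):
            if ch & (0x80 >> j):
                streams[j] |= 1 << k
    return streams

def char_class(block: bytes, target: int):
    """Bitstream marking every position of `block` holding `target`,
    computed as a boolean combination of the eight basis streams."""
    streams = transpose(block)
    mask = (1 << len(block)) - 1  # all positions
    result = mask
    for j in range(8):
        if (target >> (7 - j)) & 1:
            result &= streams[j]
        else:
            result &= ~streams[j] & mask
    return result
```

The generated bitstream code contains such boolean combinations as straight-line statements rather than loops; the loop form above is only for exposition.
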
In total, this is $O(m \log \sigma + \log w + \log \sigma \log \log \sigma)$
work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures)
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$
with current and foreseeable technology--we do not expect to see $\log w$
skyrocket.
So we can absorb the $\log w$ term and state the work as $O(m \log \sigma + \log
\sigma \log \log \sigma)$ per iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n (m + \log \log \sigma) \log \sigma}{w})$ work.

We further note that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.

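To illustrate where these additions arise, the sketch below (Python, with arbitrary-precision integers standing in for $w$-bit words; the name and exact formulation are ours, not taken from this section) advances a set of match markers through a run of characters in a class using a single addition, whose carry performs the propagation along the run:

```python
def match_star(markers: int, cc: int) -> int:
    """Advance each marker bit through a maximal run of characters in
    the class whose bitstream is cc.  The addition is the only
    carry-propagating operation; the rest is bitwise logic."""
    return (((markers & cc) + cc) ^ cc) | markers
```

For example, with the class stream for `a` over the text `baaab` (bit $i$ = position $i$, least significant bit first), a marker at position 1 propagates through the whole run of `a`s in one step.
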
Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation--and limited live ranges for bitstream
variables--keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

We compare this with base NFA methods; by ``base'' here we mean NFA methods
that do not skip input characters.
The performance of input-skipping methods can be approximated by first analyzing
the performance of the base method and then multiplying it by the
expected fraction of examined (non-skipped) input.

In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
%%, where $\sigma$ is the size of the input alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$, a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.

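A minimal sketch of this base method (Python, illustrative only; a real matcher would build the table at pattern-compile time and would check for accepting states as it scans):

```python
def build_full_table(nfa_delta, m, sigma=128):
    """Precompute table[state_set][c] -> next state set for an NFA with
    m states.  nfa_delta maps (state, char) -> iterable of successor
    states.  The table has sigma * 2**m entries -- exponential in m,
    which is why it quickly outgrows the cache."""
    table = [[0] * sigma for _ in range(1 << m)]
    for s in range(1 << m):
        for c in range(sigma):
            nxt = 0
            for q in range(m):
                if s & (1 << q):
                    for r in nfa_delta.get((q, c), ()):
                        nxt |= 1 << r
            table[s][c] = nxt
    return table

def run(table, start, text):
    """One table lookup per input character."""
    s = start
    for ch in text:
        s = table[s][ch]
    return s

# Example: NFA for searching "ab" -- state 0 loops on every character,
# 'a' also moves 0 -> 1, and 'b' moves 1 -> 2 (the accepting state).
delta = {(0, c): (0,) for c in range(128)}
delta[(0, ord("a"))] = (0, 1)
delta[(1, ord("b"))] = (2,)
table = build_full_table(delta, m=3)
final = run(table, 0b001, b"xab")  # state set {0, 2}
```
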
To improve the table size, several authors have separated the transition
table into several tables, each indexed by a subset of the bits in the bit
set representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.

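The split-table scheme can be sketched as follows (Python, illustrative; the per-state successor function `delta` below is a hypothetical example, not part of any particular implementation):

```python
def build_split_tables(full_delta, m, k, sigma=128):
    """One table per k-bit chunk of the state set.  Chunk i's table,
    indexed by (chunk value, char), stores the successor set reachable
    from the states in that chunk alone.  full_delta(q, c) returns the
    successor bit set of single state q on character c."""
    n_tables = (m + k - 1) // k
    tables = []
    for i in range(n_tables):
        t = [[0] * sigma for _ in range(1 << k)]
        for chunk in range(1 << k):
            for c in range(sigma):
                nxt = 0
                for b in range(k):
                    q = i * k + b
                    if q < m and chunk & (1 << b):
                        nxt |= full_delta(q, c)
                t[chunk][c] = nxt
        tables.append(t)
    return tables

def step(tables, k, state_set, c):
    """m/k lookups, combined with boolean ORs."""
    nxt = 0
    for i, t in enumerate(tables):
        chunk = (state_set >> (i * k)) & ((1 << k) - 1)
        nxt |= t[chunk][c]
    return nxt

# A small NFA for searching "ab": state 0 loops on every character,
# 'a' also moves 0 -> 1, and 'b' moves 1 -> 2.
def delta(q, c):
    if q == 0:
        return 0b011 if c == ord("a") else 0b001
    if q == 1 and c == ord("b"):
        return 0b100
    return 0

tables = build_split_tables(delta, m=3, k=2)
s = 0b001
for ch in b"xab":
    s = step(tables, 2, s, ch)  # two lookups + one OR per character
```
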
The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with $m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.

Navarro and Raffinot use $k = \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and $1$ boolean OR operation per character to
combine the lookups.

In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the size
of the tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.

\begin{table}
\small{
\begin{tabular}{rrrrrrr}
%% & & Thompson & \ & NavRaff & NFA \\
$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\
\hline
 & $m$ & & & & \\
\multirow{5}{1.35cm}{memory (KiB)}
 & 5 & 0.8 & 1.6 & 12.5 & 1.3 & 2.5\\
 & 10 & 3.1 & 6.2 & 50.0 & 10.0 & 160.0\\
 & 15 & 7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
 & 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
 & 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}

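The entries of Table \ref{tab:ascii} follow directly from the $\frac{m^2 2^{k-3}\sigma}{k}$ total-size formula derived above (with ceilings ignored, as in the text); a quick check in Python:

```python
def split_table_bytes(m, k, sigma=128):
    """Total size of the split transition tables: m/k tables, each with
    sigma * 2**k entries of m/8 bytes (ceilings ignored, as in the
    text).  Equals m**2 * 2**(k-3) * sigma / k."""
    return (m / k) * sigma * (1 << k) * (m / 8)

# Reproduce the m = 20 row of the table (values in KiB),
# for k = 1, 4, 8, m/2 = 10, and m = 20:
row = [split_table_bytes(20, k) / 1024 for k in (1, 4, 8, 10, 20)]
```
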
Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
the lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.

If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
These will stall the processor, and the stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.

Using 256 KiB as an estimate of the size of a current standard data cache,
we can consider those entries of Table \ref{tab:ascii} above 256 to be
relatively slow.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$
the method is limited to a small number of states.

We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k \leq 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} \geq \frac{m}{4}$ lookups.
The small constant, which we will underestimate at 4 cycles, covers the
table addressing computation, combining the lookups with boolean OR, and
final state detection and handling.
Thus, the running time per input character may be lower-bounded by
$4 \cdot \frac{m}{4}$, or simply $m$, cycles.

Our method, on the other hand, takes time $O(\frac{m \log \sigma + \log \log
\sigma \log \sigma}{w})$ per input character, where the constant inside the
big-Oh is approximately 2 for the first part of the numerator and 6 for the
second part.
Furthermore, we expect no cache misses due to the regular stride of our memory
accesses.
For UTF-8 (or ASCII), this time becomes at most $2 \frac{8m}{w} + 6 \frac{24}{w}
= \frac{16m + 144}{w}$ cycles per character.

For processors with a 128-bit word, this is $\frac{16m + 144}{128} = \frac{m}{8}
+ \frac{9}{8}$ cycles per character.
Comparing this with the at least $m$ cycles per character of the base NFA
methods, we expect these NFA methods to be competitive with our method only when
the size of the regular expression is $1$.
As the size of the regular expression increases, we expect our method
to approach a factor-of-8 improvement over the base NFA methods.

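This back-of-envelope comparison can be reproduced as follows (Python; the constants 4, 2, and 6 are the estimates from the preceding paragraphs, and $\log \log \sigma = 3$ for 8-bit code units):

```python
def nfa_cycles(m, k=4, cost=4):
    """Lower bound for a small-k NFA method: `cost` cycles per table
    lookup (an underestimate), m/k lookups per character, assuming
    all lookups are cache hits."""
    return cost * m / k

def bitstream_cycles(m, w=128, log_sigma=8):
    """Bitstream method: (2*m*log_sigma + 6*log_sigma*loglog_sigma)/w
    cycles per character, with loglog_sigma = 3."""
    return (2 * m * log_sigma + 6 * log_sigma * 3) / w

# For w = 128 this simplifies to m/8 + 9/8 cycles per character,
# versus at least m cycles for the base NFA method.
```
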
In theory, our improvement factor should scale closely with the word size, so
that for processors with a 256-bit word we expect a 16x improvement, and for
processors with a 512-bit word a 32x improvement.
In practice, we expect some reduction in these improvement ratios
for various reasons, including the need for while-loops to
implement nested Kleene closures and processor limitations
such as memory bandwidth.
---|