\section{Running-time Comparison with Base NFA Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- (and DFA-) based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation
of the regular expression using the algorithm presented above as
well as the compilers of the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as these are research tools that have been
neither optimized nor integrated.
This introduces a bias in our results, as our timings for nrgrep and gre2p
include the time taken for preprocessing.
We minimize the bias by performing our tests with
reasonably large inputs, so that the text-scanning costs dominate the
preprocessing costs.  We assume that the length $m$ of regular expressions
is typically less than 100 bytes and that data files are typically over 10 MB.
Provided that a well-engineered implementation of our regular
expression compilation algorithm together with the compilers of
the Parabix toolchain requires no more than $10000m$ cycles,
the overhead of compilation will not substantially increase the
running time.  As our regular expression algorithm is $O(m)$,
and the other steps of the Parabix toolchain require $O(m \log m)$
time, we expect that such performance is well within reason.
It is important to note that our algorithms construct neither
NFAs nor DFAs and so are not subject to the exponential-time
behaviour of NFA-to-DFA transformation.

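As a rough sanity check on this budget (with illustrative numbers): for
$m = 100$ and a 10 MB input scanned at an optimistic one cycle per byte,
compilation would account for at most $10000 \cdot 100 = 10^6$ cycles against
roughly $10^7$ cycles of scanning, i.e., under ten percent of the total
running time; for larger inputs, the fraction shrinks proportionally.
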
For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
This is a broad class of regular expressions, covering the majority of
common uses of grep.

Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma = 1114112$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log \sigma \log \log \sigma)$ work per loop
iteration.

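To give the flavour of this compiled code, the sketch below (our illustration,
not actual Parabix output; the example regular expression, function names, and
single 64-bit block are assumptions for exposition, and cross-block carry
handling is elided) shows the style of three-address bitstream statements for
matching \texttt{a[0-9]*z}, with the Kleene closure implemented by the
MatchStar operation:

\begin{verbatim}
#include <stdint.h>

/* Advance markers one position forward in the text.  (In the full
   method this shift carries across 64-bit blocks; carries are
   elided in this single-block sketch.) */
static inline uint64_t advance(uint64_t m) { return m << 1; }

/* MatchStar(M, C): extend markers M through runs of the character
   class C.  The addition is the only carry-propagating operation;
   everything else is pure bitwise logic. */
static inline uint64_t match_star(uint64_t m, uint64_t c) {
    return (((m & c) + c) ^ c) | m;
}

/* Match-end markers for a[0-9]*z over one 64-bit block, given
   character-class bitstreams from the transposition step (bit i is
   set iff character i of the block belongs to the class). */
uint64_t match_block(uint64_t cc_a, uint64_t cc_digit, uint64_t cc_z) {
    uint64_t m1 = advance(cc_a);            /* after an 'a'       */
    uint64_t m2 = match_star(m1, cc_digit); /* after zero+ digits */
    return advance(m2 & cc_z);              /* after a final 'z'  */
}
\end{verbatim}

The single addition inside \texttt{match\_star} is an instance of the
carry-propagating operations counted by $C$ below.
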
In total, this is $O(m \log \sigma + \log w + \log \sigma \log \log \sigma)$
work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures)
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$ with current and foreseeable
technology, as we do not expect $\log w$ to grow dramatically.
So we can absorb the $\log w$ term and state the work as
$O(m \log \sigma + \log \sigma \log \log \sigma)$ per iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n (m + \log \log \sigma) \log \sigma}{w})$ total work.

We further note that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.

Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation, together with the limited live ranges of bitstream
variables, keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

We compare this with base NFA methods; by ``base'' here we mean NFA methods
that do not skip input characters.
The performance of input-skipping methods can be approximated by first analyzing
the performance of the base method and then multiplying this by the
expected fraction of examined (non-skipped) input.

In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
%%, where $\sigma$ is the size of the input alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$, a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.

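Concretely, the per-character step of this base method is a single indexed
load, as in the following minimal C sketch (ours, with hypothetical names; it
assumes $m \le 64$ so that a state set fits in one machine word):

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Base (non-skipping) simulation for m <= 64 states.  `table` is
   the precomputed transition function with sigma * 2^m entries,
   indexed by the input character and the current state set; its
   size is what drives the cache behaviour discussed above. */
uint64_t run_base_nfa(const uint64_t *table, unsigned m,
                      const unsigned char *input, size_t n) {
    uint64_t states = 1;  /* bit 0: the initial state */
    for (size_t i = 0; i < n; i++)
        states = table[((uint64_t)input[i] << m) | states];
    return states;
}
\end{verbatim}

In this word-based sketch, even $m = 20$ would demand
$128 \cdot 2^{20} \cdot 8$ bytes (1 GiB) of table, far beyond any cache,
which is the problem the split-table methods below address.
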
To improve the table size, several authors have separated the transition
table into several tables, each indexed by a subset of the bits in the bit
set representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.

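A minimal C sketch of this split-table scheme (again ours, with illustrative
names; it assumes $m \le 64$, $k < 64$, and that $k$ divides $m$) makes the
$\frac{m}{k}$ lookups and the OR combination explicit:

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Split-table simulation: the state set is divided into m/k chunks
   of k bits; chunk j indexes its own table of sigma * 2^k entries,
   and the looked-up successor sets are ORed together. */
uint64_t run_split_nfa(const uint64_t *const *tables, unsigned m,
                       unsigned k, const unsigned char *input,
                       size_t n) {
    const uint64_t chunk_mask = (UINT64_C(1) << k) - 1;
    uint64_t states = 1;  /* bit 0: the initial state */
    for (size_t i = 0; i < n; i++) {
        uint64_t next = 0;
        for (unsigned j = 0; j < m / k; j++) {  /* m/k lookups */
            uint64_t chunk = (states >> (j * k)) & chunk_mask;
            next |= tables[j][((uint64_t)input[i] << k) | chunk];
        }
        states = next;
    }
    return states;
}
\end{verbatim}

Thompson's method and the Navarro--Raffinot scheme described next are the
$k = 1$ and $k = \frac{m}{2}$ instances of this loop.
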
The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with $m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.

Navarro and Raffinot use $k = \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and one boolean OR operation per character to
combine the lookups.

In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the size
of the tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.

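As a worked check of one entry: for $k = 8$ and $m = 10$, the formula above
gives $\frac{m^2 2^{k-3} \sigma}{k} = \frac{10^2 \cdot 2^5 \cdot 128}{8} =
51200$ bytes, which is the 50.0 KiB entry shown.
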
\begin{table}
\small{
\begin{tabular}{rrrrrrr}
%%   & & Thompson & \  & NavRaff & NFA \\
$k$   & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$  & $\frac{m}{8}$  & 2 & 1\\
\hline
& $m$ &  &  &  & \\
\multirow{5}{1.35cm}{memory (KiB)}
&  5 &  0.8 &  1.6 &  12.5 &   1.3 & 2.5\\
& 10 &  3.1 &  6.2 &  50.0 &  10.0 & 160.0\\
& 15 &  7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}

Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.

If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
This will stall the processor, and these stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.

Using 256 KiB as an estimate of the size of a current standard data cache,
we can consider those entries of Table \ref{tab:ascii} above 256 KiB to
indicate relatively slow methods.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$
the method is limited to a small number of states.

We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k \le 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} \ge \frac{m}{4}$ lookups.
The small constant, which we will underestimate as 4 cycles, covers the
table addressing computation, combining the lookups with boolean OR, and
final state detection and handling.
Thus, the running time per input character may be lower-bounded by
$4 \cdot \frac{m}{4}$, or simply $m$, cycles.

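For example, a 20-state NFA with $k = 4$ is thus lower-bounded at
$4 \cdot \frac{20}{4} = 20$ cycles per input character; we return to this
figure below.
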
Our method, on the other hand, takes time $O(\frac{m \log \sigma + \log \log
\sigma \log \sigma}{w})$ per input character, where the constant inside the
big-Oh is approximately 2 for the first part of the numerator and 6 for the
second part.
Furthermore, we expect no cache misses, due to the regular stride of our memory
accesses.
For UTF-8 (or ASCII), this time becomes at most $2 \frac{8m}{w} + 6 \frac{24}{w}
= \frac{16m + 144}{w}$ cycles per character.

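Continuing the $m = 20$ example with $w = 128$, this gives
$\frac{16 \cdot 20 + 144}{128} \approx 3.6$ cycles per character, compared
with the 20-cycle lower bound derived above for the small-$k$ NFA methods.
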
For processors with a 128-bit word, this is $\frac{16m + 144}{128} = \frac{m}{8}
+ \frac{9}{8}$ cycles per character.
Comparing this with the at least $m$ cycles per character of the base NFA
methods, we expect these NFA methods to be competitive with our method only when
the size of the regular expression is $1$.
As the size of the regular expression increases, we expect our method
to approach a factor-of-8 improvement over the base NFA methods.

In theory, our improvement factor should scale closely with the word size: for
processors with a 256-bit word, we expect a 16x improvement, and for
processors with a 512-bit word, a 32x improvement.
In practice, we expect some reduction in these improvement ratios
for various reasons, including the need for while-loops to
implement nested Kleene closures and processor limitations
such as memory bandwidth.