source: docs/Working/re/analysis.tex @ 3653

Last change on this file since 3653 was 3641, checked in by shermer, 6 years ago

Roughed in a rough analysis of NFA method speed vs. bitstreams. Didn't get around to large-k NFA methods, like NRGrep.

File size: 8.8 KB
1\section{Running-time Comparison with DFA and NFA
4Our experimental results indicate that regular expression matching using
5bitstreams can outperform current implementations of NFA- and DFA-based
7It is worth exploring why this is so, and under what conditions one might
8expect bitstreams to perform better than NFA- or DFA-based matchers, and
11The bitstream method starts with a preprocessing step: the compilation
12of the
13regular expression using the parabix toolchain.
14Compilation is an offline process whose time is not counted in our
16measures, as parabix is a experimental research compiler that is not
18This leads to a bias in our results, as our timings for nrgrep and grep
19include the time taken for preprocessing.
20We have attempted to minimize the bias by performing our tests with large
21inputs, so that the text-scanning costs dominate the preprocessing costs.
22We furthermore believe that, if a special-purpose optimized compiler for
23regular expressions were built, that its inclusion in bitstream grep
24would not
25substantially increase the running time, particularly for large input
26texts--the compilation involved is straightforward.
28For simplicity, we will first assume that the input regular expressions
29are restricted to having Kleene closures only of single characters or
30alternations of single characters.
31This is a broad class of regular expressions, covering the majority of
32common uses of grep.
34Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
35As we are comparing techniques in practice, we assume that $\Sigma$ is a
36standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
37UTF-16 ($\sigma = 65536$ ), or UTF-32 ($\sigma \approx 4.3 \times 10^9$).
38This assumption allows us to equate the number of bits in the encoding of a
39character (a parameter for the bitstream method) with $\log \sigma$.
42The bitstream method compiles a regular expression of size $m$ into
43bitstream code that is $O(m \log \sigma)$ statements long (with one operation
44per statement; it is essentially three-address code).
45This is translated to machine code and placed inside a
46loop\footnote{Technically, it is inside two loops: an inner one that
47executes once per $w$ characters in a large buffer, and an outer one
48that successively fetches
49buffers until the input is exhausted.} that
50executes once per $w$ characters, where $w$ is the width of the processor's
52Also inside this loop is the transposition step that converts character-encoded
53files into their bitstream representation;
54this transposition takes $O(\log w)$ work per loop iteration.
56In total, this is $O(m \log \sigma + \log w)$ work per iteration.
57In current practice, we have $\log w$ around 8 (for 256-bit architectures),
58and $\log \sigma$ at least 7.
59Thus, $m \log \sigma$ will dominate $\log w$
60with current and forseeable technology--we do not expect to see $\log w$
62So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
64We multiply this by $O(\frac{n}{w})$ iterations to give
65$O(\frac{n m \log \sigma}{w})$ work.
67We further note that all of the work in the loop is
68done by superscalar instructions, with the exception of the additions,
69which require carry propagation.
70There will be at most $C$ of these additions in the loop, where $C$ is the
71number of concatenation and Kleene star operations in the regular
72expression; $C < m$.
74Almost all intermediate bitstreams in the loop body can be kept in
76requiring no storage in memory.
77Good register allocation--and limited live ranges for bitstream
78variables--keeps register spillage to a minimum.
79For those bitstreams that do require storage in memory,
80long buffers are allocated, allowing the successive iterations of the
81loop to access successive memory locations.
82That is, for the few streams requiring it,
83memory is accessed in a sequential fashion.
84As this is the best case for
85hardware prefetching,
86we expect few cache misses with bitstream method.
88Compare this with NFA methods.
89In the base NFA method, a state set of approximately $m$ states is kept
90as a bit
91set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
92For each character $c$ of the input,
93a precomputed transition table, indexed by the $c$ and the current state
95is accessed.
96Since there are $2^{\Theta(m)}$ state sets, the transition table will have
97$\sigma 2^{\Theta(m)}$ entries.
98%%, where $\sigma$ is the size of the input alphabet.
99Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
100Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
102large:  it can become expensive to precompute, and it consumes a lot of
104For even fairly small $m$ a table of this size will probably not fit in
107Thus, we would expect many cache misses with this base method.
109To improve the table size, several authors have separated the transition
111into several tables, each indexed by a subset of the bits in the bit set
112representing the current state.
113Suppose one uses $k$ bits of the state set to index each table.
114Ignoring ceilings, this requires $\frac{m}{k}$ tables,
115each with $\sigma 2^k$
116entries of $\frac{m}{8}$ bytes apiece.
117Each table therefore takes up $ m 2^{k-3} \sigma$ bytes,
118and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
119At each character, the NFA algorithm does one lookup in each table,
120combining the results with $\frac{m}{k}-1$ boolean OR operations.
122The original NFA method of Thompson uses $k=1$,
123which gives a $m$ tables of $\frac{m \sigma}{4}$ bytes each,
124along with
125$m$ lookups and $m-1$ boolean OR operations to
126combine the lookups, per character.
128Navarro and Raffinot use $k= \frac{m}{2}$,
129giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
130two lookups per character, and 1 boolean OR operation per character to
131combine the lookups.
133In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
134methods, listing the number of table lookups per input character and the size of the
135tables for various values of $m$, the number of states.
136We assume the ASCII character set ($\sigma = 128$); any of the
137UTF character sets would yield larger tables.
142%%   & & Thompson & \  & NavRaff & NFA \\
143$k$   & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
145lookups & & $m$ & $\frac{m}{4}$  & $\frac{m}{8}$  & 2 & 1\\
147& $m$ &  &  &  & \\
148\multirow{4}{1.35cm}{memory (KiB)}
149&  5 &  0.8 &  1.6 &  12.5 &   1.3 & 2.5\\
150& 10 &  3.1 &  6.2 &  50.0 &  10.0 & 160.0\\
151& 15 &  7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
152& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
153& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\ 
156\caption{lookups per character and memory consumed by tables in NFA methods
157(in kibibytes)}
161Of particular importance to the speed of NFA methods is whether the
162table lookups result in cache hits or not.
163If the tables are small enough, then they will fit into cache and
164lookups will all be cache hits, taking minimal time.
165In this case, the time per input character will be a small constant times the
166number of lookups.
168If the tables are not small enough to fit into cache, some proportion
169of the lookups will generate cache misses. 
170This will stall the processor and these stalls will come to dominate the
171computation time.
172In this case, the time per input character will be some large constant
173(a cache miss can take about two orders of magnitude longer than a cache hit)
174times the number of lookups. 
176Using  256KiB as an estimate of the size of a current standard data cache,
177we can consider those entries of Table \ref{tab:ascii} above 256 to be
178relatively slow.
179We can summarize these theoretical predictions by saying that the NFA methods
180with small $k$ scale well with an increase in NFA states, but with large $k$ the method is
181limited to a small number of states.
183We can now directly (but roughly) compare the NFA methods with bitstream
185Consider small-$k$ (say, $k <= 4$) NFA methods.
186For the reasonable range of $m$, the tables fit into cache.
187The running time is predicted to be a small constant times the
188$\frac{m}{k} >= \frac{m}{4}$ lookups.
189The small constant, which we will underapproximate with 4 cycles, is for the
190table addressing computation, combining the lookups with boolean OR, and final state
191detection and handling.
192Thus, the running time per input character may be lower-bounded by
193$4*\frac{m}{4}$, or simply $m$, cycles.
195Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per
196input character, where the constant inside the big-Oh is approximately 2.
197Furthermore, we expect no cache misses due to the regular stride of our memory
199For ASCII, this time becomes at most $2 \frac{7m}{w} = \frac{14m}{w}$ cycles
200per character.
201This is faster than the small-$k$ NFA methods by a factor of $w/14$.
202For processors with a 128-bit word, this is about one order of magnitude.
Note: See TracBrowser for help on using the repository browser.