\section{Running-time Comparison with DFA and NFA
Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- and DFA-based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation
of the regular expression using the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as Parabix is an experimental research compiler
that is not optimized.
This leads to a bias in our results, as our timings for nrgrep and grep
include the time taken for preprocessing.
We have attempted to minimize the bias by performing our tests with large
inputs, so that the text-scanning costs dominate the preprocessing costs.
We furthermore believe that, if a special-purpose optimized compiler for
regular expressions were built, its inclusion in bitstream grep would not
substantially increase the running time, particularly for large input
texts, since the compilation involved is straightforward.

For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
For example, \texttt{(a|b)*c} falls within this class, while
\texttt{(ab)*c} does not.
This is a broad class of regular expressions, covering the majority of
common uses of grep.

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m)$ statements long (with one operation per
statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the
processor's word.
This gives $O(\frac{nm}{w})$ work.
We can further break this down by noting that all of the work in the
loop is done by instructions that execute well on superscalar
processors, with the exception of the additions, which require carry
propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression.

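To illustrate, the following sketch (our own simplification, not actual
compiler output: the stream names are invented and cross-block carry
handling is omitted) shows the kind of statement sequence emitted for
the expression \texttt{a[0-9]*z}, acting on one block of $w = 64$ input
positions, with the Kleene closure handled by the carry-propagating
addition idiom used in bitstream matching:
\begin{verbatim}
#include <stdint.h>

/* Sketch only: not actual Parabix output.  CC_a, CC_d and CC_z are
   precomputed character-class bitstreams: bit i is set when input
   position i holds 'a', a digit, or 'z' respectively.              */
static uint64_t match_block(uint64_t CC_a, uint64_t CC_d, uint64_t CC_z)
{
    uint64_t M1 = CC_a << 1;            /* advance markers past 'a'  */
    /* Kleene closure of the digit class: the single carry-propagating
       addition extends each marker through a run of digits.         */
    uint64_t M2 = (((M1 & CC_d) + CC_d) ^ CC_d) | M1;
    uint64_t M3 = (M2 & CC_z) << 1;     /* advance markers past 'z'  */
    return M3;   /* nonzero iff a match ends within this block       */
}
\end{verbatim}
The body is a handful of straight-line statements for this small
expression, and the one addition corresponds to its single Kleene
closure, consistent with the $C$ bound above.
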
Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation, together with the limited live ranges of
bitstream variables, keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing the successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

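The access pattern is easy to picture; in the following sketch (the
buffer length and the per-block computation are placeholders of our
own), iteration $i$ of the inner loop reads and writes only word $i$
of each buffer, so every spilled stream is walked in strict address
order:
\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

#define BUFFER_WORDS 1024   /* words per long buffer; an assumed figure */

void inner_loop(const uint64_t *cc_stream, uint64_t *spilled_stream)
{
    for (size_t i = 0; i < BUFFER_WORDS; i++) {
        /* placeholder for the generated bitstream statements */
        spilled_stream[i] = cc_stream[i] ^ (cc_stream[i] << 1);
    }
}
\end{verbatim}
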
Compare this with NFA methods.
In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table can have
$\sigma 2^{\Theta(m)}$ entries, where $\sigma$ is the size of the input
alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a
lot of memory.
For even fairly small $m$, a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.

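To make the sizes concrete (an illustrative calculation of our own, not
a measurement): with $\sigma = 256$ and even a modest $m = 20$, the full
table has $256 \cdot 2^{20}$ entries of $\frac{20}{8} = 2.5$ bytes each,
about $640$~MB in total, far beyond the capacity of any processor cache.
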
To improve the table size, several authors have split the transition
table into multiple smaller tables, each indexed by a subset of the bits
in the bit set representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $2^{k-3} m \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.

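The per-character step of this scheme is small; the following sketch
(our own, with $m = 32$ and $k = 8$ chosen for illustration, and the
table contents assumed to be precomputed from the NFA) shows the
$\frac{m}{k}$ lookups and the ORs that combine them:
\begin{verbatim}
#include <stdint.h>

#define M 32                 /* number of NFA states (assumed)  */
#define K 8                  /* index bits per table (assumed)  */
#define NTABLES (M / K)
#define SIGMA 256            /* size of the input alphabet      */

/* tables[t][c][f] holds the set of states reachable on byte c from
   the states named by bits t*K .. t*K+K-1 of the current state set;
   a real matcher precomputes these entries before scanning.        */
static uint32_t tables[NTABLES][SIGMA][1u << K];

static uint32_t step(uint32_t state_set, uint8_t c)
{
    uint32_t next = 0;
    for (int t = 0; t < NTABLES; t++) {
        unsigned field = (state_set >> (t * K)) & ((1u << K) - 1);
        next |= tables[t][c][field];   /* one lookup and one OR */
    }
    return next;
}
\end{verbatim}
With these parameters the formula above gives
$\frac{32^2 \cdot 2^{5} \cdot 256}{8} = 1$~MB of tables, already at the
edge of a typical L2 cache.
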
The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with $m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.

Navarro and Raffinot use $k = \frac{m}{2}$,
giving two tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and one boolean OR operation per character to
combine the lookups.

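Continuing the illustrative $m = 20$, $\sigma = 256$ example from above,
the trade-off is stark: Thompson's scheme uses $20$ tables of
$\frac{20 \cdot 256}{4} = 1280$ bytes, about $25$~KB in all and easily
cache-resident, but costs $20$ lookups per character, whereas Navarro
and Raffinot's uses two tables of $2^{7} \cdot 20 \cdot 256 = 655360$
bytes, about $1.25$~MB in all, but costs only two lookups per character.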