\section{Running-time Comparison with DFA and NFA Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- and DFA-based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation of the
regular expression using the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as Parabix is an experimental research compiler that
is not optimized.
This leads to a bias in our results, as our timings for nrgrep and grep
include the time taken for preprocessing.
We have attempted to minimize the bias by performing our tests with large
inputs, so that the text-scanning costs dominate the preprocessing costs.
We furthermore believe that, if a special-purpose optimized compiler for
regular expressions were built, its inclusion in bitstreams grep would not
substantially increase the running time, particularly for large input
texts: the compilation involved is straightforward.

For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters; for example, \verb!a(b|c)*d! falls
within this class, while \verb!(ab)*! does not.
This is a broad class of regular expressions, covering the majority of
common uses of grep.

Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma = 1114112$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log w)$ work per loop iteration.
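
To make the shape of this code concrete, the following is a minimal sketch
(in C; the helper names and the example pattern are our own choosing) of the
kind of statement sequence the compiler emits, here for the pattern
\verb!a[0-9]*z! over a single $w = 64$ block, ignoring the carries that link
adjacent blocks:
\begin{verbatim}
#include <stdint.h>

typedef uint64_t bitstream;   /* one 64-character block of a bitstream */

/* Shift markers one position forward in the input. */
static bitstream advance(bitstream m) { return m << 1; }

/* Extend markers m through maximal runs of the character class c,
   using the bitwise MatchStar identity (((m & c) + c) ^ c) | m. */
static bitstream match_star(bitstream m, bitstream c) {
    return (((m & c) + c) ^ c) | m;
}

/* Compiled form of a[0-9]*z, given precomputed class bitstreams. */
bitstream match_block(bitstream cc_a, bitstream cc_digit, bitstream cc_z) {
    bitstream m = advance(cc_a);       /* markers just past each 'a'      */
    m = match_star(m, cc_digit);       /* absorb any run of digits        */
    return advance(m & cc_z);          /* markers just past a closing 'z' */
}
\end{verbatim}
Each statement is one boolean or arithmetic operation on $w$-bit words,
which is what lets the $O(m \log \sigma)$ statement count translate directly
into work per iteration.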

In total, this is $O(m \log \sigma + \log w)$ work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures),
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$
with current and foreseeable technology--we do not expect to see $\log w$
skyrocket.
So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n m \log \sigma}{w})$ work.
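
As an illustrative instance of this bound (our numbers): for a pattern with
$m = 20$, ASCII input ($\log \sigma = 7$), and $w = 256$, the per-character
work is
\[
\frac{m \log \sigma}{w} \;=\; \frac{20 \cdot 7}{256} \;\approx\; 0.55
\]
operations, i.e., roughly one bitstream operation for every two input
characters.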

We further note that almost all of the work in the loop is done by simple
logic instructions that pipeline well on superscalar processors; the
exception is the additions, which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.
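
Where an addition must span more than one machine word (for instance, when a
bitstream is held in several 64-bit words), the carry out of each word feeds
the next. A sketch of this, with a hypothetical helper name and assuming
64-bit words:
\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Add two bitstreams held as arrays of 64-bit words, propagating the
   carry from each word into the next.  This word-to-word carry is the
   one sequential dependency in the loop body. */
void stream_add(const uint64_t *a, const uint64_t *b, uint64_t *sum,
                size_t nwords) {
    uint64_t carry = 0;                       /* always 0 or 1 */
    for (size_t i = 0; i < nwords; i++) {
        uint64_t t = a[i] + carry;
        uint64_t c1 = (t < carry);            /* carry out of a[i] + carry */
        sum[i] = t + b[i];
        carry = c1 | (sum[i] < t);            /* carry out of t + b[i]     */
    }
}
\end{verbatim}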

Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation, together with the limited live ranges of bitstream
variables, keeps register spilling to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

Compare this with NFA methods.
In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
%%, where $\sigma$ is the size of the input alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$, a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.
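
For concreteness, a sketch of the scanning loop of this base method (the
layout is ours; we assume $m \leq 16$ so that a state set fits in one
16-bit word):
\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

enum { SIGMA = 128 };   /* ASCII */

/* Base method: one monolithic table of 2^m * SIGMA precomputed state
   sets; each step is a single dependent load. */
uint16_t scan(const uint16_t (*table)[SIGMA],   /* [1 << m][SIGMA] */
              uint16_t start, uint16_t accept_mask,
              const unsigned char *text, size_t n) {
    uint16_t s = start;
    for (size_t i = 0; i < n; i++) {
        s = table[s][text[i]];        /* one lookup per input character */
        if (s & accept_mask) {
            /* a match ends at position i; report as needed */
        }
    }
    return s;
}
\end{verbatim}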

To improve the table size, several authors have split the transition table
into multiple smaller tables, each indexed by a subset of the bits in the
bit set representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.
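
One step of this split-table simulation might look as follows (a sketch
under our own layout, with $k = 4$ and at most 32 states):
\begin{verbatim}
#include <stdint.h>

enum { SIGMA = 128, K = 4 };          /* ASCII; k bits of state per table */

/* tables[j][c][chunk] holds the successor state set contributed by the
   j-th k-bit chunk of the current state set on input character c. */
uint32_t step(uint32_t state, unsigned m,
              const uint32_t (*tables)[SIGMA][1 << K],
              unsigned char c) {
    uint32_t next = 0;
    for (unsigned j = 0; j < (m + K - 1) / K; j++) {
        uint32_t chunk = (state >> (j * K)) & ((1u << K) - 1);
        next |= tables[j][c][chunk];  /* one lookup per table, OR to combine */
    }
    return next;
}
\end{verbatim}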

The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with $m$ lookups and $m-1$ boolean OR operations per character to
combine the lookups.

Navarro and Raffinot use $k = \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and one boolean OR operation per character to
combine the lookups.

In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the size
of the tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.

\begin{table}
\small{
\begin{tabular}{rrrrrrr}
%%   & & Thompson & \  & NavRaff & NFA \\
$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\
\hline
& $m$ & & & & \\
\multirow{5}{1.35cm}{memory (KiB)}
&  5 &  0.8 &  1.6 &  12.5 &    1.3 &        2.5\\
& 10 &  3.1 &  6.2 &  50.0 &   10.0 &      160.0\\
& 15 &  7.0 & 14.1 & 112.5 &  120.0 &     7680.0\\
& 20 & 12.5 & 25.0 & 200.0 &  640.0 &   327680.0\\
& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}

Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.

If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
This will stall the processor, and these stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.

Using 256 KiB as an estimate of the size of a current standard data cache,
we can consider those entries of Table \ref{tab:ascii} above 256 KiB to be
relatively slow.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$
the method is limited to a small number of states.
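
The memory column of Table \ref{tab:ascii} follows directly from the formula
above; the following small calculator (ours, not part of any of the systems
measured) reproduces the fixed-$k$ columns and flags entries exceeding a
256 KiB cache:
\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Total table memory for the split-table method, in KiB:
   (m/k) tables of sigma * 2^k entries, m/8 bytes per entry,
   i.e. m^2 * 2^(k-3) * sigma / k bytes. */
static double table_kib(double m, double k, double sigma) {
    return m * m * pow(2.0, k - 3.0) * sigma / k / 1024.0;
}

int main(void) {
    const double sigma = 128.0;       /* ASCII */
    const double cache_kib = 256.0;   /* assumed data-cache size */
    const double ks[] = {1.0, 4.0, 8.0};
    for (double m = 5.0; m <= 25.0; m += 5.0)
        for (int i = 0; i < 3; i++) {
            double kib = table_kib(m, ks[i], sigma);
            printf("m=%2.0f k=%1.0f: %7.1f KiB%s\n", m, ks[i], kib,
                   kib > cache_kib ? "  (exceeds cache)" : "");
        }
    return 0;
}
\end{verbatim}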

We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k \leq 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} \geq \frac{m}{4}$ lookups.
The small constant, which we will underestimate at 4 cycles, covers the
table addressing computation, combining the lookups with boolean OR, and
final-state detection and handling.
Thus, the running time per input character may be lower-bounded by
$4 \cdot \frac{m}{4}$, or simply $m$, cycles.

Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per
input character, where the constant inside the big-Oh is approximately 2.
Furthermore, we expect no cache misses, due to the regular stride of our
memory accesses.
For ASCII, this time becomes at most $2 \cdot \frac{7m}{w} = \frac{14m}{w}$
cycles per character.
This is faster than the small-$k$ NFA methods by a factor of $w/14$.
For processors with a 128-bit word, this is about one order of magnitude.