\section{Running-time Comparison with Base NFA
Implementations}\label{sec:analysis}

Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- (and DFA-) based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice versa.

The bitstream method starts with a preprocessing step: the compilation
of the regular expression using the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as Parabix is an experimental research compiler that
is not optimized.
This leads to a bias in our results, as our timings for nrgrep and grep
include the time taken for preprocessing.
We have attempted to minimize the bias by performing our tests with large
inputs, so that the text-scanning costs dominate the preprocessing costs.
We furthermore believe that, if a special-purpose optimized compiler for
regular expressions were built, its inclusion in bitstream grep would not
substantially increase the running time, particularly for large input
texts, since the compilation involved is straightforward.

For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
For example, closures such as \verb!a*! or \verb![0-9]*! are permitted,
while \verb!(ab)*! is not.
This is a broad class of regular expressions, covering the majority of
common uses of grep.

Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma = 1114112$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.
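For instance, this gives $\log_2 128 = 7$ bits per ASCII character and
$\log_2 256 = 8$ bits per UTF-8 code unit; for UTF-32,
$\lceil \log_2 1114112 \rceil = 21$ bits suffice, although the encoding
itself uses 32.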

The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches
buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log \sigma \log \log \sigma)$ work per loop
iteration.
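
To make the shape of the compiled loop body concrete, the following is a
minimal sketch in C (our illustration, not the generated code): it matches
the regular expression \verb!a[0-9]*z! over a single 64-bit word, assuming
the character-class bitstreams \verb!cc_a!, \verb!cc_digit!, and \verb!cc_z!
have already been produced by transposition, and eliding the carries that
propagate marker bits across blocks.

\begin{verbatim}
#include <stdint.h>

/* Advance moves each marker bit one position forward in the input. */
static inline uint64_t advance(uint64_t x) { return x << 1; }

/* MatchStar extends markers m through arbitrary-length runs of the
   character class c: the bitstream analogue of a Kleene closure. */
static inline uint64_t match_star(uint64_t m, uint64_t c) {
    return (((m & c) + c) ^ c) | m;
}

/* Compiled loop body for a[0-9]*z on one 64-bit block. */
uint64_t match_block(uint64_t cc_a, uint64_t cc_digit, uint64_t cc_z) {
    uint64_t m = advance(cc_a);       /* markers after an 'a'      */
    m = match_star(m, cc_digit);      /* markers after a[0-9]*     */
    m = advance(m & cc_z);            /* markers after a final 'z' */
    return m;                         /* nonzero: match in block   */
}
\end{verbatim}

Each statement is a single full-width operation, which is what lets the
$O(m \log \sigma)$ statement count translate into a comparable number of
machine instructions.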

In total, this is $O(m \log \sigma + \log w + \log \sigma \log \log \sigma)$
work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures),
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$
with current and foreseeable technology; we do not expect to see $\log w$
skyrocket.
So we can absorb the $\log w$ term and state the work as $O(m \log \sigma + \log
\sigma \log \log \sigma)$ per iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n (m + \log \log \sigma) \log \sigma}{w})$ work.
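
As a concrete check (our arithmetic), take the UTF-8 case $\sigma = 256$ on
a $w = 256$-bit machine: each iteration costs on the order of
\[
m \log \sigma + \log w + \log \sigma \log \log \sigma
 = 8m + 8 + 8 \cdot 3 = 8m + 32
\]
operations for $w = 256$ characters, or roughly $\frac{8m + 32}{256}$
operations per character.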

We further note that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.
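
The cost of such an addition on a machine without full-width arithmetic can
be seen in the following sketch (ours, for illustration): a 256-bit
addition built from 64-bit words, with the carry rippling serially across
the four words.

\begin{verbatim}
#include <stdint.h>

/* Add two 256-bit values, each held as four 64-bit words
   (least-significant word first), propagating the carry. */
void add256(const uint64_t a[4], const uint64_t b[4], uint64_t sum[4]) {
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a[i] + b[i];
        uint64_t c1 = (s < a[i]);     /* carry out of a[i] + b[i]   */
        sum[i] = s + carry;
        uint64_t c2 = (sum[i] < s);   /* carry out of adding carry  */
        carry = c1 | c2;
    }
}
\end{verbatim}

The serial dependence between iterations is what distinguishes these
additions from the rest of the loop body, whose operations are independent
of one another.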

Almost all intermediate bitstreams in the loop body can be kept in
registers, requiring no storage in memory.
Good register allocation, together with the limited live ranges of
bitstream variables, keeps register spilling to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing the successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for hardware prefetching,
we expect few cache misses with the bitstream method.

We compare this with base NFA methods; by ``base'' here we mean NFA methods
that do not skip input characters.
The performance of input-skipping methods can be approximated by first analyzing
the performance of the base method and then multiplying this by the
expected fraction of examined (non-skipped) input.

In the base NFA method, a state set of approximately $m$ states is kept
as a bit set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$ a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.
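
A sketch of this lookup loop in C (our illustration; the names and the
small-$m$ restriction are ours) shows how little computation it performs per
character beyond the single table access:

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Base (non-skipping) NFA simulation: the whole m-bit state set
   indexes one table of sigma * 2^m entries, one lookup per character.
   table[(c << m) | states] holds the next state set. */
uint32_t run_nfa(const uint32_t *table, unsigned m, uint32_t start,
                 const uint8_t *text, size_t n) {
    uint32_t states = start;     /* bit i set = NFA state i active */
    for (size_t i = 0; i < n; i++)
        states = table[((size_t)text[i] << m) | states];
    return states;
}
\end{verbatim}

For any realistic $m$, the table itself, not this loop, is the bottleneck:
each access touches a structure of $\sigma 2^m$ entries.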

To improve the table size, several authors have split the transition table
into multiple smaller tables, each indexed by a subset of the bits in the bit set
representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$
entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.
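
In C, one step of this split-table scheme looks roughly like the following
(our sketch; the table layout and names are ours):

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* One input character: the m-bit state set is divided into m/k chunks
   of k bits; each chunk indexes its own table of sigma * 2^k entries,
   and the m/k partial results are OR'ed together. */
uint32_t step_split(const uint32_t *const *tables, unsigned m, unsigned k,
                    uint32_t states, uint8_t c) {
    uint32_t next = 0;
    uint32_t mask = (1u << k) - 1;
    for (unsigned j = 0; j < m / k; j++) {
        uint32_t chunk = (states >> (j * k)) & mask;
        next |= tables[j][((size_t)c << k) | chunk];  /* one lookup */
    }
    return next;
}
\end{verbatim}

Choosing $k$ trades the number of lookups ($\frac{m}{k}$) against total
table size ($\frac{m^2 2^{k-3}\sigma}{k}$ bytes); this is the trade-off
explored below.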

The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with
$m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.

Navarro and Raffinot use $k = \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and one boolean OR operation per character to
combine the lookups.

In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the size of the
tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.

\begin{table}
{\small
\begin{tabular}{rrrrrrr}
%%   & & Thompson & \  & NavRaff & NFA \\
$k$   & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$  & $\frac{m}{8}$  & 2 & 1\\
\hline
& $m$ &  &  &  &  & \\
\multirow{5}{1.35cm}{memory (KiB)}
&  5 &  0.8 &  1.6 &  12.5 &   1.3 & 2.5\\
& 10 &  3.1 &  6.2 &  50.0 &  10.0 & 160.0\\
& 15 &  7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}
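
To illustrate how the entries of Table \ref{tab:ascii} arise, consider
$k = 8$ and $m = 20$ (our worked check):
\[
\frac{m^2 \, 2^{k-3} \sigma}{k}
 = \frac{20^2 \cdot 2^{5} \cdot 128}{8}
 = 204800 \text{ bytes} = 200 \text{ KiB},
\]
matching the corresponding table entry.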

Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.

If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
This will stall the processor, and these stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.
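
A simple model (ours, for illustration) covers both regimes: with cache hit
rate $h$ and per-lookup costs $t_{\mathit{hit}}$ and $t_{\mathit{miss}}$,
the expected time per input character is roughly
\[
\frac{m}{k} \left( h \, t_{\mathit{hit}} + (1-h) \, t_{\mathit{miss}} \right),
\]
which degrades sharply once the tables outgrow the cache and $h$ falls,
given $t_{\mathit{miss}} \approx 100 \, t_{\mathit{hit}}$.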

Using 256 KiB as an estimate of the size of a current standard data cache,
we can consider the entries of Table \ref{tab:ascii} above 256 to represent
methods that will be relatively slow.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$ the method is
limited to a small number of states.

We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k \leq 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} \geq \frac{m}{4}$ lookups.
The small constant, which we will underapproximate as 4 cycles, covers the
table addressing computation, the combining of lookups with boolean OR, and
final-state detection and handling.
Thus, the running time per input character may be lower-bounded by
$4 \cdot \frac{m}{4}$, or simply $m$, cycles.

Our method, on the other hand, takes time $O(\frac{m \log \sigma + \log \log
\sigma \log \sigma}{w})$ per input character, where the constant inside the
big-Oh is approximately 2 for the first part of the numerator and 6 for the
second part.
Furthermore, we expect no cache misses due to the regular stride of our memory
accesses.
For UTF-8 (or ASCII), this time becomes at most $2 \frac{8m}{w} + 6 \frac{24}{w}
= \frac{16m + 144}{w}$ cycles per character.

For processors with a 128-bit word, this is $\frac{16m + 144}{128} = \frac{m}{8}
+ \frac{9}{8}$ cycles per character.
Comparing this with the at least $m$ cycles per character of the base NFA
methods, we expect these NFA methods to be competitive with our method only when
the size of the regular expression is $1$.
As the size of the regular expression increases, we expect our method
to approach a factor-of-8 improvement over the base NFA methods.
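
For example (our arithmetic), at $m = 16$ the bitstream method needs
\[
\frac{16 \cdot 16 + 144}{128} = \frac{400}{128} \approx 3.1
\]
cycles per character against at least $16$ for the base NFA methods, already
better than a factor of $5$; the ratio
$\frac{m}{m/8 + 9/8} = \frac{8m}{m + 9}$ tends to $8$ as $m$ grows.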

In theory, our improvement factor should scale closely with the word size, so
that for processors with a 256-bit word we expect a 16x improvement, and for
processors with a 512-bit word, a 32x improvement.
In practice, there is some reduction in these improvement factors due to
the instruction sets of larger-width processors not yet being as versatile as
those of 128-bit processors.  (For example, full-width addition is not yet
supported.)