\section{Running-time Comparison with DFA and NFA
Implementations}\label{sec:analysis}
Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- and DFA-based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice-versa.
The bitstream method starts with a preprocessing step: the compilation
of the
regular expression using the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance
measures, as Parabix is a experimental research compiler that is not
optimized.
This leads to a bias in our results, as our timings for nrgrep and grep
include the time taken for preprocessing.
We have attempted to minimize the bias by performing our tests with large
inputs, so that the text-scanning costs dominate the preprocessing costs.
We furthermore believe that, if a special-purpose optimized compiler for
regular expressions were built, that its inclusion in bitstreams grep
would not
substantially increase the running time, particularly for large input
texts--the compilation involved is straightforward.
For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
This is a broad class of regular expressions, covering the majority of
common uses of grep.
Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$ ), or UTF-32 ($\sigma \approx 4.3 \times 10^9$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.
The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches
buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log w)$ work per loop iteration.
In total, this is $O(m \log \sigma + \log w)$ work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures),
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$
with current and foreseeable technology--we do not expect to see $\log w$
skyrocket.
So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n m \log \sigma}{w})$ work.
We further note that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.
Almost all intermediate bitstreams in the loop body can be kept in
registers,
requiring no storage in memory.
Good register allocation--and limited live ranges for bitstream
variables--keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing the successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for
hardware prefetching,
we expect few cache misses with bitstream method.
Compare this with NFA methods.
In the base NFA method, a state set of approximately $m$ states is kept
as a bit
set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by the $c$ and the current state
set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
%%, where $\sigma$ is the size of the input alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite
large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$ a table of this size will probably not fit in
cache
memory.
Thus, we would expect many cache misses with this base method.
To improve the table size, several authors have separated the transition
table
into several tables, each indexed by a subset of the bits in the bit set
representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$
entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $ m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.
The original NFA method of Thompson uses $k=1$,
which gives a $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with
$m$ lookups and $m-1$ boolean OR operations to
combine the lookups, per character.
Navarro and Raffinot use $k= \frac{m}{2}$,
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and 1 boolean OR operation per character to
combine the lookups.
In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the size of the
tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.
\begin{table}
\small{
\begin{tabular}{rrrrrrr}
%% & & Thompson & \ & NavRaff & NFA \\
$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\
\hline
& $m$ & & & & \\
\multirow{4}{1.35cm}{memory (KiB)}
& 5 & 0.8 & 1.6 & 12.5 & 1.3 & 2.5\\
& 10 & 3.1 & 6.2 & 50.0 & 10.0 & 160.0\\
& 15 & 7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}
Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.
If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
This will stall the processor and these stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.
Using 256KiB as an estimate of the size of a current standard data cache,
we can consider those entries of Table \ref{tab:ascii} above 256 to be
relatively slow.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$ the method is
limited to a small number of states.
We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k <= 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} >= \frac{m}{4}$ lookups.
The small constant, which we will under approximate with 4 cycles, is for the
table addressing computation, combining the lookups with boolean OR, and final state
detection and handling.
Thus, the running time per input character may be lower-bounded by
$4*\frac{m}{4}$, or simply $m$, cycles.
Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per
input character, where the constant inside the big-Oh is approximately 2.
Furthermore, we expect no cache misses due to the regular stride of our memory
accesses.
For ASCII, this time becomes at most $2 \frac{7m}{w} = \frac{14m}{w}$ cycles
per character.
This is faster than the small-$k$ NFA methods by a factor of $w/14$.
For processors with a 128-bit word, this is about one order of magnitude.