\section{Running-time Comparison with Base NFA
Implementations}\label{sec:analysis}
Our experimental results indicate that regular expression matching using
bitstreams can outperform current implementations of NFA- (and DFA-) based
matching.
It is worth exploring why this is so, and under what conditions one might
expect bitstreams to perform better than NFA- or DFA-based matchers, and
vice-versa.
The bitstream method starts with a preprocessing step: the compilation
of the regular expression using the algorithm presented above as
well as the compilers of the Parabix toolchain.
Compilation is an offline process whose time is not counted in our
performance measures, as the tools involved are research prototypes that
have been neither optimized nor integrated.
This introduces a bias in our results, since our timings for nrgrep and
gre2p do include the time taken for preprocessing.
We minimize the bias by performing our tests with
reasonably large inputs, so that the text-scanning costs dominate the preprocessing
costs. We assume that the length $m$ of regular expressions is typically less than
100 bytes and that data files are typically over 10 MB.
Provided that a well-engineered implementation of our regular
expression compilation algorithm together with the compilers of
the Parabix tool chain requires no more than $10000m$ cycles,
the overhead of compilation will not substantially increase the
running time. As our regular expression algorithm is $O(m)$,
and the other steps of the Parabix tool chain require $O(m \log m)$
time, we expect that such performance is well within reason.
It is important to note that our algorithms construct neither
NFAs nor DFAs and so are not subject to the exponential-time
behaviour of NFA-to-DFA transformation.
For simplicity, we will first assume that the input regular expressions
are restricted to having Kleene closures only of single characters or
alternations of single characters.
This is a broad class of regular expressions, covering the majority of
common uses of grep; for example, \texttt{a[0-9]*z} falls within it,
while \texttt{(ab)*} does not.
Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
As we are comparing techniques in practice, we assume that $\Sigma$ is a
standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
UTF-16 ($\sigma = 65536$ ), or UTF-32 ($\sigma = 1114112$).
This assumption allows us to equate the number of bits in the encoding of a
character (a parameter for the bitstream method) with $\log \sigma$.
The bitstream method compiles a regular expression of size $m$ into
bitstream code that is $O(m \log \sigma)$ statements long (with one operation
per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
executes once per $w$ characters in a large buffer, and an outer one
that successively fetches
buffers until the input is exhausted.} that
executes once per $w$ characters, where $w$ is the width of the processor's
word.
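As a concrete illustration, the following sketch (our own, not actual compiler output; 64-bit words stand in for full SIMD registers) shows the kind of loop body such compilation might produce for the regular expression \texttt{a[0-9]*z}. Here \texttt{Ca}, \texttt{Cd}, and \texttt{Cz} are assumed to be precomputed character-class bitstreams for \texttt{a}, \texttt{[0-9]}, and \texttt{z}:

```c
#include <stdint.h>

/* Sketch of compiled bitstream code for the regular expression a[0-9]*z,
 * over a single 64-bit block (standing in for a full SIMD register).
 * A 1 bit at position i of a marker stream means a match in progress is
 * ready to examine character i.  Carry propagation between successive
 * blocks is omitted for brevity. */
uint64_t match_block(uint64_t m0, uint64_t Ca, uint64_t Cd, uint64_t Cz)
{
    uint64_t m1 = (m0 & Ca) << 1;                /* match 'a', advance      */
    uint64_t m2 = (((m1 & Cd) + Cd) ^ Cd) | m1;  /* Kleene star of [0-9]:
                                                    one addition propagates
                                                    markers through digit
                                                    runs via its carries   */
    uint64_t m3 = (m2 & Cz) << 1;                /* match 'z', advance      */
    return m3;             /* 1 bits mark positions just past a full match */
}
```

Each statement is one full-width operation, matching the three-address form described above; only the Kleene-star step uses an addition.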
Also inside this loop is the transposition step that converts character-encoded
files into their bitstream representation;
this transposition takes $O(\log \sigma \log \log \sigma)$ work per loop
iteration.
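A naive reference version of this transposition (our illustration only; the actual Parabix transposition achieves the $O(\log \sigma \log \log \sigma)$ bound using SIMD pack operations rather than a bit-at-a-time loop) is:

```c
#include <stdint.h>

/* Naive reference transposition: convert 64 input bytes into 8 parallel
 * 64-bit basis bitstreams, one per bit position of the byte encoding,
 * so that bit i of basis[b] equals bit b of text[i].  This loop is for
 * illustration only; it does not reflect the optimized algorithm. */
void transpose64(const uint8_t text[64], uint64_t basis[8])
{
    for (int b = 0; b < 8; b++)
        basis[b] = 0;
    for (int i = 0; i < 64; i++)
        for (int b = 0; b < 8; b++)
            basis[b] |= ((uint64_t)((text[i] >> b) & 1)) << i;
}
```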
In total, this is $O(m \log \sigma + \log w + \log \sigma \log \log \sigma)$
work per iteration.
In current practice, we have $\log w$ around 8 (for 256-bit architectures),
and $\log \sigma$ at least 7.
Thus, $m \log \sigma$ will dominate $\log w$
with current and foreseeable technology; we do not expect word widths
to grow exponentially.
So we can absorb the $\log w$ term and state the work as $O(m \log \sigma + \log
\sigma \log \log \sigma)$ per iteration.
We multiply this by $O(\frac{n}{w})$ iterations to give
$O(\frac{n (m + \log \log \sigma) \log \sigma}{w})$ work.
We further note that almost all of the work in the loop is done by
simple full-width logical instructions that pipeline well on superscalar
processors; the exception is the additions, which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
expression; $C < m$.
Almost all intermediate bitstreams in the loop body can be kept in
registers,
requiring no storage in memory.
Good register allocation--and limited live ranges for bitstream
variables--keeps register spillage to a minimum.
For those bitstreams that do require storage in memory,
long buffers are allocated, allowing the successive iterations of the
loop to access successive memory locations.
That is, for the few streams requiring it,
memory is accessed in a sequential fashion.
As this is the best case for
hardware prefetching,
we expect few cache misses with the bitstream method.
We compare this with base NFA methods; by ``base'' here we mean NFA methods
that do not skip input characters.
The performance of input-skipping methods can be approximated by first analyzing
the performance of the base method and then multiplying this by the
expected fraction of examined (non-skipped) input.
In the base NFA method, a state set of approximately $m$ states is kept
as a bit
set in $\frac{m}{w}$ machine words (or $\frac{m}{8}$ bytes).
For each character $c$ of the input,
a precomputed transition table, indexed by $c$ and the current state
set,
is accessed.
Since there are $2^{\Theta(m)}$ state sets, the transition table will have
$\sigma 2^{\Theta(m)}$ entries.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
quite
large: it can become expensive to precompute, and it consumes a lot of
memory.
For even fairly small $m$, a table of this size will probably not fit in
cache memory.
Thus, we would expect many cache misses with this base method.
To reduce the table size, several authors have split the transition
table
into multiple tables, each indexed by a subset of the bits in the bit set
representing the current state.
Suppose one uses $k$ bits of the state set to index each table.
Ignoring ceilings, this requires $\frac{m}{k}$ tables,
each with $\sigma 2^k$
entries of $\frac{m}{8}$ bytes apiece.
Each table therefore takes up $ m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
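This size formula is easy to check numerically; the helper below (ours, for illustration, with ceilings ignored as in the text) reproduces the entries of Table \ref{tab:ascii} when $\sigma = 128$:

```c
#include <stdint.h>

/* Total size in bytes of the split transition tables: m/k tables, each
 * holding sigma * 2^k entries of m/8 bytes apiece (ceilings ignored). */
double tables_bytes(int m, int k, int sigma)
{
    return ((double)m / k) * sigma * ((uint64_t)1 << k) * (m / 8.0);
}
```

For instance, $m = 10$ and $k = m/2 = 5$ gives 10240 bytes, i.e.\ the 10.0 KiB entry of the table.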
At each character, the NFA algorithm does one lookup in each table,
combining the results with $\frac{m}{k}-1$ boolean OR operations.
The original NFA method of Thompson uses $k=1$,
which gives $m$ tables of $\frac{m \sigma}{4}$ bytes each,
along with
$m$ lookups and $m-1$ boolean OR operations per character to
combine the lookups.
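A minimal sketch of this per-character step for $k = 1$ (our illustration; practical implementations avoid testing state bits one at a time) is:

```c
#include <stdint.h>

/* One input character of the base NFA simulation with k = 1: the state
 * set is a bit set in a machine word (m <= 64 states here), and each set
 * state bit indexes its own precomputed table T[j][c] of successor-state
 * sets.  This performs up to m lookups and m-1 ORs per character. */
uint64_t nfa_step(uint64_t state, uint8_t c, uint64_t T[][256], int m)
{
    uint64_t next = 0;
    for (int j = 0; j < m; j++)
        if (state & (1ULL << j))
            next |= T[j][c];
    return next;
}
```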
Navarro and Raffinot use $k = \frac{m}{2}$,
giving two tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
with two lookups and one boolean OR operation per character to
combine the lookups.
In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
methods, listing the number of table lookups per input character and the size of the
tables for various values of $m$, the number of states.
We assume the ASCII character set ($\sigma = 128$); any of the
UTF character sets would yield larger tables.
\begin{table}
\small{
\begin{tabular}{rrrrrrr}
%% & & Thompson & \ & NavRaff & NFA \\
$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\
\hline
& $m$ & & & & & \\
\multirow{5}{1.35cm}{memory (KiB)}
& 5 & 0.8 & 1.6 & 12.5 & 1.3 & 2.5\\
& 10 & 3.1 & 6.2 & 50.0 & 10.0 & 160.0\\
& 15 & 7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods
(in kibibytes)}
\label{tab:ascii}
\end{table}
Of particular importance to the speed of NFA methods is whether the
table lookups result in cache hits or not.
If the tables are small enough, then they will fit into cache and
lookups will all be cache hits, taking minimal time.
In this case, the time per input character will be a small constant times the
number of lookups.
If the tables are not small enough to fit into cache, some proportion
of the lookups will generate cache misses.
This will stall the processor and these stalls will come to dominate the
computation time.
In this case, the time per input character will be some large constant
(a cache miss can take about two orders of magnitude longer than a cache hit)
times the number of lookups.
Using 256 KiB as an estimate of the size of a current standard data cache,
we can consider those entries of Table \ref{tab:ascii} above 256 KiB to be
relatively slow.
We can summarize these theoretical predictions by saying that the NFA methods
with small $k$ scale well with an increase in NFA states, but with large $k$ the method is
limited to a small number of states.
We can now directly (but roughly) compare the NFA methods with bitstream
methods.
Consider small-$k$ (say, $k \le 4$) NFA methods.
For the reasonable range of $m$, the tables fit into cache.
The running time is predicted to be a small constant times the
$\frac{m}{k} \ge \frac{m}{4}$ lookups.
The small constant, which we will underapproximate as 4 cycles, covers the
table-address computation, combining the lookups with boolean OR, and
final-state detection and handling.
Thus, the running time per input character may be lower-bounded by
$4 \cdot \frac{m}{4}$, or simply $m$, cycles.
Our method, on the other hand, takes time $O(\frac{m \log \sigma + \log \log
\sigma \log \sigma}{w})$ per input character, where the constant inside the
big-Oh is approximately 2 for the first part of the numerator and 6 for the
second part.
Furthermore, we expect no cache misses due to the regular stride of our memory
accesses.
For UTF-8 (or ASCII), we have $\log \sigma = 8$ and $\log \log \sigma = 3$,
so this time becomes at most $2 \cdot \frac{8m}{w} + 6 \cdot \frac{24}{w}
= \frac{16m + 144}{w}$ cycles per character.
For processors with a 128-bit word, this is $\frac{16m + 144}{128} = \frac{m}{8}
+ \frac{9}{8}$ cycles per character.
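Putting the two predictions side by side (using only the constants derived above; a rough model, not a measurement) gives a quick numerical check:

```c
/* Predicted cycles per input character, using the constants from the
 * analysis above: the base NFA methods are lower-bounded by m cycles,
 * while the bitstream estimate is (16m + 144)/w cycles for word width w
 * (UTF-8/ASCII case). */
double nfa_cycles_lb(double m)
{
    return m;
}

double bitstream_cycles(double m, double w)
{
    return (16.0 * m + 144.0) / w;
}
```

With $w = 128$, the bitstream estimate exceeds the NFA lower bound only at $m = 1$, and falls well below it as $m$ grows.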
Comparing this with the at least $m$ cycles per character of the base NFA
methods, we expect these NFA methods to be competitive with our method only when
the size of the regular expression is $1$.
As the size of the regular expression increases, we expect our method
to approach a factor-of-8 improvement over the base NFA methods.
In theory, our improvement factor should scale closely with the word size:
for processors with a 256-bit word, we expect a 16x improvement, and for
processors with a 512-bit word, a 32x improvement.
In practice, there is some reduction in these improvement factors due to
the instruction sets of larger-width processors not yet being as versatile as
those of 128-bit processors. (For example, full-width addition is not yet
supported.)