Index: /docs/Working/re/analysis.tex
===================================================================
--- /docs/Working/re/analysis.tex (revision 3640)
+++ /docs/Working/re/analysis.tex (revision 3641)
@@ -32,7 +32,15 @@
common uses of grep.
+Let $\Sigma$ be our input alphabet and $\sigma = |\Sigma|$.
+As we are comparing techniques in practice, we assume that $\Sigma$ is a
+standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
+UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma \approx 4.3 \times 10^9$).
+This assumption allows us to equate the number of bits in the encoding of a
+character (a parameter for the bitstream method) with $\log \sigma$.
+
+
The bitstream method compiles a regular expression of size $m$ into
-bitstream code that is $O(m)$ statements long (with one operation per
-statement; it is essentially three-address code).
+bitstream code that is $O(m \log \sigma)$ statements long (with one operation
+per statement; it is essentially three-address code).
This is translated to machine code and placed inside a
loop\footnote{Technically, it is inside two loops: an inner one that
@@ -42,11 +50,25 @@
executes once per $w$ characters, where $w$ is the width of the processor's
word.
This gives $O(\frac{nm}{w})$ work.
-We can further break this down by noting that all of the work in the loop is
+Also inside this loop is the transposition step that converts character-encoded
+files into their bitstream representation;
+this transposition takes $O(\log w)$ work per loop iteration.
+
+In total, this is $O(m \log \sigma + \log w)$ work per iteration.
+In current practice, we have $\log w$ around 8 (for 256-bit architectures),
+and $\log \sigma$ at least 7.
+Thus, $m \log \sigma$ will dominate $\log w$
+with current and foreseeable technology; we do not expect to see $\log w$
+skyrocket.
+So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
+iteration.
+We multiply this by $O(\frac{n}{w})$ iterations to give
+$O(\frac{n m \log \sigma}{w})$ work.
+
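+To make the transposition step concrete, here is a deliberately naive sketch
+(function name and byte-at-a-time loop are ours, purely illustrative): it
+converts a block of 8-bit characters into 8 parallel bitstreams, where
+bitstream $b$ holds bit $b$ of each input character.

```python
def transpose(block: bytes):
    # Serial-to-parallel transposition: bitstream b holds, at position i,
    # bit b of input byte i.  This double loop only illustrates the
    # bitstream representation; a production implementation would use the
    # O(log w) sequence of SIMD pack steps per block of w characters
    # described in the text.
    streams = [0] * 8
    for i, ch in enumerate(block):
        for b in range(8):
            if (ch >> b) & 1:
                streams[b] |= 1 << i
    return streams
```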
+We further note that all of the work in the loop is
done by superscalar instructions, with the exception of the additions,
which require carry propagation.
There will be at most $C$ of these additions in the loop, where $C$ is the
number of concatenation and Kleene star operations in the regular
-expression.
+expression; $C < m$.
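+To see why additions (and hence carry propagation) arise at all, consider the
+MatchStar trick used by Parabix-style bitstream matchers (our choice of
+illustration; the compiled code need not take exactly this form): one long
+addition lets every marker bit scan through a run of character-class bits in
+parallel, with the carry chain doing the scanning.

```python
def match_star(markers: int, cc: int) -> int:
    # MatchStar(M, C) = (((M & C) + C) ^ C) | M: each marker bit in M
    # advances through the maximal run of 1-bits in the character-class
    # stream C starting at its position; the addition's carry chain
    # performs the scan, which is why these additions cannot be treated
    # as purely bitwise-parallel operations.
    return (((markers & cc) + cc) ^ cc) | markers
```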
Almost all intermediate bitstreams in the loop body can be kept in
@@ -72,7 +94,7 @@
set,
is accessed.
-Since there are $2^{\Theta(m)}$ state sets, the transition table can have
-$\sigma 2^{\Theta(m)}$ entries, where $\sigma$ is the size of the input
-alphabet.
+Since there are $2^{\Theta(m)}$ state sets, the transition table will have
+$\sigma 2^{\Theta(m)}$ entries.
+%%, where $\sigma$ is the size of the input alphabet.
Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
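Plugging in concrete numbers makes this blow-up tangible; the helper below
(ours, reading the $\Theta(m)$ exponent as exactly $m$) evaluates the
worst-case table size just derived.

```python
def dfa_table_bytes(m: int, sigma: int = 128) -> int:
    # sigma * 2^m entries (taking the Theta(m) exponent to be exactly m),
    # each an m-bit state set occupying m/8 bytes.
    return sigma * (2 ** m) * m // 8
```

For $m = 20$ and ASCII this already gives 320 MiB, consistent with the $k = m$
column of Table \ref{tab:ascii}.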
@@ -93,5 +115,5 @@
each with $\sigma 2^k$
entries of $\frac{m}{8}$ bytes apiece.
-Each table therefore takes up $2^{k-3} m \sigma$ bytes,
+Each table therefore takes up $m 2^{k-3} \sigma$ bytes,
and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
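A minimal sketch of this chunked scheme's per-character step (names and table
layout are our own): the $m$-bit state set is split into $\frac{m}{k}$ chunks
of $k$ bits, each chunk indexes its own table, and the looked-up successor
state sets are ORed together.

```python
def nfa_step(tables, k: int, state: int, ch: int) -> int:
    # tables[i][ch][chunk] holds the successor state set contributed by
    # chunk i of the current state set on input character ch; the full
    # successor is the OR of one lookup per table.
    mask = (1 << k) - 1
    result = 0
    for i, table in enumerate(tables):
        chunk = (state >> (i * k)) & mask
        result |= table[ch][chunk]
    return result
```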
At each character, the NFA algorithm does one lookup in each table,
@@ -107,8 +129,79 @@
giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
two lookups per character, and 1 boolean OR operation per character to
-combine
-the lookups.
-
-
-
-
+combine the lookups.
+
+In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
+methods, listing the number of table lookups per input character and the size of the
+tables for various values of $m$, the number of states.
+We assume the ASCII character set ($\sigma = 128$); any of the
+UTF character sets would yield larger tables.
+
+\begin{table}
+\small{
+\begin{tabular}{rrrrrrr}
+%% & & Thompson & & Nav--Raff & NFA \\
+$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
+%%\hline
+lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\
+\hline
+& $m$ & & & & & \\
+\multirow{5}{1.35cm}{memory (KiB)}
+& 5 & 0.8 & 1.6 & 12.5 & 1.3 & 2.5\\
+& 10 & 3.1 & 6.2 & 50.0 & 10.0 & 160.0\\
+& 15 & 7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
+& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
+& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
+\end{tabular}
+}
+\caption{Lookups per character and memory consumed by tables in NFA methods
+(in kibibytes)}
+\label{tab:ascii}
+\end{table}
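+The entries of Table \ref{tab:ascii} follow from the formula above; the helper
+below (ours) evaluates $\frac{m}{k} \cdot \sigma 2^k \cdot \frac{m}{8}$ bytes
+and converts to KiB.  (For the $k = \frac{m}{2}$ column with odd $m$, the
+table rounds the chunk size up, which this helper does not model.)

```python
def nfa_tables_kib(m: int, k: int, sigma: int = 128) -> float:
    # m/k tables, each with sigma * 2^k entries of m/8 bytes:
    # total = (m/k) * sigma * 2^k * (m/8) bytes, reported in KiB.
    return (m / k) * sigma * (2 ** k) * (m / 8) / 1024
```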
+
+Of particular importance to the speed of NFA methods is whether the
+table lookups result in cache hits or not.
+If the tables are small enough, then they will fit into cache and
+lookups will all be cache hits, taking minimal time.
+In this case, the time per input character will be a small constant times the
+number of lookups.
+
+If the tables are not small enough to fit into cache, some proportion
+of the lookups will generate cache misses.
+This will stall the processor and these stalls will come to dominate the
+computation time.
+In this case, the time per input character will be some large constant
+(a cache miss can take about two orders of magnitude longer than a cache hit)
+times the number of lookups.
+
+Using 256 KiB as an estimate of the size of a current standard data cache,
+we can consider those entries of Table \ref{tab:ascii} above 256 to be
+relatively slow.
+We can summarize these theoretical predictions by saying that the NFA methods
+with small $k$ scale well with an increase in NFA states, but with large $k$ the method is
+limited to a small number of states.
+
+We can now directly (but roughly) compare the NFA methods with bitstream
+methods.
+Consider small-$k$ (say, $k \le 4$) NFA methods.
+For the reasonable range of $m$, the tables fit into cache.
+The running time is predicted to be a small constant times the
+$\frac{m}{k} \ge \frac{m}{4}$ lookups.
+The small constant, which we will under-approximate with 4 cycles, is for the
+table addressing computation, combining the lookups with boolean OR, and final state
+detection and handling.
+Thus, the running time per input character may be lower-bounded by
+$4 \cdot \frac{m}{4}$, or simply $m$, cycles.
+
+Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per
+input character, where the constant inside the big-Oh is approximately 2.
+Furthermore, we expect no cache misses due to the regular stride of our memory
+accesses.
+For ASCII, this time becomes at most $2 \cdot \frac{7m}{w} = \frac{14m}{w}$ cycles
+per character.
+This is faster than the small-$k$ NFA methods by a factor of $w/14$.
+For processors with a 128-bit word, this is about one order of magnitude.
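+Since $m$ cancels in this comparison, the predicted speedup is a function of
+$w$ alone; a one-line check (ours):

```python
def predicted_speedup(w: int) -> float:
    # Small-k NFA lower bound (m cycles/char) divided by the bitstream
    # estimate (14m/w cycles/char); the m's cancel, leaving w/14.
    return w / 14
```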
+
+
+
+
+