# Changeset 3641 for docs

Timestamp:
Feb 24, 2014, 12:08:02 AM
Message:

Roughed in an initial analysis of NFA method speed vs. bitstreams. Didn't get around to large-k NFA methods, like NRGrep.

File:
1 edited

common uses of grep. Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$. As we are comparing techniques in practice, we assume that $\Sigma$ is a standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$), UTF-16 ($\sigma = 65536$), or UTF-32 ($\sigma \approx 4.3 \times 10^9$). This assumption allows us to equate the number of bits in the encoding of a character (a parameter for the bitstream method) with $\log \sigma$. The bitstream method compiles a regular expression of size $m$ into bitstream code that is $O(m \log \sigma)$ statements long (with one operation per statement; it is essentially three-address code). This is translated to machine code and placed inside a loop\footnote{Technically, it is inside two loops: an inner one that executes once per $w$ characters, where $w$ is the width of the processor's word.}. Also inside this loop is the transposition step that converts character-encoded files into their bitstream representation; this transposition takes $O(\log w)$ work per loop iteration. In total, this is $O(m \log \sigma + \log w)$ work per iteration. In current practice, we have $\log w$ around 8 (for 256-bit architectures) and $\log \sigma$ at least 7. Thus, $m \log \sigma$ will dominate $\log w$ with current and foreseeable technology--we do not expect to see $\log w$ skyrocket. So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per iteration. We multiply this by $O(\frac{n}{w})$ iterations to give $O(\frac{n m \log \sigma}{w})$ work. We further note that all of the work in the loop is done by superscalar instructions, with the exception of the additions, which require carry propagation.
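To make the bitstream model concrete, here is a toy sketch in Python using arbitrary-precision integers as bitstreams (bit $i$ corresponds to input position $i$). The helper names (`char_class_stream`, `advance`, `match_star`) are illustrative, not the paper's implementation; the point is that every statement is a plain boolean operation except the addition inside the Kleene-star step, which is where carry propagation enters.

```python
# Toy model of bitstream regex matching: Python ints stand in for
# machine-word bitstreams. Illustrative only -- real bitstream code is
# compiled to straight-line three-address code over w-bit words.

def char_class_stream(text, chars):
    """Bitstream with a 1 at every position whose character is in `chars`."""
    s = 0
    for i, c in enumerate(text):
        if c in chars:
            s |= 1 << i
    return s

def advance(marker):
    """Shift markers one position forward (concatenation step)."""
    return marker << 1

def match_star(marker, cc):
    """Kleene star over a character class. The one addition here is the
    only operation that needs carry propagation across the word."""
    return (((marker & cc) + cc) ^ cc) | marker

# Match the pattern a[bc]*d against a text.
text = "xabbcd_ad_abd"
A = char_class_stream(text, "a")
BC = char_class_stream(text, "bc")
D = char_class_stream(text, "d")

m = advance(A)          # markers just after an 'a'
m = match_star(m, BC)   # absorb any run of [bc]
m = advance(m & D)      # require a final 'd'

# Recover the end position of each match from the marker bits.
matches = [i - 1 for i in range(len(text) + 1) if m >> i & 1]
print(matches)  # -> [5, 8, 12]
```

Each match end position corresponds to one occurrence of `a[bc]*d` ("abbcd", "ad", "abd") in the sample text.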
There will be at most $C$ of these additions in the loop, where $C$ is the number of concatenation and Kleene star operations in the regular expression; $C < m$. Almost all intermediate bitstreams in the loop body can be kept in [\ldots]

In the table-based NFA methods, an entry of the transition table, indexed by the current character and the current state set, is accessed at each input character. Since there are $2^{\Theta(m)}$ state sets, the transition table will have $\sigma 2^{\Theta(m)}$ entries. Each entry is a new state set, which requires $\frac{m}{8}$ bytes, so a single monolithic table would be of size $\sigma m 2^{\Theta(m)}$ bytes. To keep the tables to a practical size, the $m$ states are instead divided into $\frac{m}{k}$ groups of $k$ states, with one table per group, each with $\sigma 2^k$ entries of $\frac{m}{8}$ bytes apiece. Each table therefore takes up $m 2^{k-3} \sigma$ bytes, and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes. At each character, the NFA algorithm does one lookup in each table and combines the results; for example, $k = \frac{m}{2}$ gives $2$ tables of $m 2^{\frac{m}{2}-3} \sigma$ bytes each, two lookups per character, and one boolean OR operation per character to combine the lookups. In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA methods, listing the number of table lookups per input character and the size of the tables for various values of $m$, the number of states. We assume the ASCII character set ($\sigma = 128$); any of the UTF character sets would yield larger tables.
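The split-table simulation can be sketched as follows, under the assumptions above (hypothetical helper names, tiny example NFA): each of the $\frac{m}{k}$ groups of states gets its own table of $\sigma 2^k$ entries, each entry a precomputed next-state bitmask, and each input character costs one lookup per table plus boolean ORs to combine them.

```python
# Illustrative sketch of split-table NFA simulation (not a production
# engine): m states in groups of k, one table per group, each table
# indexed by (character, group's local state bits).

def build_tables(m, k, delta, sigma=128):
    """delta(state, char_code) -> set of next states for the given NFA."""
    tables = []
    for g in range(0, m, k):
        table = [[0] * (1 << k) for _ in range(sigma)]
        for c in range(sigma):
            for bits in range(1 << k):
                mask = 0
                for j in range(k):
                    if bits >> j & 1:
                        for t in delta(g + j, c):
                            mask |= 1 << t
                table[c][bits] = mask
        tables.append(table)
    return tables

def step(tables, k, state_set, c):
    """One input character: one lookup per table, OR'd together."""
    nxt = 0
    for g, table in enumerate(tables):
        local = (state_set >> (g * k)) & ((1 << k) - 1)
        nxt |= table[c][local]
    return nxt

# Tiny NFA for unanchored search of "ab":
# state 0 loops on anything, 0 --a--> 1, 1 --b--> 2 (accept).
def delta(state, c):
    out = set()
    if state == 0:
        out.add(0)               # self-loop: search anywhere
        if c == ord("a"):
            out.add(1)
    elif state == 1 and c == ord("b"):
        out.add(2)
    return out

tables = build_tables(m=4, k=2, delta=delta)   # 2 tables, 2 lookups/char
state = 1                                      # bitmask: start in state 0
ends = []
for i, ch in enumerate("xaabz"):
    state = step(tables, 2, state, ord(ch))
    if state >> 2 & 1:                         # accept state 2 reached
        ends.append(i)
print(ends)  # -> [3]: "ab" ends at index 3
```

With $m = 4$ and $k = 2$ this builds two tables of $\sigma 2^k = 512$ entries each, matching the per-character cost claimed above: two lookups and one OR.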
\begin{table}
\small{
\begin{tabular}{rrrrrrr}
%%  & & Thompson & \  & NavRaff & NFA \\
$k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
%%\hline
lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & $2$ & $1$\\
\hline
 & $m$ & & & & \\
\multirow{4}{1.35cm}{memory (KiB)}
 &  5 &  0.8 &  1.6 &  12.5 &    1.3 &        2.5\\
 & 10 &  3.1 &  6.2 &  50.0 &   10.0 &      160.0\\
 & 15 &  7.0 & 14.1 & 112.5 &  120.0 &     7680.0\\
 & 20 & 12.5 & 25.0 & 200.0 &  640.0 &   327680.0\\
 & 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
\end{tabular}
}
\caption{Lookups per character and memory consumed by tables in NFA methods (in kibibytes)}
\label{tab:ascii}
\end{table}
Of particular importance to the speed of NFA methods is whether the table lookups result in cache hits or not. If the tables are small enough, then they will fit into cache and lookups will all be cache hits, taking minimal time. In this case, the time per input character will be a small constant times the number of lookups. If the tables are not small enough to fit into cache, some proportion of the lookups will generate cache misses. These will stall the processor, and the stalls will come to dominate the computation time. In this case, the time per input character will be some large constant (a cache miss can take about two orders of magnitude longer than a cache hit) times the number of lookups. Using 256~KiB as an estimate of the size of a current standard data cache, we can consider those entries of Table \ref{tab:ascii} above 256 to be relatively slow. We can summarize these theoretical predictions by saying that the NFA methods with small $k$ scale well with an increase in NFA states, but with large $k$ the method is limited to a small number of states. We can now directly (but roughly) compare the NFA methods with bitstream methods. Consider small-$k$ (say, $k \le 4$) NFA methods. For the reasonable range of $m$, the tables fit into cache.
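The memory column of Table \ref{tab:ascii} can be rechecked mechanically from the formulas above. The following Python sketch reproduces the figures (up to the table's rounding to one decimal place), treating the $k = \frac{m}{2}$ column as two tables with $k = \lceil m/2 \rceil$ and the $k = m$ column as a single table.

```python
# Recompute the table sizes: each table has sigma * 2**k entries, and
# each entry is an m-state bitmask of m/8 bytes. With sigma = 128 (ASCII)
# the total is num_tables * sigma * 2**k * (m/8) bytes.
from math import ceil

def kib(num_tables, k, m, sigma=128):
    return num_tables * sigma * 2 ** k * (m / 8) / 1024

rows = {}
for m in (5, 10, 15, 20, 25):
    rows[m] = (
        [kib(m / k, k, m) for k in (1, 4, 8)]   # m/k tables of 2**k entries
        + [kib(2, ceil(m / 2), m)]              # k = m/2: two tables
        + [kib(1, m, m)]                        # k = m: one monolithic table
    )
    print(m, rows[m])
```

For example, the $m = 20$ row comes out as 12.5, 25.0, 200.0, 640.0, and 327680.0 KiB, matching the table.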
The running time is predicted to be a small constant times the $\frac{m}{k} \ge \frac{m}{4}$ lookups. The small constant, which we will underapproximate with 4 cycles, covers the table addressing computation, combining the lookups with boolean OR, and final state detection and handling. Thus, the running time per input character may be lower-bounded by $4 \cdot \frac{m}{4}$, or simply $m$, cycles. Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per input character, where the constant inside the big-Oh is approximately 2. Furthermore, we expect no cache misses due to the regular stride of our memory accesses. For ASCII, this time becomes at most $2 \cdot \frac{7m}{w} = \frac{14m}{w}$ cycles per character. This is faster than the small-$k$ NFA methods by a factor of $\frac{w}{14}$. For processors with a 128-bit word, this is about one order of magnitude.
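The closing comparison is simple arithmetic, sketched below with illustrative helper names; note that the predicted speedup $\frac{w}{14}$ is independent of $m$.

```python
# Back-of-envelope speed comparison from the analysis above:
# small-k NFA: ~4 cycles per lookup * m/4 lookups = m cycles/char;
# bitstream:   ~2 * log2(128) * m / w = 14m/w cycles/char for ASCII.

def nfa_cycles(m):
    return 4 * (m / 4)       # lower bound for small-k NFA methods

def bitstream_cycles(m, w):
    return 2 * 7 * m / w     # constant ~2, log sigma = 7 (ASCII)

for w in (64, 128, 256):
    speedup = nfa_cycles(20) / bitstream_cycles(20, w)
    print(w, round(speedup, 1))   # speedup = w/14 regardless of m
```

For $w = 128$ this prints a speedup of about 9.1, i.e., roughly one order of magnitude, as claimed.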