Changeset 3641 for docs

Feb 24, 2014, 12:08:02 AM (5 years ago)

Roughed in a rough analysis of NFA method speed vs. bitstreams. Didn't get around to large-k NFA methods, like NRGrep.

1 edited


  • docs/Working/re/analysis.tex

    r3502 r3641  
    3232common uses of grep.
     34Let $\Sigma$ be our input alphabet and $\sigma = | \Sigma |$.
     35As we are comparing techniques in practice, we assume that $\Sigma$ is a
     36standard input alphabet, such as ASCII ($\sigma = 128$), UTF-8 ($\sigma = 256$),
     37UTF-16 ($\sigma = 65536$ ), or UTF-32 ($\sigma \approx 4.3 \times 10^9$).
     38This assumption allows us to equate the number of bits in the encoding of a
     39character (a parameter for the bitstream method) with $\log \sigma$.
    3442The bitstream method compiles a regular expression of size $m$ into
    35 bitstream code that is $O(m)$ statements long (with one operation per
    36 statement; it is essentially three-address code).
     43bitstream code that is $O(m \log \sigma)$ statements long (with one operation
     44per statement; it is essentially three-address code).
    3745This is translated to machine code and placed inside a
    3846loop\footnote{Technically, it is inside two loops: an inner one that
    4250executes once per $w$ characters, where $w$ is the width of the processor's
    44 This gives $O(\frac{nm}{w})$ work.
    45 We can further break this down by noting that all of the work in the loop is
     52Also inside this loop is the transposition step that converts character-encoded
     53files into their bitstream representation;
     54this transposition takes $O(\log w)$ work per loop iteration.
     56In total, this is $O(m \log \sigma + \log w)$ work per iteration.
     57In current practice, we have $\log w$ around 8 (for 256-bit architectures),
     58and $\log \sigma$ at least 7.
     59Thus, $m \log \sigma$ will dominate $\log w$
     60with current and forseeable technology--we do not expect to see $\log w$
     62So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
     64We multiply this by $O(\frac{n}{w})$ iterations to give
     65$O(\frac{n m \log \sigma}{w})$ work.
     67We further note that all of the work in the loop is
    4668done by superscalar instructions, with the exception of the additions,
    4769which require carry propagation.
    4870There will be at most $C$ of these additions in the loop, where $C$ is the
    4971number of concatenation and Kleene star operations in the regular
    50 expression.
     72expression; $C < m$.
    5274Almost all intermediate bitstreams in the loop body can be kept in
    7395is accessed.
    74 Since there are $2^{\Theta(m)}$ state sets, the transition table can have
    75 $\sigma 2^{\Theta(m)}$ entries, where $\sigma$ is the size of the input
    76 alphabet.
     96Since there are $2^{\Theta(m)}$ state sets, the transition table will have
     97$\sigma 2^{\Theta(m)}$ entries.
     98%%, where $\sigma$ is the size of the input alphabet.
    7799Each entry is a new state set, which requires $\frac{m}{8}$ bytes.
    78100Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is
    93115each with $\sigma 2^k$
    94116entries of $\frac{m}{8}$ bytes apiece.
    95 Each table therefore takes up $2^{k-3} m \sigma$ bytes,
     117Each table therefore takes up $ m 2^{k-3} \sigma$ bytes,
    96118and so the collection of them takes up $\frac{m^2 2^{k-3}\sigma}{k}$ bytes.
    97119At each character, the NFA algorithm does one lookup in each table,
    107129giving $2$ tables of $2^{\frac{m}{2}-3} m \sigma$ bytes each,
    108130two lookups per character, and 1 boolean OR operation per character to
    109 combine
    110 the lookups.
     131combine the lookups.
     133In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA
     134methods, listing the number of table lookups per input character and the size of the
     135tables for various values of $m$, the number of states.
     136We assume the ASCII character set ($\sigma = 128$); any of the
     137UTF character sets would yield larger tables.
     142%%   & & Thompson & \  & NavRaff & NFA \\
     143$k$   & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\
     145lookups & & $m$ & $\frac{m}{4}$  & $\frac{m}{8}$  & 2 & 1\\
     147& $m$ &  &  &  & \\
     148\multirow{4}{1.35cm}{memory (KiB)}
     149&  5 &  0.8 &  1.6 &  12.5 &   1.3 & 2.5\\
     150& 10 &  3.1 &  6.2 &  50.0 &  10.0 & 160.0\\
     151& 15 &  7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\
     152& 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\
     153& 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\
     156\caption{lookups per character and memory consumed by tables in NFA methods
     157(in kibibytes)}
     161Of particular importance to the speed of NFA methods is whether the
     162table lookups result in cache hits or not.
     163If the tables are small enough, then they will fit into cache and
     164lookups will all be cache hits, taking minimal time.
     165In this case, the time per input character will be a small constant times the
     166number of lookups.
     168If the tables are not small enough to fit into cache, some proportion
     169of the lookups will generate cache misses. 
     170This will stall the processor and these stalls will come to dominate the
     171computation time.
     172In this case, the time per input character will be some large constant
     173(a cache miss can take about two orders of magnitude longer than a cache hit)
     174times the number of lookups. 
     176Using  256KiB as an estimate of the size of a current standard data cache,
     177we can consider those entries of Table \ref{tab:ascii} above 256 to be
     178relatively slow.
     179We can summarize these theoretical predictions by saying that the NFA methods
     180with small $k$ scale well with an increase in NFA states, but with large $k$ the method is
     181limited to a small number of states.
     183We can now directly (but roughly) compare the NFA methods with bitstream
     185Consider small-$k$ (say, $k <= 4$) NFA methods.
     186For the reasonable range of $m$, the tables fit into cache.
     187The running time is predicted to be a small constant times the
     188$\frac{m}{k} >= \frac{m}{4}$ lookups.
     189The small constant, which we will underapproximate with 4 cycles, is for the
     190table addressing computation, combining the lookups with boolean OR, and final state
     191detection and handling.
     192Thus, the running time per input character may be lower-bounded by
     193$4*\frac{m}{4}$, or simply $m$, cycles.
     195Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per
     196input character, where the constant inside the big-Oh is approximately 2.
     197Furthermore, we expect no cache misses due to the regular stride of our memory
     199For ASCII, this time becomes at most $2 \frac{7m}{w} = \frac{14m}{w}$ cycles
     200per character.
     201This is faster than the small-$k$ NFA methods by a factor of $w/14$.
     202For processors with a 128-bit word, this is about one order of magnitude.
Note: See TracChangeset for help on using the changeset viewer.