 Timestamp:
 Feb 24, 2014, 12:08:02 AM (5 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/Working/re/analysis.tex
r3502 r3641 32 32 common uses of grep. 33 33 34 Let $\Sigma$ be our input alphabet and $\sigma =  \Sigma $. 35 As we are comparing techniques in practice, we assume that $\Sigma$ is a 36 standard input alphabet, such as ASCII ($\sigma = 128$), UTF8 ($\sigma = 256$), 37 UTF16 ($\sigma = 65536$ ), or UTF32 ($\sigma \approx 4.3 \times 10^9$). 38 This assumption allows us to equate the number of bits in the encoding of a 39 character (a parameter for the bitstream method) with $\log \sigma$. 40 41 34 42 The bitstream method compiles a regular expression of size $m$ into 35 bitstream code that is $O(m )$ statements long (with one operation per36 statement; it is essentially threeaddress code).43 bitstream code that is $O(m \log \sigma)$ statements long (with one operation 44 per statement; it is essentially threeaddress code). 37 45 This is translated to machine code and placed inside a 38 46 loop\footnote{Technically, it is inside two loops: an inner one that … … 42 50 executes once per $w$ characters, where $w$ is the width of the processor's 43 51 word. 44 This gives $O(\frac{nm}{w})$ work. 45 We can further break this down by noting that all of the work in the loop is 52 Also inside this loop is the transposition step that converts characterencoded 53 files into their bitstream representation; 54 this transposition takes $O(\log w)$ work per loop iteration. 55 56 In total, this is $O(m \log \sigma + \log w)$ work per iteration. 57 In current practice, we have $\log w$ around 8 (for 256bit architectures), 58 and $\log \sigma$ at least 7. 59 Thus, $m \log \sigma$ will dominate $\log w$ 60 with current and forseeable technologywe do not expect to see $\log w$ 61 skyrocket. 62 So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per 63 iteration. 64 We multiply this by $O(\frac{n}{w})$ iterations to give 65 $O(\frac{n m \log \sigma}{w})$ work. 66 67 We further note that all of the work in the loop is 46 68 done by superscalar instructions, with the exception of the additions, 47 69 which require carry propagation. 48 70 There will be at most $C$ of these additions in the loop, where $C$ is the 49 71 number of concatenation and Kleene star operations in the regular 50 expression .72 expression; $C < m$. 51 73 52 74 Almost all intermediate bitstreams in the loop body can be kept in … … 72 94 set, 73 95 is accessed. 74 Since there are $2^{\Theta(m)}$ state sets, the transition table canhave75 $\sigma 2^{\Theta(m)}$ entries , where $\sigma$ is the size of the input76 alphabet.96 Since there are $2^{\Theta(m)}$ state sets, the transition table will have 97 $\sigma 2^{\Theta(m)}$ entries. 98 %%, where $\sigma$ is the size of the input alphabet. 77 99 Each entry is a new state set, which requires $\frac{m}{8}$ bytes. 78 100 Thus, the transition table is of size $\sigma m 2^{\Theta(m)}$, which is … … 93 115 each with $\sigma 2^k$ 94 116 entries of $\frac{m}{8}$ bytes apiece. 95 Each table therefore takes up $ 2^{k3} m\sigma$ bytes,117 Each table therefore takes up $ m 2^{k3} \sigma$ bytes, 96 118 and so the collection of them takes up $\frac{m^2 2^{k3}\sigma}{k}$ bytes. 97 119 At each character, the NFA algorithm does one lookup in each table, … … 107 129 giving $2$ tables of $2^{\frac{m}{2}3} m \sigma$ bytes each, 108 130 two lookups per character, and 1 boolean OR operation per character to 109 combine 110 the lookups. 111 112 113 114 131 combine the lookups. 132 133 In Table \ref{tab:ascii}, we summarize the theoretical analysis of these NFA 134 methods, listing the number of table lookups per input character and the size of the 135 tables for various values of $m$, the number of states. 136 We assume the ASCII character set ($\sigma = 128$); any of the 137 UTF character sets would yield larger tables. 138 139 \begin{table} 140 \small{ 141 \begin{tabular}{rrrrrrr} 142 %% & & Thompson & \ & NavRaff & NFA \\ 143 $k$ & & $1$ & $4$ & $8$ & $\frac{m}{2}$ & $m$\\ 144 %%\hline 145 lookups & & $m$ & $\frac{m}{4}$ & $\frac{m}{8}$ & 2 & 1\\ 146 \hline 147 & $m$ & & & & \\ 148 \multirow{4}{1.35cm}{memory (KiB)} 149 & 5 & 0.8 & 1.6 & 12.5 & 1.3 & 2.5\\ 150 & 10 & 3.1 & 6.2 & 50.0 & 10.0 & 160.0\\ 151 & 15 & 7.0 & 14.1 & 112.5 & 120.0 & 7680.0\\ 152 & 20 & 12.5 & 25.0 & 200.0 & 640.0 & 327680.0\\ 153 & 25 & 19.5 & 39.1 & 312.5 & 6400.0 & 13107200.0\\ 154 \end{tabular} 155 } 156 \caption{lookups per character and memory consumed by tables in NFA methods 157 (in kibibytes)} 158 \label{tab:ascii} 159 \end{table} 160 161 Of particular importance to the speed of NFA methods is whether the 162 table lookups result in cache hits or not. 163 If the tables are small enough, then they will fit into cache and 164 lookups will all be cache hits, taking minimal time. 165 In this case, the time per input character will be a small constant times the 166 number of lookups. 167 168 If the tables are not small enough to fit into cache, some proportion 169 of the lookups will generate cache misses. 170 This will stall the processor and these stalls will come to dominate the 171 computation time. 172 In this case, the time per input character will be some large constant 173 (a cache miss can take about two orders of magnitude longer than a cache hit) 174 times the number of lookups. 175 176 Using 256KiB as an estimate of the size of a current standard data cache, 177 we can consider those entries of Table \ref{tab:ascii} above 256 to be 178 relatively slow. 179 We can summarize these theoretical predictions by saying that the NFA methods 180 with small $k$ scale well with an increase in NFA states, but with large $k$ the method is 181 limited to a small number of states. 182 183 We can now directly (but roughly) compare the NFA methods with bitstream 184 methods. 185 Consider small$k$ (say, $k <= 4$) NFA methods. 186 For the reasonable range of $m$, the tables fit into cache. 187 The running time is predicted to be a small constant times the 188 $\frac{m}{k} >= \frac{m}{4}$ lookups. 189 The small constant, which we will underapproximate with 4 cycles, is for the 190 table addressing computation, combining the lookups with boolean OR, and final state 191 detection and handling. 192 Thus, the running time per input character may be lowerbounded by 193 $4*\frac{m}{4}$, or simply $m$, cycles. 194 195 Our method, on the other hand, takes time $O(\frac{m \log \sigma}{w})$ per 196 input character, where the constant inside the bigOh is approximately 2. 197 Furthermore, we expect no cache misses due to the regular stride of our memory 198 accesses. 199 For ASCII, this time becomes at most $2 \frac{7m}{w} = \frac{14m}{w}$ cycles 200 per character. 201 This is faster than the small$k$ NFA methods by a factor of $w/14$. 202 For processors with a 128bit word, this is about one order of magnitude. 203 204 205 206 207
Note: See TracChangeset
for help on using the changeset viewer.