1 | %----------------------------------------------------------------------------- |
---|
2 | % |
---|
3 | % Template for sigplanconf LaTeX Class |
---|
4 | % |
---|
5 | % Name: sigplanconf-template.tex |
---|
6 | % |
---|
7 | % Purpose: A template for sigplanconf.cls, which is a LaTeX 2e class |
---|
8 | % file for SIGPLAN conference proceedings. |
---|
9 | % |
---|
10 | % Guide: Refer to "Author's Guide to the ACM SIGPLAN Class," |
---|
11 | % sigplanconf-guide.pdf |
---|
12 | % |
---|
13 | % Author: Paul C. Anagnostopoulos |
---|
14 | % Windfall Software |
---|
15 | % 978 371-2316 |
---|
16 | % paul@windfall.com |
---|
17 | % |
---|
18 | % Created: 15 February 2005 |
---|
19 | % |
---|
20 | %----------------------------------------------------------------------------- |
---|
21 | |
---|
22 | |
---|
23 | \documentclass{sigplanconf} |
---|
24 | |
---|
25 | % The following \documentclass options may be useful: |
---|
26 | |
---|
27 | % preprint Remove this option only once the paper is in final form. |
---|
28 | % 10pt To set in 10-point type instead of 9-point. |
---|
29 | % 11pt To set in 11-point type instead of 9-point. |
---|
30 | % authoryear To obtain author/year citation style instead of numeric. |
---|
31 | |
---|
32 | \usepackage{amsmath} |
---|
33 | \usepackage{pgfplots} |
---|
34 | |
---|
35 | \begin{document} |
---|
36 | |
---|
37 | \special{papersize=8.5in,11in} |
---|
38 | \setlength{\pdfpageheight}{\paperheight} |
---|
39 | \setlength{\pdfpagewidth}{\paperwidth} |
---|
40 | |
---|
41 | \conferenceinfo{PPoPP 2014}{February 15-19, 2013, Orland, Florida, United States} |
---|
42 | \copyrightyear{2013} |
---|
43 | \copyrightdata{978-1-nnnn-nnnn-n/yy/mm} |
---|
44 | \doi{nnnnnnn.nnnnnnn} |
---|
45 | |
---|
46 | % Uncomment one of the following two, if you are not going for the |
---|
47 | % traditional copyright transfer agreement. |
---|
48 | |
---|
49 | %\exclusivelicense % ACM gets exclusive license to publish, |
---|
50 | % you retain copyright |
---|
51 | |
---|
52 | %\permissiontopublish % ACM gets nonexclusive license to publish |
---|
53 | % (paid open-access papers, |
---|
54 | % short abstracts) |
---|
55 | |
---|
56 | \titlebanner{banner above paper title} % These are ignored unless |
---|
57 | \preprintfooter{short description of paper} % 'preprint' option specified. |
---|
58 | |
---|
59 | \title{Bitwise Data Parallelism in Regular Expression Matching} |
---|
60 | \subtitle{Subtitle Text, if any} |
---|
61 | |
---|
62 | \authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman} |
---|
63 | {Simon Fraser University} |
---|
64 | {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca} |
---|
65 | |
---|
66 | \maketitle |
---|
67 | |
---|
68 | \begin{abstract} |
---|
69 | \input{abstract} |
---|
70 | \end{abstract} |
---|
71 | \category{Theory of computation}{Formal languages and automata theory}{Regular languages} |
---|
72 | \category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data} |
---|
73 | |
---|
74 | % general terms are not compulsory anymore, |
---|
75 | % you may leave them out |
---|
76 | %\terms |
---|
77 | %term1, term2 |
---|
78 | |
---|
79 | \keywords |
---|
80 | regular expression matching, grep, parallel bit stream technology |
---|
81 | |
---|
82 | \section{Introduction} |
---|
83 | |
---|
84 | The use of regular expressions to search texts for occurrences |
---|
85 | of string patterns has a long history and |
---|
86 | remains a pervasive technique throughout computing applications today. |
---|
87 | % {\em a brief history} |
---|
88 | The origins of regular expression matching date back to automata theory |
---|
89 | developed by Kleene in the 1950s \cite{kleene1951}. |
---|
90 | Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions |
---|
91 | to nondeterministic finite automata (NFA). |
---|
92 | Following Thompson's approach, a regular expression of length m is first converted |
---|
93 | to an NFA with O(m) nodes. It is then possible to search a text of length n using the |
---|
94 | NFA in worst case O(mn) time. Often, a more efficient choice |
---|
95 | is to convert the NFA into a DFA. A DFA has only a single active state at any time |
---|
96 | in the matching process and |
---|
97 | hence it is possible to search a text at of length n in worst-case O(n) optimal. |
---|
98 | However, it is well known that the conversion of an NFA to an equivalent DFA may result |
---|
99 | in state explosion. That is, the number of resultant DFA states may increase exponentially. |
---|
100 | In \cite{Baeza-yates_anew} a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}. |
---|
101 | This technique takes advantage of the intrinsic parallelism of bitwise operations |
---|
102 | within a computer word. Given a w-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} algorithm uses the |
---|
103 | bit-parallel approach to |
---|
104 | simulate an NFA in O($nm/w$) worst-case time. |
---|
105 | |
---|
106 | A disadvantage of the bit-parallel Shift-Or pattern matching approach |
---|
107 | in comparison to simple string matching algorithms is an inability to skip input characters. |
---|
108 | For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skip input characters |
---|
109 | to achieve sublinear times in the average case. Backward Dawg Matching |
---|
110 | (BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are able to skip characters. |
---|
111 | The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} |
---|
112 | combines the bit-parallel advantages of Shift-Or and with the character skipping advantages of the BDM algorithm. |
---|
113 | The nrgrep pattern matching tool is built over the BNDM algorithm, |
---|
114 | and hence the name nrgrep \cite{navarro2000}. |
---|
115 | |
---|
116 | {\em a brief review} |
---|
117 | There has been considerable interest in using parallelization techniques |
---|
118 | to improve the performance of regular expression matching on parallel hardware |
---|
119 | such as multi-core processors (CPUs), graphics processing units (GPUs), |
---|
120 | field-programmable gate arrays (FPGAs), and even more exotic architectures such as |
---|
121 | the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory) |
---|
122 | %CPU |
---|
123 | Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that |
---|
124 | text processing algorithms that exhibit irregular memory access patterns |
---|
125 | can be efficiently executed on multicore hardware. |
---|
126 | In related work, Pasetto et al. presented a flexible tool that |
---|
127 | performs small-ruleset regular expression matching at a rate of |
---|
128 | 2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}. |
---|
129 | Naghmouchi et al. demonstrated that the Aho-Corasick (AC) |
---|
130 | string matching algorithm \cite{aho1975} is well suited for parallel |
---|
131 | implementation on multi-core CPUs, GPUs and the Cell BE \cite{scarpazza2011top, naghmouchi2010}. |
---|
132 | On each hardware, both thread-level parallelism (additional cores) and data-level parallelism |
---|
133 | (wide SIMD units) are leveraged for performance. |
---|
134 | Salapura et. al., advocated the use of vector-style processing for regular expressions |
---|
135 | in business analytics applications and leveraged the SIMD hardware available |
---|
136 | on multi-core processors to acheive a speedup of better than 1.8 over a |
---|
137 | range of data sizes of interest \cite{salapura2012accelerating}. |
---|
138 | %Cell |
---|
139 | In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer |
---|
140 | that delivered 1.00â1.78 Gbps on a single |
---|
141 | Cell BE chip and extended this approach for emulation on the Intel Larrabee |
---|
142 | instruction set \cite{scarpazza2009larrabee}. |
---|
143 | On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching |
---|
144 | implementation that delivered a throughput of 40 |
---|
145 | Gbps for a small dictionary of approximately 100 patterns, and a throughput of 3.3-3.4 |
---|
146 | Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008} |
---|
147 | presented a string matching implementation for automata that achieves |
---|
148 | 4 Gbps on the Cell BE. |
---|
149 | % GPU |
---|
150 | In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based |
---|
151 | implementation of the AC algorithm for |
---|
152 | accelerating string matching on GPUs. Lin et al., proposed |
---|
153 | the Parallel Failureless Aho-Corasick (PFAC) |
---|
154 | algorithm to accelerate pattern matching on GPU hardware and |
---|
155 | achieved 143 Gbps throughput, 14.74 times faster |
---|
156 | than the AC algorithm performed on a four core |
---|
157 | multi-core processor using OpenMP \cite{lin2013accelerating}. |
---|
158 | |
---|
159 | Whereas the existing approaches to parallelization have been |
---|
160 | based on adapting traditional sequential algorithms to emergent |
---|
161 | parallel architectures, we introduce both a new algorithmic |
---|
162 | approach and its implementation on SIMD and GPU architectures. |
---|
163 | This approach relies on a bitwise data parallel view of text |
---|
164 | streams as well as a surprising use of addition to match |
---|
165 | runs of characters in a sin`gle step. The closest previous |
---|
166 | work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD |
---|
167 | technology together with a parallel scanning primitive also |
---|
168 | based on addition \cite{cameron2011parallel}. |
---|
169 | However, in contrast to the deterministic, longest-match |
---|
170 | scanning associated with the ScanThru primitive of that |
---|
171 | work, we introduce here a new primitive MatchStar |
---|
172 | that can be used in full generality for nondeterministic |
---|
173 | regular expression matching. We also introduce a long-stream |
---|
174 | addition technique involving a further application of MatchStar |
---|
175 | that enables us to scale the technique to $n$-bit addition |
---|
176 | in $\lceil\lg_{64}{n}\rceil)$ steps. We ultimately apply this technique, |
---|
177 | for example, to perform |
---|
178 | synchronized 4096-bit addition on GPU warps of 64 threads. |
---|
179 | |
---|
180 | There is also a strong keyword match between the bit-parallel |
---|
181 | data streams used in our approach and the bit-parallelism |
---|
182 | used for NFA state transitions in the classical algorithms of |
---|
183 | Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new} |
---|
184 | and Navarro and Raffinot \cite{navarro1998bit}. |
---|
185 | However those algorithms use bit-parallelism in a fundamentally |
---|
186 | different way: representing all possible current NFA states |
---|
187 | as a bit vector and performing parallel transitions to a new |
---|
188 | set of states using table lookups and bitwise logic. Whereas |
---|
189 | our approach can match multiple characters per step, bit-parallel |
---|
190 | NFA algorithms proceed through the input one byte at a time. |
---|
191 | Nevertheless, the agrep \cite{wu1992agrep} and |
---|
192 | nrgrep \cite{navarro2000} programs implemented using these techniques remain |
---|
193 | among the strongest competitors in regular expression matching |
---|
194 | performance, so we include them in our comparative evaluation. |
---|
195 | |
---|
196 | |
---|
197 | The remainder of this paper is organized as follows. |
---|
198 | Section \ref{sec:bitwise} presents our basic algorithm and MatchStar |
---|
199 | using a model of arbitrary-length bit-parallel data streams. |
---|
200 | Section \ref{sec:blockwise} discusses the block-by-block |
---|
201 | implementation of our techniques including the long stream |
---|
202 | addition techniques for 256-bit addition with AVX2 and |
---|
203 | 4096-bit additions with GPGPU SIMT. |
---|
204 | Section \ref{sec:analysis} |
---|
205 | Section \ref{sec:SSE2} |
---|
206 | Section \ref{sec:AVX2} |
---|
207 | Section \ref{sec:GPU} |
---|
208 | Section \ref{sec:Concl} concludes the paper with a discussion of areas for future work. |
---|
209 | |
---|
210 | |
---|
211 | \section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise} |
---|
212 | |
---|
213 | Whereas the traditional approaches to regular expression matching |
---|
214 | using NFAs, DFAs or backtracking all rely on a byte-at-a-time |
---|
215 | processing model, the approach we introduce in this paper is based |
---|
216 | on quite a different concept: a data-parallel approach to simultaneous |
---|
217 | processing of data stream elements. Indeed, our most abstract model |
---|
218 | is that of unbounded data parallelism: processing all elements of |
---|
219 | the input data stream simultaneously. In essence, we view |
---|
220 | data streams as (very large) integers. The fundamental operations |
---|
221 | we apply are based on bitwise logic and long-stream addition. |
---|
222 | |
---|
223 | Depending on the available parallel processing resources, an actual |
---|
224 | implementation may divide an input stream into blocks and process |
---|
225 | the blocks sequentially. Within each block all elements of the |
---|
226 | input stream are processed together, relying the availability of |
---|
227 | bitwise logic and addition scaled to the block size. On commodity |
---|
228 | Intel and AMD processors with 128-bit SIMD capabilities (SSE2), |
---|
229 | we typically process input streams 128 bytes at a time. In this |
---|
230 | case, we rely on the Parabix tool chain \cite{lin2012parabix} |
---|
231 | to handle the details of compilation to block-by-block processing. |
---|
232 | As we show later, however, we have also adapted Parabix technology to processing |
---|
233 | blocks of 4K bytes at time in our GPGPU implementation, |
---|
234 | relying on the application of our long-stream addition technique |
---|
235 | to perform 4096-bit additions using 64 threads working in lock-step |
---|
236 | SIMT fashion each on 64-bit processors. |
---|
237 | |
---|
238 | A key concept in this streaming approach is the derivation of bit streams |
---|
239 | that are parallel to the input data stream, i.e., in one-to-one |
---|
240 | correspondence with the data element positions of the input |
---|
241 | streams. Typically, the input stream is a byte stream comprising |
---|
242 | the 8-bit character code units of a particular encoding such |
---|
243 | as extended ASCII, ISO-8859-1 or UTF-8. However, the method may also |
---|
244 | easily be used with wider code units such as the 16-bit code units of |
---|
245 | UTF-16. In the case of a byte stream, the first step is to transpose |
---|
246 | the byte stream into eight parallel bit streams, such that bit stream |
---|
247 | $i$ comprises the $i^\text{th}$ bit of each byte. These streams form |
---|
248 | a set of basis bit streams from which many other parallel bit |
---|
249 | streams can be calculated, such as character class bit |
---|
250 | streams such that each bit $j$ of the stream specifies |
---|
251 | whether character $j$ of the input stream is in the class |
---|
252 | or not. Figure \ref{fig:streams} shows an example of an |
---|
253 | input byte stream in ASCII, the eight basis bit streams of the |
---|
254 | transposed representation, and several character class bit streams |
---|
255 | that may be computed from the basis bit streams using bitwise logic. |
---|
256 | Transposition and character class construction are straightforward |
---|
257 | using the Parabix tool chain \cite{lin2012parabix}. |
---|
258 | |
---|
259 | \paragraph*{Marker Streams.} Now consider how bit-parallel data |
---|
260 | streams can be used in regular expression matching. Consider |
---|
261 | the problem of searching the input stream of Figure \ref{fig:streams} |
---|
262 | to finding occurrence of strings matching |
---|
263 | the regular expression \verb:a[0-9]*z:. |
---|
264 | The matching process involves the concept of {\em marker streams}, that |
---|
265 | is streams that mark the positions of current matches during the |
---|
266 | overall process. In this case there are three marker streams computed |
---|
267 | during the match process, namely, |
---|
268 | $M_1$ representing match positions after an initial \verb:a: |
---|
269 | character has been found, $M_2$ representing positions |
---|
270 | reachable from positions marked by $M_1$ by further matching zero or |
---|
271 | more digits (\verb:[0-9]*:) and finally $M_3$ the stream |
---|
272 | marking positions after a final \verb:z: has been found. |
---|
273 | Without describing the details of how these streams are computed |
---|
274 | for the time being, Figure \ref{fig:streams} shows what each |
---|
275 | of these streams should be for our example matching problem. |
---|
276 | Note our convention that a marker stream contains a 1 bit |
---|
277 | at the next character position to be matched, that is, |
---|
278 | immediately past the last position that was matched. |
---|
279 | |
---|
280 | |
---|
281 | \paragraph*{MatchStar.} |
---|
282 | MatchStar takes a marker bitstream and a character class bitstream as input. It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream. |
---|
283 | |
---|
284 | Figure \ref{fig:matchstar} illustrates the MatchStar method. The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data. |
---|
285 | |
---|
286 | In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic. Next, the temporary marker bitstream is added to the character class bitstream. $T_1$ has 1s in three types of positions. There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions. Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output. These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$. The |
---|
287 | output marker stream is obtained by ORing $T_2$ with the initial marker stream. |
---|
288 | |
---|
289 | \begin{figure*}[tbh] |
---|
290 | \begin{center} |
---|
291 | \begin{tabular}{cr}\\ |
---|
292 | source data & \verb`--142578---125-231-----127--5394---94761205-`\\ |
---|
293 | $M_0$ & \verb`.......1......1..1..1...1.............1..1..`\\ |
---|
294 | $D = $\verb:[0-9]: & \verb`..111111...111.111.....111..1111...11111111.`\\ |
---|
295 | $T_0 = M_0 \wedge D$ & \verb`.......1.........1......1.............1..1..`\\ |
---|
296 | $T_1 = T_0 + D$ & \verb`.1.........1111.......1..1..1111..1...1...1.`\\ |
---|
297 | $T_2 = T_1 \oplus D$ & \verb`.1111111......1111....111.........1111.111..`\\ |
---|
298 | $M_1 = T_2 \, | \, M_0$ & \verb`.1111111......1111..1.111.........11111111..` |
---|
299 | \end{tabular} |
---|
300 | |
---|
301 | \end{center} |
---|
302 | \caption{Match Star} |
---|
303 | \label{fig:matchstar1} |
---|
304 | \end{figure*} |
---|
305 | |
---|
306 | |
---|
307 | In general, given a marker stream $M$ and a character class stream $C$, |
---|
308 | the operation of MatchStar is defined by the following equation. |
---|
309 | \[\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) | M\] |
---|
310 | Given a set of initial marker positions, the result stream marks |
---|
311 | all possible positions that can be reached by 0 or more occurrences |
---|
312 | of characters in class $C$ from each position in $M$. |
---|
313 | |
---|
314 | |
---|
315 | |
---|
316 | |
---|
317 | \section{Block-at-a-Time Processing}\label{sec:blockwise} |
---|
318 | |
---|
319 | The unbounded stream model of the previous section must of course |
---|
320 | be translated an implementation that proceeds block-at-a-time for |
---|
321 | realistic application. In this, we primarily rely on the Pablo |
---|
322 | compiler of the Parabix toolchain \cite{lin2012parabix}. Given input |
---|
323 | statements expressed as arbitrary-length bitstream equations, Pablo |
---|
324 | produces block-at-a-time C++ code that initializes and maintains all the necessary |
---|
325 | carry bits for each of the additions and shifts involved in the |
---|
326 | bitstream calculations. |
---|
327 | |
---|
328 | In the present work, our principal contribution to the block-at-a-time |
---|
329 | model is the technique of long-stream addition described below. |
---|
330 | Otherwise, we were able to use Pablo directly in compiling our |
---|
331 | SSE2 and AVX2 implementations. Our GPU implementation required |
---|
332 | some scripting to modify the output of the Pablo compiler for our |
---|
333 | purpose. |
---|
334 | |
---|
335 | \paragraph*{Long-Stream Addition.} The maximum word size for |
---|
336 | addition on commodity processors is typically 64 bits. In order |
---|
337 | to implement long-stream addition for block sizes of 256 or larger, |
---|
338 | a method for propagating carries through the individual stages of |
---|
339 | 64-bit addition is required. However, the normal technique of |
---|
340 | sequential addition using add-with-carry instructions, for example, |
---|
341 | is far from ideal. |
---|
342 | |
---|
343 | We have developed a technique using SIMD or SIMT methods for constant-time |
---|
344 | long-stream addition up to 4096 bits. |
---|
345 | We assume the availability of the following SIMD/SIMT operations |
---|
346 | operating on vectors of $f$ 64-bit fields. |
---|
347 | \begin{itemize} |
---|
348 | \item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields |
---|
349 | in two vectors to produce a result vector of $f$ 64-bit fields. |
---|
350 | \item \verb#simd<64>::eq(X, -1)# : comparison of the 64-bit fields |
---|
351 | of \verb:x: each with the constant value -1 (all bits 1), producing |
---|
352 | an $f$-bit mask value, |
---|
353 | \item \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit |
---|
354 | field into a single compressed $f$-bit mask value, and |
---|
355 | \item normal bitwise logic operations on $f$-bit masks. |
---|
356 | \item \verb#simd<64>::spread(x)# : distributing the bits of |
---|
357 | an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector, and |
---|
358 | \end{itemize} |
---|
359 | |
---|
360 | Given these operations, our method for long stream addition of |
---|
361 | two $f \times 64$ bit values \verb:X: and \verb:Y: is the following. |
---|
362 | \begin{enumerate} |
---|
363 | \item Form the vector of 64-bit sums of \verb:x: and \verb:y:. |
---|
364 | \[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \] |
---|
365 | |
---|
366 | \item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:. |
---|
367 | \[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \] |
---|
368 | \[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \] |
---|
369 | \[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \] |
---|
370 | |
---|
371 | \item Compute an $f$-bit mask of carries generated for each of the |
---|
372 | 64-bit additions of \verb:X: and \verb:Y:. |
---|
373 | \[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\] |
---|
374 | |
---|
375 | \item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with |
---|
376 | an incoming carry bit. This is the {\em bubble mask}. |
---|
377 | \[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\] |
---|
378 | |
---|
379 | \item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be |
---|
380 | incremented to produce the final sum. Here we find a new application of |
---|
381 | MatchStar! |
---|
382 | \[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\] |
---|
383 | |
---|
384 | This is the key step. The mask {\tt c} of outgoing carries must be |
---|
385 | shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated |
---|
386 | with the next digit. At the incoming position, the carry will |
---|
387 | increment the 64-bit digit. However, if this digit is all ones (as |
---|
388 | signalled by the corresponding bit of bubble mask {\tt b}, then the addition |
---|
389 | will generate another carry. In fact, if there is a sequence of |
---|
390 | digits that are all ones, then the carry must bubble through |
---|
391 | each of them. This is just MatchStar! |
---|
392 | |
---|
393 | \item Compute the final result {\tt Z}. |
---|
394 | \[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\] |
---|
395 | |
---|
396 | \end{enumerate} |
---|
397 | \begin{figure} |
---|
398 | \begin{center} |
---|
399 | \begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9} |
---|
400 | {\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9} |
---|
401 | {\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9} |
---|
402 | {\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9} |
---|
403 | {\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9} |
---|
404 | {\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9} |
---|
405 | {\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9} |
---|
406 | {\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9} |
---|
407 | {\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9} |
---|
408 | {\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9} |
---|
409 | {\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9} |
---|
410 | {\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9} |
---|
411 | \end{tabular} |
---|
412 | \end{center} |
---|
413 | \caption{Long Stream Addition}\label{fig:longadd} |
---|
414 | \end{figure} |
---|
415 | |
---|
416 | Figure \ref{fig:longadd} illustrates the process. In the figure, |
---|
417 | we illustrate the process with 8-bit fields rather than 64-bit fields |
---|
418 | and show all field values in hexadecimal notation. Note that |
---|
419 | two of the individual 8-bit additions produce carries, while two |
---|
420 | others produce {\tt FF} values that generate bubble bits. The |
---|
421 | net result is that four of the original 8-bit sums must be |
---|
422 | incremented to produce the long stream result. |
---|
423 | |
---|
424 | A slight extension to the process produces a long-stream full adder |
---|
425 | that can be used in chained addition. In this case, the |
---|
426 | adder must take an additional carry-in bit |
---|
427 | {\tt p} and produce a carry-out bit {\tt q}. |
---|
428 | This may be accomplished by incorporating {\tt p} |
---|
429 | in calculating the increment mask in the low bit position, |
---|
430 | and then extracting the carry-out {\tt q} from the high bit position. |
---|
431 | \[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\] |
---|
432 | \[\text{\tt q} = \text{\tt i >> f}\] |
---|
433 | |
---|
434 | As described subsequently, we use a two-level long-stream addition technique |
---|
435 | in both our AVX2 and GPU implementations. In principle, one can extend |
---|
436 | the technique to additional levels. Using 64-bit adders throughout, |
---|
437 | $\lceil\lg_{64}{n}\rceil)$ steps are needed for $n$-bit addition. |
---|
438 | A three-level scheme could coordinate |
---|
439 | 64 groups each performing 4096-bit long additions in a two-level structure. |
---|
440 | However, whether there are reasonable architectures that can support fine-grained |
---|
441 | SIMT style coordinate at this level is an open question. |
---|
442 | |
---|
443 | Using the methods outlined, it is quite conceivable that instruction |
---|
444 | set extensions to support long-stream addition could be added for |
---|
445 | future SIMD and GPU processors. Given the fundamental nature |
---|
446 | of addition as a primitive and its novel application to regular |
---|
447 | expression matching as shown herein, it seems reasonable to expect |
---|
448 | such instructions to become available. |
---|
449 | \raggedbottom |
---|
450 | \section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis} |
---|
451 | |
---|
452 | \begin{enumerate} |
---|
453 | \item Operations |
---|
454 | \item Memory behaviour per input byte: note tables of DFA/NFA. |
---|
455 | |
---|
456 | Bille and Throup \em{Faster regular expression matching}\cite{bille2009faster} |
---|
457 | |
---|
458 | \end{enumerate} |
---|
459 | |
---|
460 | |
---|
461 | |
---|
462 | \section{Commodity SIMD Implementation and Experimental Evaluation}\label{sec:SSE2} |
---|
463 | \subsection{Implementation Notes} |
---|
464 | \subsection{Evaluation Methodology} |
---|
465 | \subsection{Comparison} |
---|
466 | \begin{figure} |
---|
467 | \begin{center} |
---|
468 | \begin{tikzpicture} |
---|
469 | \begin{axis}[ |
---|
470 | xtick=data, |
---|
471 | ylabel=Cycles per Byte, |
---|
472 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
473 | tick label style={font=\tiny}, |
---|
474 | enlargelimits=0.15, |
---|
475 | legend style={at={(0.5,-0.15)}, |
---|
476 | anchor=north,legend columns=-1}, |
---|
477 | ymax=8, |
---|
478 | ybar, |
---|
479 | bar width=7pt, |
---|
480 | ] |
---|
481 | \addplot |
---|
482 | file {data/cycles1.dat}; |
---|
483 | \addplot |
---|
484 | file {data/cycles2.dat}; |
---|
485 | \addplot |
---|
486 | file {data/cycles3.dat}; |
---|
487 | |
---|
488 | \legend{Bitstreams,NRGrep,Grep,Annot} |
---|
489 | \end{axis} |
---|
490 | \end{tikzpicture} |
---|
491 | \end{center} |
---|
492 | \caption{Cycles per Byte} |
---|
493 | \end{figure} |
---|
494 | |
---|
495 | \begin{figure} |
---|
496 | \begin{center} |
---|
497 | \begin{tikzpicture} |
---|
498 | \begin{axis}[ |
---|
499 | xtick=data, |
---|
500 | ylabel=Instructions per Byte, |
---|
501 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
502 | tick label style={font=\tiny}, |
---|
503 | enlargelimits=0.15, |
---|
504 | legend style={at={(0.5,-0.15)}, |
---|
505 | anchor=north,legend columns=-1}, |
---|
506 | ymax=16, |
---|
507 | ybar, |
---|
508 | bar width=7pt, |
---|
509 | ] |
---|
510 | \addplot |
---|
511 | file {data/instructions1.dat}; |
---|
512 | \addplot |
---|
513 | file {data/instructions2.dat}; |
---|
514 | \addplot |
---|
515 | file {data/instructions3.dat}; |
---|
516 | |
---|
517 | \legend{Bitstreams,NRGrep,Grep,Annot} |
---|
518 | \end{axis} |
---|
519 | \end{tikzpicture} |
---|
520 | \end{center} |
---|
521 | \caption{Instructions per Byte} |
---|
522 | \end{figure} |
---|
523 | |
---|
524 | \begin{figure} |
---|
525 | \begin{center} |
---|
526 | \begin{tikzpicture} |
---|
527 | \begin{axis}[ |
---|
528 | xtick=data, |
---|
529 | ylabel=Instructions per Cycle, |
---|
530 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
531 | tick label style={font=\tiny}, |
---|
532 | enlargelimits=0.15, |
---|
533 | legend style={at={(0.5,-0.15)}, |
---|
534 | anchor=north,legend columns=-1}, |
---|
535 | ybar, |
---|
536 | bar width=7pt, |
---|
537 | ] |
---|
538 | \addplot |
---|
539 | file {data/ipc1.dat}; |
---|
540 | \addplot |
---|
541 | file {data/ipc2.dat}; |
---|
542 | \addplot |
---|
543 | file {data/ipc3.dat}; |
---|
544 | |
---|
545 | \legend{Bitstreams,NRGrep,Grep,Annot} |
---|
546 | \end{axis} |
---|
547 | \end{tikzpicture} |
---|
548 | \end{center} |
---|
549 | \caption{Instructions per Cycle} |
---|
550 | \end{figure} |
---|
551 | |
---|
552 | \begin{figure} |
---|
553 | \begin{center} |
---|
554 | \begin{tikzpicture} |
---|
555 | \begin{axis}[ |
---|
556 | xtick=data, |
---|
557 | ylabel=Branch Misses per Byte, |
---|
558 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
559 | tick label style={font=\tiny}, |
---|
560 | enlargelimits=0.15, |
---|
561 | legend style={at={(0.5,-0.15)}, |
---|
562 | anchor=north,legend columns=-1}, |
---|
563 | ymax=0.03, |
---|
564 | ybar, |
---|
565 | bar width=7pt, |
---|
566 | ] |
---|
567 | \addplot |
---|
568 | file {data/branch-misses1.dat}; |
---|
569 | \addplot |
---|
570 | file {data/branch-misses2.dat}; |
---|
571 | \addplot |
---|
572 | file {data/branch-misses3.dat}; |
---|
573 | |
---|
574 | \legend{Bitstreams,NRGrep,Grep,Annot} |
---|
575 | \end{axis} |
---|
576 | \end{tikzpicture} |
---|
577 | \end{center} |
---|
578 | \caption{Branch Misses per Byte} |
---|
579 | \end{figure} |
---|
580 | |
---|
581 | \section{SIMD Scalability}\label{sec:AVX2} |
---|
582 | |
---|
583 | |
---|
584 | |
---|
585 | |
---|
586 | \begin{figure} |
---|
587 | \begin{center} |
---|
588 | \begin{tikzpicture} |
---|
589 | \begin{axis}[ |
---|
590 | xtick=data, |
---|
591 | ylabel=Cycles per Byte, |
---|
592 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
593 | tick label style={font=\tiny}, |
---|
594 | enlargelimits=0.15, |
---|
595 | legend style={at={(0.5,-0.15)}, |
---|
596 | anchor=north,legend columns=-1}, |
---|
597 | ybar, |
---|
598 | bar width=7pt, |
---|
599 | ] |
---|
600 | \addplot |
---|
601 | file {data/ssecycles.dat}; |
---|
602 | \addplot |
---|
603 | file {data/avxcycles.dat}; |
---|
604 | |
---|
605 | \legend{SSE2,AVX2,Annot} |
---|
606 | \end{axis} |
---|
607 | \end{tikzpicture} |
---|
608 | \end{center} |
---|
609 | \caption{Cycles per Byte} |
---|
610 | \end{figure} |
---|
611 | |
---|
612 | \begin{figure} |
---|
613 | \begin{center} |
---|
614 | \begin{tikzpicture} |
---|
615 | \begin{axis}[ |
---|
616 | xtick=data, |
---|
617 | ylabel=Instructions per Byte, |
---|
618 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
619 | tick label style={font=\tiny}, |
---|
620 | enlargelimits=0.15, |
---|
621 | legend style={at={(0.5,-0.15)}, |
---|
622 | anchor=north,legend columns=-1}, |
---|
623 | ybar, |
---|
624 | bar width=7pt, |
---|
625 | ] |
---|
626 | \addplot |
---|
627 | file {data/sseinstructions.dat}; |
---|
628 | \addplot |
---|
629 | file {data/avxinstructions.dat}; |
---|
630 | |
---|
631 | \legend{SSE2,AVX2,Annot} |
---|
632 | \end{axis} |
---|
633 | \end{tikzpicture} |
---|
634 | \end{center} |
---|
635 | \caption{Instructions per Byte} |
---|
636 | \end{figure} |
---|
637 | |
---|
638 | \begin{figure} |
---|
639 | \begin{center} |
---|
640 | \begin{tikzpicture} |
---|
641 | \begin{axis}[ |
---|
642 | xtick=data, |
---|
643 | ylabel=Instructions per Cycle, |
---|
644 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
645 | tick label style={font=\tiny}, |
---|
646 | enlargelimits=0.15, |
---|
647 | legend style={at={(0.5,-0.15)}, |
---|
648 | anchor=north,legend columns=-1}, |
---|
649 | ybar, |
---|
650 | bar width=7pt, |
---|
651 | ] |
---|
652 | \addplot |
---|
653 | file {data/sseipc.dat}; |
---|
654 | \addplot |
---|
655 | file {data/avxipc.dat}; |
---|
656 | |
---|
657 | |
---|
658 | \legend{SSE2,AVX2,Annot} |
---|
659 | \end{axis} |
---|
660 | \end{tikzpicture} |
---|
661 | \end{center} |
---|
662 | \caption{Instructions per Cycle} |
---|
663 | \end{figure} |
---|
664 | |
---|
665 | \begin{figure} |
---|
666 | \begin{center} |
---|
667 | \begin{tikzpicture} |
---|
668 | \begin{axis}[ |
---|
669 | xtick=data, |
---|
670 | ylabel=Branch Misses per Byte, |
---|
671 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
672 | tick label style={font=\tiny}, |
---|
673 | enlargelimits=0.15, |
---|
674 | legend style={at={(0.5,-0.15)}, |
---|
675 | anchor=north,legend columns=-1}, |
---|
676 | ybar, |
---|
677 | bar width=7pt, |
---|
678 | ] |
---|
679 | \addplot |
---|
680 | file {data/ssebranch-misses.dat}; |
---|
681 | \addplot |
---|
682 | file {data/avxbranch-misses.dat}; |
---|
683 | |
---|
684 | \legend{SSE2,AVX2,Annot} |
---|
685 | \end{axis} |
---|
686 | \end{tikzpicture} |
---|
687 | \end{center} |
---|
688 | \caption{Branch Misses per Byte} |
---|
689 | \end{figure} |
---|
690 | |
---|
691 | |
---|
692 | |
---|
693 | |
---|
694 | \subsection{AVX Stream Addition} |
---|
695 | \begin{figure*}[tbh] |
---|
696 | \begin{center} |
---|
697 | \begin{verbatim} |
---|
698 | void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) { |
---|
699 | bitblock256_t all_ones = simd256<1>::constant<1>(); |
---|
700 | bitblock256_t gen = simd_and(x, y); |
---|
701 | bitblock256_t prop = simd_xor(x, y); |
---|
702 | bitblock256_t partial_sum = simd256<64>::add(x, y); |
---|
703 | bitblock256_t carry = simd_or(gen, simd_andc(prop, partial_sum)); |
---|
704 | bitblock256_t bubble = simd256<64>::eq(partial_sum, all_ones); |
---|
705 | uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in); |
---|
706 | uint64_t bubble_mask = hsimd256<64>::signmask(bubble); |
---|
707 | uint64_t carry_scan_thru_bubbles = (carry_mask + bubble_mask) &~ bubble_mask; |
---|
708 | uint64_t increments = carry_scan_thru_bubbles | (carry_scan_thru_bubbles - carry_mask); |
---|
709 | carry_out = convert(increments >> 4); |
---|
710 | uint64_t spread = 0x0000200040008001 * increments & 0x0001000100010001; |
---|
711 | sum = simd256<64>::add(partial_sum, _mm256_cvtepu16_epi64(avx_select_lo128(convert(spread)))); |
---|
712 | } |
---|
713 | |
---|
714 | \end{verbatim} |
---|
715 | |
---|
716 | \end{center} |
---|
717 | \caption{Match Star} |
---|
718 | \label{fig:matchstar1} |
---|
719 | \end{figure*} |
---|
720 | |
---|
721 | \section{GPU Implementation}\label{sec:GPU} |
---|
722 | \begin{figure} |
---|
723 | \begin{center} |
---|
724 | \begin{tikzpicture} |
---|
725 | \begin{axis}[ |
---|
726 | xtick=data, |
---|
727 | ylabel=Running Time (ms per byte), |
---|
728 | xticklabels={@,Date,Email,URIorEmail,xquote}, |
---|
729 | tick label style={font=\tiny}, |
---|
730 | enlargelimits=0.15, |
---|
731 | legend style={at={(0.5,-0.15)}, |
---|
732 | anchor=north,legend columns=-1}, |
---|
733 | ybar, |
---|
734 | bar width=7pt, |
---|
735 | ] |
---|
736 | \addplot |
---|
737 | file {data/ssetime.dat}; |
---|
738 | \addplot |
---|
739 | file {data/avxtime.dat}; |
---|
740 | \addplot |
---|
741 | file {data/gputime.dat}; |
---|
742 | |
---|
743 | \legend{SSE2,AVX2,GPU,Annot} |
---|
744 | \end{axis} |
---|
745 | \end{tikzpicture} |
---|
746 | \end{center} |
---|
747 | \caption{Running Time} |
---|
748 | \end{figure} |
---|
749 | |
---|
750 | |
---|
751 | |
---|
752 | |
---|
753 | \section{Miscellaneous} |
---|
754 | \subsection{Skipping} |
---|
755 | \subsection{Unicode} |
---|
756 | |
---|
757 | \section{Conclusion}\label{sec:Concl} |
---|
758 | \subsection{Contributions} |
---|
759 | \begin{enumerate} |
---|
760 | \item New Algorithm Class for Regular Expression Matching |
---|
761 | \item MatchStar for Character Class Repetition |
---|
762 | \item Long Stream Addition |
---|
763 | \item Implementations showing performance and scalability |
---|
764 | \end{enumerate} |
---|
765 | \subsection{Future Work} |
---|
766 | \begin{enumerate} |
---|
767 | \item Substring capture |
---|
768 | \item Unicode character classes |
---|
769 | \item Nonregular regexp features: zero-width assertions, backreferences |
---|
770 | \item Multicore for ruleset parallelism |
---|
771 | \end{enumerate} |
---|
772 | |
---|
773 | |
---|
774 | |
---|
775 | %\appendix |
---|
776 | %\section{Appendix Title} |
---|
777 | |
---|
778 | %This is the text of the appendix, if you need one. |
---|
779 | |
---|
780 | \acks |
---|
781 | |
---|
782 | This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and |
---|
783 | MITACS, Inc. |
---|
784 | |
---|
785 | % We recommend abbrvnat bibliography style. |
---|
786 | |
---|
787 | \bibliographystyle{abbrvnat} |
---|
788 | |
---|
789 | % The bibliography should be embedded for final submission. |
---|
790 | |
---|
791 | \bibliography{reference} |
---|
792 | |
---|
793 | %\begin{thebibliography}{} |
---|
794 | %\softraggedright |
---|
795 | |
---|
796 | %\bibitem[Smith et~al.(2009)Smith, Jones]{smith02} |
---|
797 | %P. Q. Smith, and X. Y. Jones. ...reference text... |
---|
798 | % |
---|
799 | %\end{thebibliography} |
---|
800 | |
---|
801 | |
---|
802 | \end{document} |
---|
803 | |
---|
804 | |
---|