source: docs/Working/re/ppopp-re.tex @ 3485

Last change on this file since 3485 was 3485, checked in by bhull, 6 years ago

New charts.

File size: 30.8 KB
Line 
1%-----------------------------------------------------------------------------
2%
3%               Template for sigplanconf LaTeX Class
4%
5% Name:         sigplanconf-template.tex
6%
7% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
8%               file for SIGPLAN conference proceedings.
9%
10% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
11%               sigplanconf-guide.pdf
12%
13% Author:       Paul C. Anagnostopoulos
14%               Windfall Software
15%               978 371-2316
16%               paul@windfall.com
17%
18% Created:      15 February 2005
19%
20%-----------------------------------------------------------------------------
21
22
23\documentclass{sigplanconf}
24
25% The following \documentclass options may be useful:
26
27% preprint      Remove this option only once the paper is in final form.
28% 10pt          To set in 10-point type instead of 9-point.
29% 11pt          To set in 11-point type instead of 9-point.
30% authoryear    To obtain author/year citation style instead of numeric.
31
32\usepackage{amsmath}
33\usepackage{pgfplots}
34
35\begin{document}
36
37\special{papersize=8.5in,11in}
38\setlength{\pdfpageheight}{\paperheight}
39\setlength{\pdfpagewidth}{\paperwidth}
40
41\conferenceinfo{PPoPP 2014}{February 15-19, 2013, Orland, Florida, United States} 
42\copyrightyear{2013} 
43\copyrightdata{978-1-nnnn-nnnn-n/yy/mm} 
44\doi{nnnnnnn.nnnnnnn}
45
46% Uncomment one of the following two, if you are not going for the
47% traditional copyright transfer agreement.
48
49%\exclusivelicense                % ACM gets exclusive license to publish,
50                                  % you retain copyright
51
52%\permissiontopublish             % ACM gets nonexclusive license to publish
53                                  % (paid open-access papers,
54                                  % short abstracts)
55
56\titlebanner{banner above paper title}        % These are ignored unless
57\preprintfooter{short description of paper}   % 'preprint' option specified.
58
59\title{Bitwise Data Parallelism in Regular Expression Matching}
60\subtitle{Subtitle Text, if any}
61
62\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
63           {Simon Fraser University}
64           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}
65
66\maketitle
67
68\begin{abstract}
69\input{abstract}
70\end{abstract}
71\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
72\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}
73
74% general terms are not compulsory anymore,
75% you may leave them out
76%\terms
77%term1, term2
78
79\keywords
80regular expression matching, grep, parallel bit stream technology
81
82\section{Introduction}
83
84The use of regular expressions to search texts for occurrences
85of string patterns has a long history and
86remains a pervasive technique throughout computing applications today.
87% {\em a brief history}
88The origins of regular expression matching date back to automata theory
89developed by Kleene in the 1950s \cite{kleene1951}.
90Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
91to nondeterministic finite automata (NFA).
92Following Thompson's approach, a regular expression of length m is first converted
93to an NFA with O(m) nodes. It is then possible to search a text of length n using the
94NFA in worst case O(mn) time. Often, a more efficient choice
95is to convert the NFA into a DFA. A DFA has only a single active state at any time
96in the matching process and
97hence it is possible to search a text at of length n in worst-case O(n) optimal.
98However, it is well known that the conversion of an NFA to an equivalent DFA may result
99in state explosion. That is, the number of resultant DFA states may increase exponentially.
100In \cite{Baeza-yates_anew} a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}.
101This technique takes advantage of the intrinsic parallelism of bitwise operations
102within a computer word. Given a w-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} algorithm uses the
103bit-parallel approach to
104simulate an NFA in O($nm/w$) worst-case time.
105
106A disadvantage of the bit-parallel Shift-Or pattern matching approach
107in comparison to simple string matching algorithms is an inability to skip input characters.
108For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skip input characters
109to achieve sublinear times in the average case. Backward Dawg Matching
110(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are able to skip characters.
111The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
112combines the bit-parallel advantages of Shift-Or and with the character skipping advantages of the BDM algorithm.
113The nrgrep pattern matching tool is built over the BNDM algorithm,
114and hence the name nrgrep \cite{navarro2000}.
115
116{\em a brief review} 
117There has been considerable interest in using parallelization techniques
118to improve the performance of regular expression matching on parallel hardware
119such as multi-core processors (CPUs), graphics processing units (GPUs),
120field-programmable gate arrays (FPGAs), and even more exotic architectures such as
121the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
122%CPU
123Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
124text processing algorithms that exhibit irregular memory access patterns
125can be efficiently executed on multicore hardware.
126In related work, Pasetto et al. presented a flexible tool that
127performs small-ruleset regular expression matching at a rate of
1282.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
129Naghmouchi et al. demonstrated that the Aho-Corasick (AC)
130string matching algorithm \cite{aho1975} is well suited for parallel
131implementation on multi-core CPUs, GPUs and the Cell BE \cite{scarpazza2011top, naghmouchi2010}.
132On each hardware, both thread-level parallelism (additional cores) and data-level parallelism
133(wide SIMD units) are leveraged for performance.
134Salapura et. al., advocated the use of vector-style processing for regular expressions
135in business analytics applications and leveraged the SIMD hardware available
136on multi-core processors to acheive a speedup of better than 1.8 over a
137range of data sizes of interest \cite{salapura2012accelerating}.
138%Cell
139In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
140that delivered 1.00–1.78 Gbps on a single
141Cell BE chip and extended this approach for emulation on the Intel Larrabee
142instruction set \cite{scarpazza2009larrabee}.
143On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
144implementation that delivered a throughput of 40
145Gbps for a small dictionary of approximately 100 patterns, and a throughput of 3.3-3.4
146Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
147presented a string matching implementation for automata that achieves
1484 Gbps on the Cell BE.
149% GPU
150In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
151implementation of the AC algorithm for
152accelerating string matching on GPUs. Lin et al., proposed
153the Parallel Failureless Aho-Corasick (PFAC)
154algorithm to accelerate pattern matching on GPU hardware and
155achieved 143 Gbps throughput, 14.74 times faster
156than the AC algorithm performed on a four core
157multi-core processor using OpenMP \cite{lin2013accelerating}.
158
159Whereas the existing approaches to parallelization have been
160based on adapting traditional sequential algorithms to emergent
161parallel architectures, we introduce both a new algorithmic
162approach and its implementation on SIMD and GPU architectures.
163This approach relies on a bitwise data parallel view of text
164streams as well as a surprising use of addition to match
165runs of characters in a sin`gle step.  The closest previous
166work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
167technology together with a parallel scanning primitive also
168based on addition \cite{cameron2011parallel}.   
169However, in contrast to the deterministic, longest-match
170scanning associated with the ScanThru primitive of that
171work, we introduce here a new primitive MatchStar
172that can be used in full generality for nondeterministic
173regular expression matching.   We also introduce a long-stream
174addition technique involving a further application of MatchStar
175that enables us to scale the technique to $n$-bit addition
176in $\lceil\lg_{64}{n}\rceil)$ steps.   We ultimately apply this technique,
177for example, to perform
178synchronized 4096-bit addition on GPU warps of 64 threads.
179
180There is also a strong keyword match between the bit-parallel
181data streams used in our approach and the bit-parallelism
182used for NFA state transitions in the classical algorithms of
183Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new}
184and Navarro and Raffinot \cite{navarro1998bit}.
185However those algorithms use bit-parallelism in a fundamentally
186different way: representing all possible current NFA states
187as a bit vector and performing parallel transitions to a new
188set of states using table lookups and bitwise logic.    Whereas
189our approach can match multiple characters per step, bit-parallel
190NFA algorithms proceed through the input one byte at a time.
191Nevertheless, the agrep \cite{wu1992agrep} and
192nrgrep \cite{navarro2000} programs implemented using these techniques remain
193among the strongest competitors in regular expression matching
194performance, so we include them in our comparative evaluation.
195
196
197The remainder of this paper is organized as follows.
198Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
199using a model of arbitrary-length bit-parallel data streams.
200Section \ref{sec:blockwise} discusses the block-by-block
201implementation of our techniques including the long stream
202addition techniques for 256-bit addition with AVX2 and
2034096-bit additions with GPGPU SIMT.
204Section \ref{sec:analysis} 
205Section \ref{sec:SSE2} 
206Section \ref{sec:AVX2} 
207Section \ref{sec:GPU} 
208Section \ref{sec:Concl} concludes the paper with a discussion of areas for future work.
209
210
211\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
212
213Whereas the traditional approaches to regular expression matching
214using NFAs, DFAs or backtracking all rely on a byte-at-a-time
215processing model, the approach  we introduce in this paper is based
216on quite a different concept:  a data-parallel approach to simultaneous
217processing of data stream elements.  Indeed, our most abstract model
218is that of unbounded data parallelism: processing all elements of
219the input data stream simultaneously.   In essence, we view
220data streams as (very large) integers.   The fundamental operations
221we apply are based on bitwise logic and long-stream addition.
222
223Depending on the available parallel processing resources, an actual
224implementation may divide an input stream into blocks  and process
225the blocks sequentially.   Within each block  all elements of the
226input stream are processed together, relying the availability of
227bitwise logic and addition scaled to the block size.   On commodity
228Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
229we typically process input streams 128 bytes at a time.   In this
230case, we rely on the Parabix tool chain \cite{lin2012parabix}
231to handle the details of compilation to block-by-block processing.
232As we show later, however, we have also adapted Parabix technology to processing
233blocks of 4K bytes at time in our GPGPU implementation,
234relying on the application of our long-stream addition technique
235to perform 4096-bit additions using 64 threads working in lock-step
236SIMT fashion each on 64-bit processors.
237
238A key concept in this streaming approach is the derivation of bit streams
239that are parallel to the input data stream, i.e., in one-to-one
240correspondence with the data element positions of the input
241streams.   Typically, the input stream is a byte stream comprising
242the 8-bit character code units of a particular encoding such
243as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
244easily be used with wider code units such as the 16-bit code units of
245UTF-16.   In the case of a byte stream, the first step is to transpose
246the byte stream into eight parallel bit streams, such that bit stream
247$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
248a set of basis bit streams from which many other parallel bit
249streams can be calculated, such as character class bit
250streams such that each bit $j$ of the stream specifies
251whether character $j$ of the input stream is in the class
252or not.  Figure \ref{fig:streams} shows an example of an
253input byte stream in ASCII, the eight basis bit streams of the
254transposed representation, and several character class bit streams
255that may be computed from the basis bit streams using bitwise logic.
256Transposition and character class construction are straightforward
257using the Parabix tool chain \cite{lin2012parabix}.
258
259\paragraph*{Marker Streams.}  Now consider how bit-parallel data
260streams can be used in regular expression matching.   Consider
261the problem of searching the input stream of Figure \ref{fig:streams}
262to finding occurrence of strings matching
263the regular expression \verb:a[0-9]*z:.
264The matching process involves the concept of {\em marker streams}, that
265is streams that mark the positions of current matches during the
266overall process.  In this case there are three marker streams computed
267during the match process, namely,
268$M_1$ representing match positions after an initial \verb:a:
269character has been found, $M_2$ representing positions
270reachable from positions marked by $M_1$ by further matching zero or
271more digits (\verb:[0-9]*:) and finally $M_3$ the stream
272marking positions after a final \verb:z: has been found.
273Without describing the details of how these streams are computed
274for the time being, Figure \ref{fig:streams} shows what each
275of these streams should be for our example matching problem.
276Note our convention that a marker stream contains a 1 bit
277at the next character position to be matched, that is,
278immediately past the last position that was matched.
279
280
281\paragraph*{MatchStar.}
282MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
283
284Figure \ref{fig:matchstar} illustrates the MatchStar method.  The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data.
285
286In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.  $T_1$ has 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
287output marker stream is obtained by ORing $T_2$ with the initial marker stream.
288
289\begin{figure*}[tbh]
290\begin{center}
291\begin{tabular}{cr}\\
292source data & \verb`--142578---125-231-----127--5394---94761205-`\\
293$M_0$ & \verb`.......1......1..1..1...1.............1..1..`\\
294$D = $\verb:[0-9]: & \verb`..111111...111.111.....111..1111...11111111.`\\
295$T_0 = M_0 \wedge D$ & \verb`.......1.........1......1.............1..1..`\\
296$T_1 = T_0 + D$ & \verb`.1.........1111.......1..1..1111..1...1...1.`\\
297$T_2 = T_1 \oplus D$ & \verb`.1111111......1111....111.........1111.111..`\\
298$M_1 = T_2 \, | \, M_0$ & \verb`.1111111......1111..1.111.........11111111..`
299\end{tabular}
300
301\end{center}
302\caption{Match Star}
303\label{fig:matchstar1}
304\end{figure*}
305
306
307In general, given a marker stream $M$ and a character class stream $C$,
308the operation of MatchStar is defined by the following equation. 
309\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
310Given a set of initial marker positions, the result stream marks
311all possible positions that can be reached by 0 or more occurrences
312of characters in class $C$ from each position in $M$
313
314
315
316
317\section{Block-at-a-Time Processing}\label{sec:blockwise} 
318
319The unbounded stream model of the previous section must of course
320be translated an implementation that proceeds block-at-a-time for
321realistic application.  In this, we primarily rely on the Pablo
322compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
323statements expressed as arbitrary-length bitstream equations, Pablo
324produces block-at-a-time C++ code that initializes and maintains all the necessary
325carry bits for each of the additions and shifts involved in the
326bitstream calculations.   
327
328In the present work, our principal contribution to the block-at-a-time
329model is the technique of long-stream addition described below.
330Otherwise, we were able to use Pablo directly in compiling our
331SSE2 and AVX2 implementations.   Our GPU implementation required
332some scripting to modify the output of the Pablo compiler for our
333purpose.
334
335\paragraph*{Long-Stream Addition.}  The maximum word size for
336addition on commodity processors is typically 64 bits.  In order
337to implement long-stream addition for block sizes of 256 or larger,
338a method for propagating carries through the individual stages of
33964-bit addition is required.  However, the normal technique of
340sequential addition using add-with-carry instructions, for example,
341is far from ideal.
342
343We have developed a technique using SIMD or SIMT methods for constant-time
344long-stream addition up to 4096 bits.   
345We assume the availability of the following SIMD/SIMT operations
346operating on vectors of $f$ 64-bit fields.
347\begin{itemize}
348\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
349in two vectors to produce a result vector of $f$ 64-bit fields.
350\item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields
351of \verb:x: each with the constant value -1 (all bits 1), producing
352an $f$-bit mask value,
353\item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit
354field into a single compressed $f$-bit mask value, and
355\item normal bitwise logic operations on $f$-bit masks.
356\item  \verb#simd<64>::spread(x)# : distributing the bits of
357an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector, and
358\end{itemize}
359
360Given these operations, our method for long stream addition of
361two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
362\begin{enumerate}
363\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
364\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
365
366\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
367\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
368\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
369\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
370
371\item Compute an $f$-bit mask of carries generated for each of the
37264-bit additions of \verb:X: and \verb:Y:.
373\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
374
375\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
376an incoming carry bit.  This is the {\em bubble mask}.
377\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
378
379\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
380incremented to produce the final sum.  Here we find a new application of
381MatchStar!
382\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
383
384This is the key step.  The mask {\tt c} of outgoing carries must be
385shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
386with the next digit.  At the incoming position, the carry will
387increment the 64-bit digit.   However, if this digit is all ones (as
388signalled by the corresponding bit of bubble mask {\tt b}, then the addition
389will generate another carry.  In fact, if there is a sequence of
390digits that are all ones, then the carry must bubble through
391each of them.   This is just MatchStar!
392
393\item Compute the final result {\tt Z}.
394\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
395
396\end{enumerate}
397\begin{figure}
398\begin{center}
399\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
400{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
401{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
402{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
403{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
404{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
405{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
406{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
407{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
408{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
409{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
410{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
411\end{tabular}
412\end{center}
413\caption{Long Stream Addition}\label{fig:longadd}
414\end{figure}
415
416Figure \ref{fig:longadd} illustrates the process.  In the figure,
417we illustrate the process with 8-bit fields rather than 64-bit fields
418and show all field values in hexadecimal notation.  Note that
419two of the individual 8-bit additions produce carries, while two
420others produce {\tt FF} values that generate bubble bits.  The
421net result is that four of the original 8-bit sums must be
422incremented to produce the long stream result.
423
424A slight extension to the process produces a long-stream full adder
425that can be used in chained addition.  In this case, the
426adder must take an additional carry-in bit
427{\tt p} and produce a carry-out bit {\tt q}.
428This may be accomplished by incorporating {\tt p}
429in calculating the increment mask in the low bit position,
430and then extracting the carry-out {\tt q} from the high bit position.
431\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
432\[\text{\tt q} = \text{\tt i >> f}\]
433
434As described subsequently, we use a two-level long-stream addition technique
435in both our AVX2 and GPU implementations.  In principle, one can extend
436the technique to additional levels.  Using 64-bit adders throughout,
437$\lceil\lg_{64}{n}\rceil)$ steps are needed for $n$-bit addition.
438A three-level scheme could coordinate
43964 groups each performing 4096-bit long additions in a two-level structure.
440However, whether there are reasonable architectures that can support fine-grained
441SIMT style coordinate at this level is an open question.
442
443Using the methods outlined, it is quite conceivable that instruction
444set extensions to support long-stream addition could be added for
445future SIMD and GPU processors.   Given the fundamental nature
446of addition as a primitive and its novel application to regular
447expression matching as shown herein, it seems reasonable to expect
448such instructions to become available.
449\raggedbottom
450\section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis}
451
452\begin{enumerate}
453\item Operations
454\item Memory behaviour per input byte: note tables of DFA/NFA.
455
456Bille and Throup \em{Faster regular expression matching}\cite{bille2009faster}
457
458\end{enumerate}
459
460
461
462\section{Commodity SIMD Implementation and Experimental Evaluation}\label{sec:SSE2}
463\subsection{Implementation Notes}
464\subsection{Evaluation Methodology}
465\subsection{Comparison}
466\begin{figure}
467\begin{center}
468\begin{tikzpicture}
469\begin{axis}[
470xtick=data,
471ylabel=Cycles per Byte,
472xticklabels={@,Date,Email,URIorEmail,xquote},
473tick label style={font=\tiny},
474enlargelimits=0.15,
475legend style={at={(0.5,-0.15)},
476anchor=north,legend columns=-1},
477ymax=8,
478ybar,
479bar width=7pt,
480]
481\addplot
482file {data/cycles1.dat};
483\addplot
484file {data/cycles2.dat};
485\addplot
486file {data/cycles3.dat};
487
488\legend{Bitstreams,NRGrep,Grep,Annot}
489\end{axis}
490\end{tikzpicture}
491\end{center}
492\caption{Cycles per Byte}
493\end{figure}
494
495\begin{figure}
496\begin{center}
497\begin{tikzpicture}
498\begin{axis}[
499xtick=data,
500ylabel=Instructions per Byte,
501xticklabels={@,Date,Email,URIorEmail,xquote},
502tick label style={font=\tiny},
503enlargelimits=0.15,
504legend style={at={(0.5,-0.15)},
505anchor=north,legend columns=-1},
506ymax=16,
507ybar,
508bar width=7pt,
509]
510\addplot
511file {data/instructions1.dat};
512\addplot
513file {data/instructions2.dat};
514\addplot
515file {data/instructions3.dat};
516
517\legend{Bitstreams,NRGrep,Grep,Annot}
518\end{axis}
519\end{tikzpicture}
520\end{center}
521\caption{Instructions per Byte}
522\end{figure}
523
524\begin{figure}
525\begin{center}
526\begin{tikzpicture}
527\begin{axis}[
528xtick=data,
529ylabel=Instructions per Cycle,
530xticklabels={@,Date,Email,URIorEmail,xquote},
531tick label style={font=\tiny},
532enlargelimits=0.15,
533legend style={at={(0.5,-0.15)},
534anchor=north,legend columns=-1},
535ybar,
536bar width=7pt,
537]
538\addplot
539file {data/ipc1.dat};
540\addplot
541file {data/ipc2.dat};
542\addplot
543file {data/ipc3.dat};
544
545\legend{Bitstreams,NRGrep,Grep,Annot}
546\end{axis}
547\end{tikzpicture}
548\end{center}
549\caption{Instructions per Cycle}
550\end{figure}
551
552\begin{figure}
553\begin{center}
554\begin{tikzpicture}
555\begin{axis}[
556xtick=data,
557ylabel=Branch Misses per Byte,
558xticklabels={@,Date,Email,URIorEmail,xquote},
559tick label style={font=\tiny},
560enlargelimits=0.15,
561legend style={at={(0.5,-0.15)},
562anchor=north,legend columns=-1},
563ymax=0.03,
564ybar,
565bar width=7pt,
566]
567\addplot
568file {data/branch-misses1.dat};
569\addplot
570file {data/branch-misses2.dat};
571\addplot
572file {data/branch-misses3.dat};
573
574\legend{Bitstreams,NRGrep,Grep,Annot}
575\end{axis}
576\end{tikzpicture}
577\end{center}
578\caption{Branch Misses per Byte}
579\end{figure}
580
581\section{SIMD Scalability}\label{sec:AVX2}
582
583
584
585
586\begin{figure}
587\begin{center}
588\begin{tikzpicture}
589\begin{axis}[
590xtick=data,
591ylabel=Cycles per Byte,
592xticklabels={@,Date,Email,URIorEmail,xquote},
593tick label style={font=\tiny},
594enlargelimits=0.15,
595legend style={at={(0.5,-0.15)},
596anchor=north,legend columns=-1},
597ybar,
598bar width=7pt,
599]
600\addplot
601file {data/ssecycles.dat};
602\addplot
603file {data/avxcycles.dat};
604
605\legend{SSE2,AVX2,Annot}
606\end{axis}
607\end{tikzpicture}
608\end{center}
609\caption{Cycles per Byte}
610\end{figure}
611
612\begin{figure}
613\begin{center}
614\begin{tikzpicture}
615\begin{axis}[
616xtick=data,
617ylabel=Instructions per Byte,
618xticklabels={@,Date,Email,URIorEmail,xquote},
619tick label style={font=\tiny},
620enlargelimits=0.15,
621legend style={at={(0.5,-0.15)},
622anchor=north,legend columns=-1},
623ybar,
624bar width=7pt,
625]
626\addplot
627file {data/sseinstructions.dat};
628\addplot
629file {data/avxinstructions.dat};
630
631\legend{SSE2,AVX2,Annot}
632\end{axis}
633\end{tikzpicture}
634\end{center}
635\caption{Instructions per Byte}
636\end{figure}
637
638\begin{figure}
639\begin{center}
640\begin{tikzpicture}
641\begin{axis}[
642xtick=data,
643ylabel=Instructions per Cycle,
644xticklabels={@,Date,Email,URIorEmail,xquote},
645tick label style={font=\tiny},
646enlargelimits=0.15,
647legend style={at={(0.5,-0.15)},
648anchor=north,legend columns=-1},
649ybar,
650bar width=7pt,
651]
652\addplot
653file {data/sseipc.dat};
654\addplot
655file {data/avxipc.dat};
656
657
658\legend{SSE2,AVX2,Annot}
659\end{axis}
660\end{tikzpicture}
661\end{center}
662\caption{Instructions per Cycle}
663\end{figure}
664
665\begin{figure}
666\begin{center}
667\begin{tikzpicture}
668\begin{axis}[
669xtick=data,
670ylabel=Branch Misses per Byte,
671xticklabels={@,Date,Email,URIorEmail,xquote},
672tick label style={font=\tiny},
673enlargelimits=0.15,
674legend style={at={(0.5,-0.15)},
675anchor=north,legend columns=-1},
676ybar,
677bar width=7pt,
678]
679\addplot
680file {data/ssebranch-misses.dat};
681\addplot
682file {data/avxbranch-misses.dat};
683
684\legend{SSE2,AVX2,Annot}
685\end{axis}
686\end{tikzpicture}
687\end{center}
688\caption{Branch Misses per Byte}
689\end{figure}
690
691
692
693
694\subsection{AVX Stream Addition}
695 \begin{figure*}[tbh]
696\begin{center}
697\begin{verbatim}
698void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {
699  bitblock256_t all_ones = simd256<1>::constant<1>();
700  bitblock256_t gen = simd_and(x, y);
701  bitblock256_t prop = simd_xor(x, y);
702  bitblock256_t partial_sum = simd256<64>::add(x, y);
703  bitblock256_t carry = simd_or(gen, simd_andc(prop, partial_sum));
704  bitblock256_t bubble = simd256<64>::eq(partial_sum, all_ones);
705  uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in);
706  uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
707  uint64_t carry_scan_thru_bubbles = (carry_mask + bubble_mask) &~ bubble_mask;
708  uint64_t increments = carry_scan_thru_bubbles | (carry_scan_thru_bubbles - carry_mask);
709  carry_out = convert(increments >> 4);
710  uint64_t spread = 0x0000200040008001 * increments & 0x0001000100010001;
711  sum = simd256<64>::add(partial_sum, _mm256_cvtepu16_epi64(avx_select_lo128(convert(spread))));
712}
713
714\end{verbatim}
715
716\end{center}
717\caption{Match Star}
718\label{fig:matchstar1}
719\end{figure*}
720
721\section{GPU Implementation}\label{sec:GPU}
722\begin{figure}
723\begin{center}
724\begin{tikzpicture}
725\begin{axis}[
726xtick=data,
727ylabel=Running Time (ms per byte),
728xticklabels={@,Date,Email,URIorEmail,xquote},
729tick label style={font=\tiny},
730enlargelimits=0.15,
731legend style={at={(0.5,-0.15)},
732anchor=north,legend columns=-1},
733ybar,
734bar width=7pt,
735]
736\addplot
737file {data/ssetime.dat};
738\addplot
739file {data/avxtime.dat};
740\addplot
741file {data/gputime.dat};
742
743\legend{SSE2,AVX2,GPU,Annot}
744\end{axis}
745\end{tikzpicture}
746\end{center}
747\caption{Running Time}
748\end{figure}
749
750
751
752
753\section{Miscellaneous}
754\subsection{Skipping}
755\subsection{Unicode}
756
757\section{Conclusion}\label{sec:Concl}
758\subsection{Contributions}
759\begin{enumerate}
760\item New Algorithm Class for Regular Expression Matching
761\item MatchStar for Character Class Repetition
762\item Long Stream Addition
763\item Implementations showing performance and scalability
764\end{enumerate}
765\subsection{Future Work}
766\begin{enumerate}
767\item Substring capture
768\item Unicode character classes
769\item Nonregular regexp features: zero-width assertions, backreferences
770\item Multicore for ruleset parallelism
771\end{enumerate}
772
773
774
775%\appendix
776%\section{Appendix Title}
777
778%This is the text of the appendix, if you need one.
779
780\acks
781
782This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
783MITACS, Inc.
784
785% We recommend abbrvnat bibliography style.
786
787\bibliographystyle{abbrvnat}
788
789% The bibliography should be embedded for final submission.
790 
791\bibliography{reference}
792
793%\begin{thebibliography}{}
794%\softraggedright
795
796%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
797%P. Q. Smith, and X. Y. Jones. ...reference text...
798%
799%\end{thebibliography}
800
801
802\end{document}
803
804
Note: See TracBrowser for help on using the repository browser.