source: docs/Working/re/re-main.tex @ 3633

Last change on this file since 3633 was 3633, checked in by lindanl, 5 years ago

GPU section

File size: 32.2 KB
Line 
1\documentclass[pageno]{jpaper}
2
3%replace XXX with the submission number you are given from the PACT submission site.
4\newcommand{\pactsubmissionnumber}{XXX}
5
6\usepackage[normalem]{ulem}
7\usepackage{amssymb}
8\usepackage{amsmath}
9\usepackage{graphicx}
10\usepackage{tikz}
11\usepackage{pgfplots}
12
13\begin{document}
14
15\title{Bitwise Data Parallelism in Regular Expression Matching}
16
17
18\date{}
19\maketitle
20
21\thispagestyle{empty}
22
23
24\begin{abstract}
25\input{abstract}
26\end{abstract}
27
28\section{Introduction}
29
30The use of regular expressions to search texts for occurrences
31of string patterns has a long history and
32remains a pervasive technique throughout computing applications today.
33% {\em a brief history}
34The origins of regular expression matching date back to automata theory
35developed by Kleene in the 1950s \cite{kleene1951}.
36Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
37to nondeterministic finite automata (NFA).
38Following Thompson's approach, a regular expression of length $m$ is converted
39to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
40NFA in worst case $O(mn)$ time. Often, a more efficient choice
41is to convert an NFA into a DFA. A DFA has a single active state at any time
42throughout the matching process and
43hence it is possible to search a text at of length $n$ in $O(n)$ time
44\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result
45in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
46
47A significant portion of the research in fast regular expression matching can be
48regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
49In \cite{baeza1992new}, Baeza-Yates and Gonnet
50presented a new approach to string search based on bit-parallelism.
51This technique takes advantage of the intrinsic parallelism of bitwise operations
52within a computer word.
53Given a $w$-bit word, the number of operations that a string search algorithms
54performs can be reduced by a factor $w$.
55Using this fact, the Shift-Or algorithm simulates an NFA using
56bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
57A disadvantage of the bit-parallel Shift-Or pattern matching approach
58is an inability to skip input characters.
59Simple string matching algorithms,
60such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical} skip input characters
61to achieve sublinear times in the average case.
62Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text} 
63based on suffix automata are able to skip characters.
64The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
65combines the bit-parallel advantages of the Shift-Or approach
66with the character skipping property of BDM algorithms. The nrgrep pattern matching tool,
67is built over the BNDM algorithm. Prior to the data-parallel approach to simultaneous
68processing of data stream elements as discussed herein, nrgrep
69was by far the fastest grep tool
70for matching complex patterns and achieved similar performance
71to that of the fastest existing string
72matching tools for simple patterns \cite{navarro2000}.
73
74There has been considerable
75interest in accelerating regular expression matching
76on parallel hardware
77such as multi-core processors (CPUs), graphics processing units (GPUs),
78field-programmable gate arrays (FPGAs), and specialized architectures such as
79the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
80%CPU
81Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
82text processing algorithms that exhibit irregular memory access patterns
83can be efficiently executed on multicore hardware.
84In related work, Pasetto et al. presented a flexible tool that
85performs small-ruleset regular expression matching at a rate of
862.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
87Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
88string matching algorithm \cite{aho1975} is well suited for parallel
89implementation on multi-core CPUs, GPUs and the Cell BE.
90On each hardware, both thread-level parallelism (cores) and data-level parallelism
91(SIMD units) were leveraged for performance.
92Salapura et. al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
93in business analytics applications and leveraged the SIMD hardware available
94on multi-core processors to acheive a speedup of greater than 1.8 over a
95range of data sizes of interest.
96%Cell
97In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
98that delivered 1.00--1.78 Gbps on a single
99Cell BE chip and extended this approach for emulation on the Intel Larrabee
100instruction set \cite{scarpazza2009larrabee}.
101On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
102implementation that delivered a throughput of 40
103Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
104Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
105presented a string matching implementation for automata that achieved
1064 Gbps on the Cell BE.
107% GPU
108In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
109implementation of the AC algorithm for
110accelerating string matching on GPUs. Lin et al., proposed
111the Parallel Failureless Aho-Corasick (PFAC)
112algorithm to accelerate pattern matching on GPU hardware and
113achieved 143 Gbps throughput, 14.74 times faster
114than the AC algorithm performed on a four core
115multi-core processor using OpenMP \cite{lin2013accelerating}.
116
117Whereas the existing approaches to parallelization have been
118based on adapting traditional sequential algorithms to emergent
119parallel architectures, we introduce both a new algorithmic
120approach and its implementation on SIMD and GPU architectures.
121This approach relies on a bitwise data parallel view of text
122streams as well as a surprising use of addition to match
123runs of characters in a single step.  The closest previous
124work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
125technology together with a parallel scanning primitive also
126based on addition \cite{cameron2011parallel}.   
127However, in contrast to the deterministic, longest-match
128scanning associated with the ScanThru primitive of that
129work, we introduce here a new primitive MatchStar
130that can be used in full generality for nondeterministic
131regular expression matching.   We also introduce a long-stream
132addition technique involving a further application of MatchStar
133that enables us to scale the technique to $n$-bit addition
134in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
135for example, to perform
136synchronized 4096-bit addition on GPU warps of 64 threads.
137
138There is also a strong keyword match between the bit-parallel
139data streams used in our approach and the bit-parallelism
140used for NFA state transitions in the classical algorithms of
141Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new}
142and Navarro and Raffinot \cite{navarro1998bit}.
143However those algorithms use bit-parallelism in a fundamentally
144different way: representing all possible current NFA states
145as a bit vector and performing parallel transitions to a new
146set of states using table lookups and bitwise logic.    Whereas
147our approach can match multiple characters per step, bit-parallel
148NFA algorithms proceed through the input one byte at a time.
149Nevertheless, the agrep \cite{wu1992agrep} and
150nrgrep \cite{navarro2000} programs implemented using these techniques remain
151among the strongest competitors in regular expression matching
152performance, so we include them in our comparative evaluation.
153
154The remainder of this paper is organized as follows.
155Section \ref{sec:grep} briefly describes regular expression
156notation and the grep problem.
157Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
158using a model of arbitrary-length bit-parallel data streams.
159Section \ref{sec:blockwise} discusses the block-by-block
160implementation of our techniques including the long stream
161addition techniques for 256-bit addition with AVX2 and
1624096-bit additions with GPGPU SIMT.
163Section \ref{sec:SSE2} describes our overall SSE2 implementation
164and carries out a performance study in comparison with
165existing grep implementations.
166Given the dramatic variation in grep performance across
167different implementation techniques, expressions and data sets,
168Section \ref{sec:analysis} considers a comparison between
169the bit-stream and NFA approaches from a theoretical perspective.
170Section \ref{sec:AVX2} then examines and demonstrates
171the scalability of our
172bitwise data-parallel approach in moving from
173128-bit to 256-bit SIMD on Intel Haswell architecture.
174To further investigate scalability, Section \ref{sec:GPU} 
175addresses the implementation of our matcher using
176groups of 64 threads working together SIMT-style on a GPGPU system.
177Section \ref{sec:Concl} concludes the paper with a discussion of results and
178areas for future work.
179
180\section{Regular Expression Notation and Grep}\label{sec:grep}
181
182We follow common POSIX notation for regular expressions.
183A regular expression specifies a set of strings through
184a pattern notation.   Individual characters normally
185stand for themselves, unless they are one of the
186special characters \verb:*+?[{\(|^$.: that serve as metacharacters
187of the notation system.  Thus the regular expression \verb:cat:
188is a pattern for the set consisting of the single 3-character
189string ``\verb:cat:''.   The special characters must be escaped
190with a backslash to prevent interpretation as metacharacter, thus
191\verb:\$: represents the dollar-sign and \verb:\\\\: represent
192the string consisting of two backslash characters.
193Character class bracket expressions are pattern elements
194that allow any character in a given class to be used in a particular
195context.  For example, \verb:[@#%]: is a regular expression
196that stands for any of the three given symbols.  Contiguous
197ranges of characters may be specified using hyphens;
198for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
199for any alphanumeric character or underscore.  If the
200caret character immediately follows the opening bracket,
201the class is negated, thus \verb:[^0-9]: stands for
202any character except a digit.  The period metacharacter
203\verb:.: stands for the class of all characters.
204
205Consecutive pattern elements stand for strings formed by
206concatenation, thus \verb:[cd][ao][tg]: stands for the
207set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
208The alternation operator \verb:|: allows a pattern to be
209defined to have to alternative forms, thus \verb:cat|dog:
210matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
211takes precedence over alternation, but parenthesis may be
212used to change this, thus \verb:(ab|cd)[0-9]: stands for any
213digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
214
215Repetition operators may be appended to a pattern to specify
216a variable number of occurrences of that pattern. 
217The Kleene \verb:*: specifies zero-or-more occurrences
218matching the previous pattern, while \verb:+: specifies one-or
219more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
220both specify the same set: strings of at least one lower-case
221letter.  The postfix operator \verb:?: specifies an optional
222component, i.e., zero-or-one occurrence of strings matching
223the element.  Specific bounds may be given within braces:
224\verb:(ab){3}: specifies the string ``\verb:ababab:'',
225\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
226or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
227strings of at least 4 consecutive capital letters.
228
229The grep program searches a file for lines containing matches
230to a regular expression using any of the above notations.
231In addition, the pattern elements \verb:^: and \verb:$:
232may be used to match respectively the beginning or the
233end of a line.  In line-based tools such as grep, \verb:.:
234matches any character except newlines; matches cannot extend
235over lines.
236Normally, grep prints all matching
237lines to its output.  However, grep programs typically
238allow a command line flag such as \verb:-c: to specify
239that only a count of matching lines be produced; we use
240this option in our experimental evaluation to focus
241our comparisons on the performance of the underlying matching
242algorithms.
243
244\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
245
246Whereas the traditional approaches to regular expression matching
247using NFAs, DFAs or backtracking all rely on a byte-at-a-time
248processing model, the approach  we introduce in this paper is based
249on quite a different concept:  a data-parallel approach to simultaneous
250processing of data stream elements.  Indeed, our most abstract model
251is that of unbounded data parallelism: processing all elements of
252the input data stream simultaneously.   In essence, data streams are viewed
253as (very large) integers.   The fundamental operations are bitwise
254logic, stream shifting and long-stream addition.
255
256Depending on the available parallel processing resources, an actual
257implementation may divide an input stream into blocks  and process
258the blocks sequentially.   Within each block  all elements of the
259input stream are processed together, relying the availability of
260bitwise logic and addition scaled to the block size.   On commodity
261Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
262we typically process input streams 128 bytes at a time.   
263In this
264case, we rely on the Parabix tool chain \cite{lin2012parabix}
265to handle the details of compilation to block-by-block processing.
266On the
267latest processors supporting the 256-bit AVX2 SIMD operations,
268we also use the Parabix tool chain, but substitute a parallelized
269long-stream addition technique to avoid the sequential chaining
270of 4 64-bit additions.
271Our GPGPU implementation uses scripts to modify the output
272of the Parabix tools, effectively dividing the input into blocks
273of 4K bytes.   
274We also have adapted our long-stream addition technique
275to perform 4096-bit additions using 64 threads working in lock-step
276SIMT fashion.  A similar technique is known to the GPU programming
277community\cite{}.   
278 
279\begin{figure}[tbh]
280\begin{center}
281\begin{tabular}{cr}\\
282input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
283$B_7$ & \verb`.................................`\\
284$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
285$B_5$ & \verb`111111111111111111111111111111111`\\
286$B_4$ & \verb`.11111...111....1....11.....111..`\\
287$B_3$ & \verb`......11..11111.1111...11.....111`\\
288$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
289$B_1$ & \verb`...1....11.1....1........11.111..`\\
290$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
291\verb:[a]: & \verb`1..............1....1......1.....`\\
292\verb:[z]: & \verb`...........1....1.............1..`\\
293\verb:[0-9]: & \verb`.1111....11..........1......11...`
294\end{tabular}
295
296\end{center}
297\caption{Basis and Character Class Streams}
298\label{fig:streams}
299\end{figure}
300
301A key concept in this streaming approach is the derivation of bit streams
302that are parallel to the input data stream, i.e., in one-to-one
303correspondence with the data element positions of the input
304streams.   Typically, the input stream is a byte stream comprising
305the 8-bit character code units of a particular encoding such
306as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
307easily be used with wider code units such as the 16-bit code units of
308UTF-16.   In the case of a byte stream, the first step is to transpose
309the byte stream into eight parallel bit streams, such that bit stream
310$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
311a set of basis bit streams from which many other parallel bit
312streams can be calculated, such as character class bit
313streams such that each bit $j$ of the stream specifies
314whether character $j$ of the input stream is in the class
315or not.  Figure \ref{fig:streams} shows an example of an
316input byte stream in ASCII, the eight basis bit streams of the
317transposed representation, and the character class bit streams
318\verb:[a]:,
319\verb:[z]:, and
320\verb:[0-9]:
321that may be computed from the basis bit streams using bitwise logic.
322Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
323Transposition and character class construction are straightforward
324using the Parabix tool chain \cite{lin2012parabix}.
325
326\begin{figure}[tbh]
327\begin{center}
328\begin{tabular}{cr}\\
329input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
330$M_1$ & \verb`.1..............1....1......1....`\\
331$M_2$ & \verb`.11111..........1....11.....111..`\\
332$M_3$ & \verb`.................1.............1.`
333\end{tabular}
334
335\end{center}
336\caption{Marker Streams in Matching {\tt a[0-9]*z}}
337\label{fig:streams2}
338\end{figure}
339
340\paragraph*{Marker Streams.}  Now consider how bit-parallel data
341streams can be used in regular expression matching.   Consider
342the problem of searching the input stream of Figure \ref{fig:streams}
343to finding occurrence of strings matching
344the regular expression \verb:a[0-9]*z:.
345The matching process involves the concept of {\em marker streams}, that
346is streams that mark the positions of current matches during the
347overall process.  In this case there are three marker streams computed
348during the match process, namely,
349$M_1$ representing match positions after an initial \verb:a:
350character has been found, $M_2$ representing positions
351reachable from positions marked by $M_1$ by further matching zero or
352more digits (\verb:[0-9]*:) and finally $M_3$ the stream
353marking positions after a final \verb:z: has been found.
354Without describing the details of how these streams are computed
355for the time being, Figure \ref{fig:streams2} shows what each
356of these streams should be for our example matching problem.
357Note our convention that a marker stream contains a 1 bit
358at the next character position to be matched, that is,
359immediately past the last position that was matched.
360
361
362\paragraph*{MatchStar.}
363MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
364
365\begin{figure}[tbh]
366\begin{center}
367\begin{tabular}{cr}\\
368input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
369$M_1$ & \verb`.1..............1....1......1....`\\
370$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
371$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
372$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
373$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
374$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
375\end{tabular}
376
377\end{center}
378\caption{$M_2 = \text{MatchStar}(M_1, D)$}
379\label{fig:matchstar}
380\end{figure}
381
382
383Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
384it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
385conventional presentation of our character data input.   However, this reverses the normal
386order of presentation when considering bitstreams as numeric values.  The key point here is
387that when we perform bitstream addition, we will show bit movement from left-to-right.
388For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
389
390The first row of the figure is the input data,
391the second and third rows are the input bitstreams: the initial marker position bitstream and the
392character class bitstream for digits derived from input data. 
393
394In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
395The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
396output marker stream is obtained by ORing $T_2$ with the initial marker stream.
397
398In general, given a marker stream $M$ and a character class stream $C$,
399the operation of MatchStar is defined by the following equation. 
400\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
401Given a set of initial marker positions, the result stream marks
402all possible positions that can be reached by 0 or more occurrences
403of characters in class $C$ from each position in $M$
404
405\input{compilation}
406
407\input{re-Unicode}
408
409\section{Block-at-a-Time Processing}\label{sec:blockwise}
410
411The unbounded stream model of the previous section must of course
412be translated an implementation that proceeds block-at-a-time for
413realistic application.  In this, we primarily rely on the Pablo
414compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
415statements expressed as arbitrary-length bitstream equations, Pablo
416produces block-at-a-time C++ code that initializes and maintains all the necessary
417carry bits for each of the additions and shifts involved in the
418bitstream calculations.   
419
420In the present work, our principal contribution to the Parabix tool
421chain is to incorporate the technique of long-stream addition described below.
422Otherwise, we were able to use Pablo directly in compiling our
423SSE2 and AVX2 implementations.   Our GPU implementation required
424some scripting to modify the output of the Pablo compiler for our
425purpose.
426
427\paragraph*{Long-Stream Addition.}  The maximum word size for
428addition on commodity processors is typically 64 bits.  In order
429to implement long-stream addition for block sizes of 256 or larger,
430a method for propagating carries through the individual stages of
43164-bit addition is required.  However, the normal technique of
432sequential addition using add-with-carry instructions, for example,
433is far from ideal.
434
435We have developed a general model using SIMD methods for constant-time
436long-stream addition up to 4096 bits.   
437We assume the availability of the following SIMD/SIMT operations
438operating on vectors of $f$ 64-bit fields.
439\begin{itemize}
440\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
441in two vectors to produce a result vector of $f$ 64-bit fields.
442\item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields
443of \verb:x: each with the constant value -1 (all bits 1), producing
444an $f$-bit mask value,
445\item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit
446field into a single compressed $f$-bit mask value, and
447\item normal bitwise logic operations on $f$-bit masks, and
448\item  \verb#simd<64>::spread(x)# : distributing the bits of
449an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector.
450\end{itemize}
451
452In this model, the \verb#hsimd<64>::mask(X)# and
453\verb#simd<64>::spread(x)# model the minimum
454communication requirements between the parallel processing units
455(SIMD lanes or SIMT processors).    In essence, we just need
456the ability to quickly send and receive 1 bit of information
457per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
458gathers 1 bit from each of the processors to a central resource.
459After calculations on the gather bits are performed, we then
460just need an operation to invert the communication, i.e.,
461sending 1 bit each from the central processor to each of
462the parallel units.   There are a variety of ways in which
463these facilities may be implemented depending on the
464underlying architecture; details of our AVX2 and GPU implementations
465are presented later.   
466
467Given these operations, our method for long stream addition of
468two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
469\begin{enumerate}
470\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
471\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
472
473\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
474\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
475\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
476\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
477
478\item Compute an $f$-bit mask of carries generated for each of the
47964-bit additions of \verb:X: and \verb:Y:.
480\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
481
482\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
483an incoming carry bit.  This is the {\em bubble mask}.
484\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
485
486\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
487incremented to produce the final sum.  Here we find a new application of
488MatchStar.
489\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
490
491This is the key step.  The mask {\tt c} of outgoing carries must be
492shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
493with the next digit.  At the incoming position, the carry will
494increment the 64-bit digit.   However, if this digit is all ones (as
495signalled by the corresponding bit of bubble mask {\tt b}, then the addition
496will generate another carry.  In fact, if there is a sequence of
497digits that are all ones, then the carry must bubble through
498each of them.   This is just MatchStar.
499
500\item Compute the final result {\tt Z}.
501\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
502
503\end{enumerate}
504\begin{figure}
505\begin{center}
506\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
507{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
508{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
509{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
510{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
511{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
512{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
513{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
514{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
515{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
516{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
517{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
518\end{tabular}
519\end{center}
520\caption{Long Stream Addition}\label{fig:longadd}
521\end{figure}
522
523Figure \ref{fig:longadd} illustrates the process.  In the figure,
524we illustrate the process with 8-bit fields rather than 64-bit fields
525and show all field values in hexadecimal notation.  Note that
526two of the individual 8-bit additions produce carries, while two
527others produce {\tt FF} values that generate bubble bits.  The
528net result is that four of the original 8-bit sums must be
529incremented to produce the long stream result.
530
531A slight extension to the process produces a long-stream full adder
532that can be used in chained addition.  In this case, the
533adder must take an additional carry-in bit
534{\tt p} and produce a carry-out bit {\tt q}.
535This may be accomplished by incorporating {\tt p}
536in calculating the increment mask in the low bit position,
537and then extracting the carry-out {\tt q} from the high bit position.
538\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
539\[\text{\tt q} = \text{\tt i >> f}\]
540
541As described subsequently, we use a two-level long-stream addition technique
542in both our AVX2 and GPU implementations.  In principle, one can extend
543the technique to additional levels.  Using 64-bit adders throughout,
544$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
545A three-level scheme could coordinate
54664 groups each performing 4096-bit long additions in a two-level structure.
547However, whether there are reasonable architectures that can support fine-grained
548SIMT style coordinate at this level is an open question.
549
550Using the methods outlined, it is quite conceivable that instruction
551set extensions to support long-stream addition could be added for
552future SIMD and GPU processors.   Given the fundamental nature
553of addition as a primitive and its particular application to regular
554expression matching as shown herein, it seems reasonable to expect
555such instructions to become available.    Alternatively, it may
556be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)#
557\verb#simd<64>::spread(X)# operations are efficiently supported.
558
559
560\input{sse2}
561
562\input{analysis}
563
564\input{avx2}
565
566
567
568\section{GPU Implementation}\label{sec:GPU}
569
570To further assess the scalability of our regular expression matching
571using bit-parallel data streams, we implemented a GPGPU version
572in OpenCL.   
573We arranged for 64 work groups each having 64 threads.
574The size of work group and number of work groups is choosen
575to provide the best occupancy calculated by AMD App Profiler.
576Input files are divided in data parallel fashion among
577the 64 work groups.  Each work group carries out the regular
578expression matching operations 4096 bytes at a time using SIMT
579processing.  Figure \ref{fig:GPUadd} shows our implementation
580of long-stream addition on the GPU.  Each thread maintains
581its own carry and bubble values in shared memroy and performs
582synchronized updates with the other threads using a six-step
583parallel-prefix style process.
584
585
586We performed our test on an AMD Radeon HD A10-6800K APU machine.
587On the AMD Fusion systems, the input buffer is allocated in
588pinned memory to take advantage of the zero-copy memory regions
589where data can be read directly into this region by CPU
590and also accessed by GPU for further processing. Therefore,
591the expensive data transferring time that needed by traditional
592discrete GPUs is hinden and we compare only the kernel execution
593time with our SSE2 and AVX mplementations as shown in Figure
594\ref{fig:SSE-AVX-GPU}. The GPU version gives 30\% to 55\% performance
595improvement over SSE version and 10\% to 40\% performance
596improvement over AVX version.
597
598
599\begin{figure*}[tbh]
600\begin{center}\small
601\begin{verbatim}
602inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry, _
603                    _local BitBlock *bubble, BitBlock *group_carry, const int carryno){
604        BitBlock carry_mask;
605        BitBlock bubble_mask;
606
607        BitBlock partial_sum = a+b;
608        BitBlock gen = a&b;
609        BitBlock prop = a^b;
610        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
611        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);
612       
613        barrier(CLK_LOCAL_MEM_FENCE);
614        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
615                carry[idx] = carry[idx]|carry[idx^offset];
616                bubble[idx] = bubble[idx]|bubble[idx^offset];
617                barrier(CLK_LOCAL_MEM_FENCE);
618        }
619       
620        carry_mask = (carry[0]<<1)|group_carry[carryno];
621        bubble_mask = bubble[0];
622       
623        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
624        BitBlock inc = s | (s-carry_mask);
625        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
626        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
627        return rslt;
628}
629\end{verbatim}
630
631\end{center}
632\caption{OpenCL 4096-bit Addition}
633\label{fig:GPUadd}
634\end{figure*}
635
636
637\begin{figure}
638\begin{center}
639\begin{tikzpicture}
640\begin{axis}[
641xtick=data,
642ylabel=Running Time (ms per megabyte),
643xticklabels={@,Date,Email,URIorEmail,xquote},
644tick label style={font=\tiny},
645enlarge x limits=0.15,
646%enlarge y limits={0.15, upper},
647ymin=0,
648legend style={at={(0.5,-0.15)},
649anchor=north,legend columns=-1},
650ybar,
651bar width=7pt,
652]
653\addplot
654file {data/ssetime.dat};
655\addplot
656file {data/avxtime.dat};
657\addplot
658file {data/gputime.dat};
659
660\legend{SSE2,AVX2,GPU,Annot}
661\end{axis}
662\end{tikzpicture}
663\end{center}
664\caption{Running Time}\label{fig:SSE-AVX-GPU}
665\end{figure}
666
667
668
669
670
671
672
673
674\input{conclusion}
675
676
677
678%\appendix
679%\section{Appendix Title}
680
681%This is the text of the appendix, if you need one.
682
683
684This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
685MITACS, Inc.
686
687\bibliographystyle{IEEEtranS}
688\bibliography{reference}
689
690\end{document}
691
692
Note: See TracBrowser for help on using the repository browser.