%-----------------------------------------------------------------------------
%
%               Template for sigplanconf LaTeX Class
%
% Name:         sigplanconf-template.tex
%
% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
%               file for SIGPLAN conference proceedings.
%
% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
%               sigplanconf-guide.pdf
%
% Author:       Paul C. Anagnostopoulos
%               Windfall Software
%               978 371-2316
%               paul@windfall.com
%
% Created:      15 February 2005
%
%-----------------------------------------------------------------------------


\documentclass[preprint]{sigplanconf}

% The following \documentclass options may be useful:

% preprint      Remove this option only once the paper is in final form.
% 10pt          To set in 10-point type instead of 9-point.
% 11pt          To set in 11-point type instead of 9-point.
% authoryear    To obtain author/year citation style instead of numeric.

\usepackage{amsmath}
\usepackage{pgfplots}

\begin{document}

\special{papersize=8.5in,11in}
\setlength{\pdfpageheight}{\paperheight}
\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{PPoPP 2014}{February 15-19, 2014, Orlando, Florida, United States}
\copyrightyear{2013}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

% Uncomment one of the following two, if you are not going for the
% traditional copyright transfer agreement.

%\exclusivelicense                % ACM gets exclusive license to publish,
                                  % you retain copyright

%\permissiontopublish             % ACM gets nonexclusive license to publish
                                  % (paid open-access papers,
                                  % short abstracts)

\titlebanner{}        % These are ignored unless
\preprintfooter{Bitwise Data Parallel Grep}   % 'preprint' option specified.

\title{Bitwise Data Parallelism in Regular Expression Matching}
%\subtitle{Subtitle Text, if any}

\authorinfo{Anonymous Authors}{Institutions}{emails}
%\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
%          {Simon Fraser University}
%           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

\maketitle

\begin{abstract}
\input{abstract}
\end{abstract}
\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}

% general terms are not compulsory anymore,
% you may leave them out
%\terms
%term1, term2

\keywords
regular expression matching, grep, parallel bit stream technology

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the
NFA in worst case $O(mn)$ time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state at any time
in the matching process and
hence it is possible to search a text of length $n$ in $O(n)$ time.
However, it is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion, that is, the number of resultant DFA states may increase exponentially.
Baeza-Yates and Gonnet \cite{baeza1992new} proposed a text searching approach
based on bit-parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word. Given a $w$-bit word, their Shift-Or algorithm \cite{baeza1992new} uses the
bit-parallel approach to
simulate an NFA in $O(nm/w)$ worst-case time.
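
The Shift-Or recurrence is compact enough to sketch directly in C. The following illustration is ours, not drawn from the cited papers; the function name \verb:shift_or_search: and its restriction to patterns of at most 64 characters (one machine word) are assumptions of the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Shift-Or search: returns the end index of the first occurrence of
   pattern (nonempty, length m <= 64) in text, or -1 if there is none. */
static long shift_or_search(const char *pattern, const char *text) {
    size_t m = strlen(pattern), n = strlen(text);
    uint64_t B[256];
    /* B[c] has a 0 bit at position j iff pattern[j] == c. */
    for (int c = 0; c < 256; c++) B[c] = ~0ULL;
    for (size_t j = 0; j < m; j++) B[(unsigned char)pattern[j]] &= ~(1ULL << j);
    uint64_t D = ~0ULL;                 /* state: 0 bits mark live prefixes */
    for (size_t i = 0; i < n; i++) {
        D = (D << 1) | B[(unsigned char)text[i]];
        if ((D & (1ULL << (m - 1))) == 0) return (long)i;  /* full match */
    }
    return -1;
}
```

The single update per character, \verb:D = (D << 1) | B[c]:, is what gives the $O(nm/w)$ bound: all $m$ prefix states advance in one word-wide operation.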

A disadvantage of the bit-parallel Shift-Or pattern matching approach
in comparison to simple string matching algorithms is an inability to skip input characters.
For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skips input characters
to achieve sublinear times in the average case. Backward Dawg Matching
(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are also able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of Shift-Or with the character-skipping advantages of the BDM algorithm.
The nrgrep pattern matching tool is built over the BNDM algorithm,
and hence the name nrgrep \cite{navarro2000}.

There has been considerable interest in using parallelization techniques
to improve the performance of regular expression matching on parallel hardware
such as multicore processors (CPUs), graphics processing units (GPUs),
field-programmable gate arrays (FPGAs), and specialized architectures such as
the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed on multicore hardware.
In related work, Pasetto et al. presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited for parallel
implementation on multicore CPUs, GPUs and the Cell BE \cite{scarpazza2011top, naghmouchi2010}.
On each platform, both thread-level parallelism (additional cores) and data-level parallelism
(wide SIMD units) are leveraged for performance.
Salapura et al. advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multicore processors to achieve a speedup greater than 1.8 over a
range of data sizes of interest \cite{salapura2012accelerating}.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
that delivered 1.00--1.78 Gbps on a single
Cell BE chip and extended this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns, and a throughput of 3.3--3.4
Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieves
4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching on GPUs. Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps throughput, 14.74 times faster
than the AC algorithm performed on a four-core
multicore processor using OpenMP \cite{lin2013accelerating}.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.   We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.

There is also a close terminological connection between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: they represent all possible current NFA states
as a bit vector and perform parallel transitions to a new
set of states using table lookups and bitwise logic.    Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit additions with GPGPU SIMT.
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on the Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPGPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expression Notation and Grep}\label{sec:grep}

We follow common Posix notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.   Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.   The special characters must be escaped
with a backslash to prevent interpretation as a metacharacter; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated; thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation; thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms; thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this; thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene star \verb:*: specifies zero-or-more occurrences
matching the previous pattern, while \verb:+: specifies one-or-more
occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero-or-one occurrence of strings matching
the element.  Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.

The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match respectively the beginning or the
end of a line.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.

\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.   In essence, data streams are viewed
as (very large) integers.   The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.   Within each block all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.   On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.   In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
For our GPGPU implementation, we have developed a long-stream
addition technique that allows us to perform 4096-bit additions
using 64 threads working in lock-step SIMT fashion.  Using scripts
to modify the output of the Parabix tools, we effectively divide
the input into blocks of 4K bytes processed in a fully data-parallel
manner.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$B_7$ & \verb`.................................`\\
$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
$B_5$ & \verb`111111111111111111111111111111111`\\
$B_4$ & \verb`.11111...111....1....11.....111..`\\
$B_3$ & \verb`......11..11111.1111...11.....111`\\
$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
$B_1$ & \verb`...1....11.1....1........11.111..`\\
$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
\verb:[a]: & \verb`1..............1....1......1.....`\\
\verb:[z]: & \verb`...........1....1.............1..`\\
\verb:[0-9]: & \verb`.1111....11..........1......11...`
\end{tabular}

\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.   Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.   In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams in which bit $j$ specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.
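
To illustrate the transposition step, the following C sketch (our own illustration, not Parabix-generated code) transposes a block of up to 64 bytes into the eight basis bit streams and then computes the \verb:[0-9]: class stream purely with bitwise logic, using the fact that the ASCII digits are exactly the bytes \verb:0x30: through \verb:0x39:.

```c
#include <stdint.h>
#include <stddef.h>

/* Transpose up to 64 input bytes into 8 basis bit streams B[0..7],
   where bit j of B[i] is bit i of data[j], then compute the [0-9]
   character class stream from the basis streams alone. */
static uint64_t digit_class(const unsigned char *data, size_t len) {
    uint64_t B[8] = {0};
    for (size_t j = 0; j < len; j++)
        for (int i = 0; i < 8; i++)
            B[i] |= (uint64_t)((data[j] >> i) & 1) << j;
    /* '0'..'9' are 0x30..0x39: high nybble 0011, low nybble 0 through 9,
       i.e., bit 3 clear, or bit 3 set with bits 2 and 1 clear. */
    return ~B[7] & ~B[6] & B[5] & B[4] & (~B[3] | (~B[2] & ~B[1]));
}
```

Applied to the input data of Figure \ref{fig:streams}, the result reproduces the \verb:[0-9]: stream shown there. A production transposition would itself be done with SIMD operations rather than a bit-at-a-time loop; the loop here is only for clarity.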

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$M_1$ & \verb`.1..............1....1......1....`\\
$M_2$ & \verb`.11111..........1....11.....111..`\\
$M_3$ & \verb`.................1.............1.`
\end{tabular}

\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*z}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams.}  Now consider how bit-parallel data
streams can be used in regular expression matching.   Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*z:.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.  In this case there are three marker streams computed
during the match process, namely
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:), and finally $M_3$ the stream
marking positions after a final \verb:z: has been found.
Deferring for the moment the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Note our convention that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.


\paragraph*{MatchStar.}
MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$M_1$ & \verb`.1..............1....1......1....`\\
$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
\end{tabular}

\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, D)$}
\label{fig:matchstar}
\end{figure}

Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
conventional presentation of our character data input.   However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that when we perform bitstream addition, we will show bit movement from left-to-right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions: immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained, and the undesired 1 bits removed, by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.

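To make the marker-stream process concrete, the following C sketch carries out the full \verb:a[0-9]*z: match over a single block of up to 64 bytes.  It is our own illustration: character classes are formed here by simple byte comparisons rather than from basis bit streams, position $k$ of the input corresponds to bit $k$ of a 64-bit word, and advancing a marker is therefore a left shift.

```c
#include <stdint.h>
#include <stddef.h>

/* MatchStar(M, C) on one 64-bit block. */
static uint64_t matchstar(uint64_t m, uint64_t c) {
    return (((m & c) + c) ^ c) | m;
}

/* Marker-stream match of a[0-9]*z over one block of up to 64 bytes:
   returns a bitstream with a 1 just past each match. */
static uint64_t match_a_digits_z(const char *s) {
    uint64_t A = 0, D = 0, Z = 0;
    for (size_t k = 0; s[k] && k < 64; k++) {
        if (s[k] == 'a') A |= 1ULL << k;
        if (s[k] >= '0' && s[k] <= '9') D |= 1ULL << k;
        if (s[k] == 'z') Z |= 1ULL << k;
    }
    uint64_t M1 = A << 1;              /* positions just past an 'a' */
    uint64_t M2 = matchstar(M1, D);    /* after zero or more digits  */
    uint64_t M3 = (M2 & Z) << 1;       /* after a final 'z'          */
    return M3;
}
```

Applied to the input of Figure \ref{fig:streams}, the result has 1 bits at exactly the two positions marked in the $M_3$ stream of Figure \ref{fig:streams2}.
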
\input{compilation}

\input{re-Unicode}

\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  In this, we primarily rely on the Pablo
compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the block-at-a-time
model is the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.   Our GPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition.}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential addition using add-with-carry instructions, for example,
is far from ideal.

We have developed a technique using SIMD or SIMT methods for constant-time
long-stream addition up to 4096 bits.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item  \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value,
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(x)#: distributing the bits of
an $f$-bit mask, one bit each, to the $f$ 64-bit fields of a vector.
\end{itemize}

Given these operations, our method for long stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is the {\em bubble mask}.
\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar!
\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.   However, if this digit is all ones (as
signalled by the corresponding bit of bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.   This is just MatchStar!

\item Compute the final result {\tt Z}.
\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]

\end{enumerate}
\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 00} & {\tt 00} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long Stream Addition}\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process using 8-bit fields
rather than 64-bit fields, with
all field values shown in hexadecimal notation.  Note that
two of the individual 8-bit additions produce carries, while two
others produce {\tt FF} values that generate bubble bits.  The
net result is that four of the original 8-bit sums must be
incremented to produce the long stream result.
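
The six steps can be simulated in ordinary scalar C, modeling each SIMD operation with a loop over an array of fields.  The sketch below is our own illustration using 8-bit fields (as in Figure \ref{fig:longadd}) with field 0 least significant; the names {\tt long\_add} and {\tt F} are ours.

```c
#include <stdint.h>

#define F 8   /* number of fields; each field is 8 bits in this model */

/* Long-stream addition of two F x 8-bit values, least significant
   field first, following the six-step method in the text.  Carry out
   of the top field is simply discarded in this version. */
static void long_add(const uint8_t X[F], const uint8_t Y[F], uint8_t Z[F]) {
    uint8_t R[F];
    unsigned x = 0, y = 0, r = 0, b = 0;
    for (int k = 0; k < F; k++) {
        R[k] = (uint8_t)(X[k] + Y[k]);          /* field-wise sums   */
        x |= (unsigned)(X[k] >> 7) << k;        /* high-bit masks    */
        y |= (unsigned)(Y[k] >> 7) << k;
        r |= (unsigned)(R[k] >> 7) << k;
        b |= (unsigned)(R[k] == 0xFF) << k;     /* bubble mask       */
    }
    unsigned c = (x & y) | ((x | y) & ~r);      /* generated carries */
    unsigned m = c << 1;                        /* align carry-in    */
    unsigned i = (((m & b) + b) ^ b) | m;       /* MatchStar(c*2, b) */
    for (int k = 0; k < F; k++)
        Z[k] = (uint8_t)(R[k] + ((i >> k) & 1));
}
```

Running this on the {\tt X} and {\tt Y} values of Figure \ref{fig:longadd} reproduces the {\tt Z} row shown there.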

A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the
adder must take an additional carry-in bit
{\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
\[\text{\tt q} = \text{\tt i >> f}\]
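
The full adder variant fits the same scalar model.  In this sketch (again our own, with 8-bit fields), the carry-in {\tt p} is ORed into the low bit of the shifted carry mask before MatchStar, and the carry-out {\tt q} is read from bit $f$ of the increment mask, which is computed in a type wide enough to hold it.

```c
#include <stdint.h>

#define FA_F 8   /* fields per block; each field is 8 bits in this model */

/* Long-stream full adder: field-wise sums plus carry-in p, producing
   carry-out *q, per i = MatchStar(c*2+p, b) and q = i >> f. */
static void long_adc(const uint8_t X[FA_F], const uint8_t Y[FA_F],
                     uint8_t Z[FA_F], unsigned p, unsigned *q) {
    uint8_t R[FA_F];
    unsigned x = 0, y = 0, r = 0, b = 0;
    for (int k = 0; k < FA_F; k++) {
        R[k] = (uint8_t)(X[k] + Y[k]);
        x |= (unsigned)(X[k] >> 7) << k;
        y |= (unsigned)(Y[k] >> 7) << k;
        r |= (unsigned)(R[k] >> 7) << k;
        b |= (unsigned)(R[k] == 0xFF) << k;
    }
    unsigned c = (x & y) | ((x | y) & ~r);
    unsigned m = (c << 1) | p;                 /* carry-in at bit 0  */
    unsigned i = (((m & b) + b) ^ b) | m;      /* MatchStar          */
    for (int k = 0; k < FA_F; k++)
        Z[k] = (uint8_t)(R[k] + ((i >> k) & 1));
    *q = (i >> FA_F) & 1;                      /* carry-out          */
}
```

As a check of the bubbling behavior: adding an all-{\tt FF} block to an all-zero block with a carry-in of 1 yields an all-zero block with a carry-out of 1.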

As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to additional levels.  Using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT-style coordination at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction
set extensions to support long-stream addition could be added for
future SIMD and GPU processors.   Given the fundamental nature
of addition as a primitive and its novel application to regular
expression matching as shown herein, it seems reasonable to expect
such instructions to become available.


\input{sse2}

\input{analysis}

\input{avx2}

\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPGPU version
in OpenCL.
We arranged for 64 work groups each having 64
threads.  Input files are divided in data-parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.  Figure \ref{fig:GPUadd} shows our implementation
of long-stream addition on the GPU.  Each thread maintains
its own carry and bubble values and performs synchronized
updates with the other threads using a six-step parallel-prefix
style process.

Our GPU test machine was an AMD A10-5800K APU with Radeon(tm) HD Graphics,
having a processor speed of 4.30 GHz and 32.0 GB of memory.

\begin{figure*}[tbh]
\begin{center}
\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry,
                    __local BitBlock *bubble, BitBlock *group_carry, const int carryno){
        BitBlock carry_mask;
        BitBlock bubble_mask;

        BitBlock partial_sum = a+b;
        BitBlock gen = a&b;
        BitBlock prop = a^b;
        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);

        barrier(CLK_LOCAL_MEM_FENCE);
        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
                carry[idx] = carry[idx]|carry[idx^offset];
                bubble[idx] = bubble[idx]|bubble[idx^offset];
                barrier(CLK_LOCAL_MEM_FENCE);
        }

        carry_mask = (carry[0]<<1)|group_carry[carryno];
        bubble_mask = bubble[0];

        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
        BitBlock inc = s | (s-carry_mask);
        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
        return rslt;
}
\end{verbatim}
\end{center}
\caption{OpenCL 4096-bit Addition}
\label{fig:GPUadd}
\end{figure*}

Figure \ref{fig:SSE-AVX-GPU} compares the performance of
our SSE2, AVX2 and GPU implementations.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URIorEmail,xquote},
tick label style={font=\tiny},
enlarge x limits=0.15,
enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
\addplot
file {data/ssetime.dat};
\addplot
file {data/avxtime.dat};
\addplot
file {data/gputime.dat};

\legend{SSE2,AVX2,GPU}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}


\input{conclusion}

%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

\acks

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{abbrvnat}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}

\end{document}