% source: docs/Working/re/re-main.tex @ 3617, checked in by cameron

%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
% http://www.springer.com/lncs       Springer Heidelberg 2006/05/04
%
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
%
% NB: the document class 'llncs' has its own and detailed documentation, see
% ftp://ftp.springer.de/data/pubftp/pub/tex/latex/llncs/latex2e/llncsdoc.pdf
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\documentclass[runningheads,a4paper]{llncs}

\usepackage{amssymb}
\usepackage{amsmath}
\setcounter{tocdepth}{3}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{pgfplots}

\usepackage{url}
\urldef{\mails}\path|{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram}@cs.sfu.ca|
\newcommand{\keywords}[1]{\par\addvspace\baselineskip
\noindent\keywordname\enspace\ignorespaces#1}

\begin{document}

\mainmatter  % start of an individual contribution

% first the title is needed
\title{Bitwise Data Parallelism in Regular Expression Matching}

% a short form should be given in case it is too long for the running head
\titlerunning{Bitwise Data Parallel Grep}

% the name(s) of the author(s) follow(s) next
%
% NB: Chinese authors should write their first name(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
%
\author{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
%          {Simon Fraser University}
%           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

%
\authorrunning{Cameron et al.}
% (feature abused for this document to repeat the title also on left hand pages)

% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
\institute{Simon Fraser University\\
\mails\\
}

%
% NB: a more complex sample for affiliations and the mapping to the
% corresponding authors can be found in the file "llncs.dem"
% (search for the string "\mainmatter" where a contribution starts).
% "llncs.dem" accompanies the document class "llncs.cls".
%

\toctitle{Bitwise Data Parallelism in Regular Expression Matching}
\tocauthor{Cameron et al.}
\maketitle




\begin{abstract}
\input{abstract}
\end{abstract}

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is converted
to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
NFA in worst case $O(mn)$ time. Often, a more efficient choice
is to convert an NFA into a DFA. A DFA has a single active state at any time
throughout the matching process and
hence it is possible to search a text of length $n$ in $O(n)$ time%
\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion. That is, the number of resultant DFA states may increase exponentially.}.

A significant portion of the research in fast regular expression matching can be
regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet
presented a new approach to string search based on bit-parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word.
Given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor $w$.
Using this fact, the Shift-Or algorithm simulates an NFA using
bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
A disadvantage of the bit-parallel Shift-Or pattern matching approach
is an inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
to achieve sublinear times in the average case.
Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
based on suffix automata are also able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of the Shift-Or approach
with the character-skipping property of BDM algorithms.  The nrgrep pattern matching tool
is built over the BNDM algorithm.  Prior to the data-parallel approach to simultaneous
processing of data stream elements as discussed herein, nrgrep
was by far the fastest grep tool
for matching complex patterns and achieved similar performance
to that of the fastest existing string
matching tools for simple patterns \cite{navarro2000}.

There has been considerable
interest in accelerating regular expression matching
on parallel hardware
such as multi-core processors (CPUs), graphics processing units (GPUs),
field-programmable gate arrays (FPGAs), and specialized architectures such as
the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed on multicore hardware.
In related work, Pasetto et al. presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited for parallel
implementation on multi-core CPUs, GPUs and the Cell BE.
On each platform, both thread-level parallelism (cores) and data-level parallelism
(SIMD units) were leveraged for performance.
Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multi-core processors to achieve a speedup of greater than 1.8 over a
range of data sizes of interest.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
that delivered 1.00--1.78 Gbps on a single
Cell BE chip and extended this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
Gbps for a larger dictionary of thousands of patterns.  Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieved
4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching on GPUs.  Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps throughput, 14.74 times faster
than the AC algorithm performed on a four-core
multi-core processor using OpenMP \cite{lin2013accelerating}.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.  We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.  We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.

There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.  Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit addition with GPGPU SIMT.
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on the Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPGPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expression Notation and Grep}\label{sec:grep}

We follow common POSIX notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.  Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.  The special characters must be escaped
with a backslash to prevent interpretation as metacharacters; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated; thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation; thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms; thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this; thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene star \verb:*: specifies zero-or-more occurrences
of strings matching the previous pattern, while \verb:+: specifies one-or-more
occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero-or-one occurrence of strings matching
the element.  Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.

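As a quick illustration of the notations above (a hedged example of ours, using Python's re module, whose syntax for these particular constructs coincides with the POSIX notation; the paper's evaluation itself uses grep-family tools):

```python
# Illustration of the regular expression notations described above,
# using Python's re module (its syntax agrees with POSIX notation for
# these constructs). fullmatch requires the whole string to match.
import re

assert re.fullmatch(r'[cd][ao][tg]', 'dog')       # concatenation of classes
assert re.fullmatch(r'cat|dog', 'cat')            # alternation
assert re.fullmatch(r'(ab|cd)[0-9]', 'cd7')       # parentheses override precedence
assert re.fullmatch(r'[a-z]+', 'cat')             # one-or-more repetition
assert re.fullmatch(r'(ab){3}', 'ababab')         # exact repetition count
assert re.fullmatch(r'[0-9A-Fa-f]{2,4}', '1a2B')  # two to four hex digits
assert not re.fullmatch(r'[^0-9]', '7')           # negated character class
print('all notation examples hold')
```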
The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match respectively the beginning or the
end of a line.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command-line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.

\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.  In essence, data streams are viewed
as (very large) integers.  The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.  Within each block all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.  On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.  In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
For our GPGPU implementation, we have developed a long-stream
addition technique that allows us to perform 4096-bit additions
using 64 threads working in lock-step SIMT fashion.  Using scripts
to modify the output of the Parabix tools, we effectively divide
the input into blocks of 4K bytes processed in a fully data-parallel
manner.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$B_7$ & \verb`.................................`\\
$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
$B_5$ & \verb`111111111111111111111111111111111`\\
$B_4$ & \verb`.11111...111....1....11.....111..`\\
$B_3$ & \verb`......11..11111.1111...11.....111`\\
$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
$B_1$ & \verb`...1....11.1....1........11.111..`\\
$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
\verb:[a]: & \verb`1..............1....1......1.....`\\
\verb:[z]: & \verb`...........1....1.............1..`\\
\verb:[0-9]: & \verb`.1111....11..........1......11...`
\end{tabular}

\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.  Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.  However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.  In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.  These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams in which bit $j$ specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.

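To make the transposition concrete, the following sketch (our own Python model, not the Parabix implementation) computes the basis streams and the \verb:[0-9]: class stream for the example of Figure \ref{fig:streams}, representing each stream as an integer whose bit $j$ corresponds to input position $j$:

```python
# A sketch (ours, not the Parabix code) of byte-stream transposition and
# character class formation. Each stream is a Python integer whose bit j
# corresponds to input position j.
def transpose(data: bytes):
    """Basis streams B[0]..B[7]: bit j of B[i] is bit i of byte j."""
    basis = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if byte >> i & 1:
                basis[i] |= 1 << j
    return basis

def marks(stream: int, n: int) -> str:
    """Render a stream in the paper's notation: '.' for 0, '1' for 1."""
    return ''.join('1' if stream >> j & 1 else '.' for j in range(n))

data = b'a4534q--b29z---az---a4q--bca22z--'
B = transpose(data)
# [0-9] is 0x30-0x39: high nibble 0011 and low nibble below ten,
# i.e., not (bit3 and (bit2 or bit1)).
digits = ~B[7] & ~B[6] & B[5] & B[4] & ~(B[3] & (B[2] | B[1]))
digits &= (1 << len(data)) - 1    # confine to the input length
print(marks(B[0], len(data)))     # 1.11.111..1.1111.1111.111.11...11
print(marks(digits, len(data)))   # .1111....11..........1......11...
```

The two printed lines reproduce the $B_0$ and \verb:[0-9]: rows of Figure \ref{fig:streams}.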
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$M_1$ & \verb`.1..............1....1......1....`\\
$M_2$ & \verb`.11111..........1....11.....111..`\\
$M_3$ & \verb`.................1.............1.`
\end{tabular}

\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*z}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams.}  Now consider how bit-parallel data
streams can be used in regular expression matching.  Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*z:.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.  In this case there are three marker streams computed
during the match process, namely,
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:) and finally $M_3$ the stream
marking positions after a final \verb:z: has been found.
Leaving aside for now the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Note our convention that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.

\paragraph*{MatchStar.}
MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$M_1$ & \verb`.1..............1....1......1....`\\
$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
\end{tabular}

\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, D)$}
\label{fig:matchstar}
\end{figure}


Figure \ref{fig:matchstar} illustrates the MatchStar method.  In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
conventional presentation of our character data input.  However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that when we perform bitstream addition, we will show bit movement from left to right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions: immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were unaffected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream.  $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.
\input{compilation}

\input{re-Unicode}

\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  In this, we primarily rely on the Pablo
compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the block-at-a-time
model is the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.  Our GPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition.}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 bits or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential addition using add-with-carry instructions, for example,
is far from ideal.

We have developed a technique using SIMD or SIMT methods for constant-time
long-stream addition up to 4096 bits.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item  \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value,
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(x)#: distributing the bits of
an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
\end{itemize}

Given these operations, our method for long-stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is the {\em bubble mask}.
\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar!
\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.  However, if this digit is all ones (as
signalled by the corresponding bit of the bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.  This is just MatchStar!

\item Compute the final result {\tt Z}.
\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]

\end{enumerate}
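The six steps can be simulated directly.  The following sketch (ours, using $f = 8$ fields of 8 bits as in the example of Figure \ref{fig:longadd}; the helper functions model the \verb#simd<64># operations but are not the actual library) reproduces the figure's result:

```python
# Simulation of the six long-stream addition steps with f = 8 fields of
# 8 bits each (the real implementation uses 64-bit fields). Field 0 is
# the least significant; helper names are ours, not the simd<64> API.
F, W = 8, 8
TOP = (1 << W) - 1      # an all-ones field
FMASK = (1 << F) - 1    # an f-bit mask

def long_add(X, Y):
    R = [(a + b) & TOP for a, b in zip(X, Y)]                  # 1: field sums
    def hmask(V):                                              # hsimd::mask analogue
        return sum((v >> (W - 1) & 1) << k for k, v in enumerate(V))
    x, y, r = hmask(X), hmask(Y), hmask(R)                     # 2: high-bit masks
    c = ((x & y) | ((x | y) & ~r)) & FMASK                     # 3: carry-out mask
    b = sum((v == TOP) << k for k, v in enumerate(R))          # 4: bubble mask
    c2 = c * 2 & FMASK                                         # carries move up a field
    i = ((c2 & b) + b ^ b) | c2                                # 5: i = MatchStar(c*2, b)
    return [(v + (i >> k & 1)) & TOP for k, v in enumerate(R)]  # 6: spread and add

# The example of the long-stream addition figure, least significant field first.
X = [0xF1, 0x21, 0x45, 0x3D, 0x4C, 0xBA, 0x31, 0x19]
Y = [0x36, 0x17, 0x16, 0xE2, 0xB3, 0x45, 0x12, 0x22]
Z = long_add(X, Y)
print(' '.join(f'{v:02X}' for v in reversed(Z)))  # 3B 44 00 00 1F 5B 39 27
```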
\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long Stream Addition}\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process using
8-bit fields rather than 64-bit fields,
with all field values shown in hexadecimal notation.  Note that
two of the individual 8-bit additions produce carries, while two
others produce {\tt FF} values that generate bubble bits.  The
net result is that four of the original 8-bit sums must be
incremented to produce the long-stream result.

A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the
adder must take an additional carry-in bit
{\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
\[\text{\tt q} = \text{\tt i >> f}\]

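A small sketch of ours, with $f = 4$ and hypothetical mask values, shows the carry-in and carry-out handling:

```python
# Sketch of the full-adder extension with f = 4 fields and hypothetical
# mask values: the carry-in p enters at the low bit of the increment
# mask, and the carry-out q leaves from bit f.
def match_star(m: int, c: int) -> int:
    return ((m & c) + c ^ c) | m

f = 4
c = 0b1000   # outgoing carry from the top field
b = 0b0011   # the two low fields are all ones (bubble fields)
p = 1        # incoming carry
i = match_star(c * 2 + p, b)
q = i >> f               # carry-out of the whole block
i &= (1 << f) - 1        # increment mask for this block's fields
print(f'{i:04b}', q)     # 0111 1: p bubbles through the two low fields
```

Here the incoming carry increments field 0, bubbles through the all-ones fields 0 and 1 into field 2, while the top field's own carry exits as {\tt q}.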
As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to additional levels.  Using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT-style coordination at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction
set extensions to support long-stream addition could be added for
future SIMD and GPU processors.  Given the fundamental nature
of addition as a primitive and its novel application to regular
expression matching as shown herein, it seems reasonable to expect
such instructions to become available.



\input{sse2}

\input{analysis}

\input{avx2}



\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPGPU version
in OpenCL.
We arranged for 64 work groups each having 64
threads.  Input files are divided in data-parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.  Figure \ref{fig:GPUadd} shows our implementation
of long-stream addition on the GPU.  Each thread maintains
its own carry and bubble values and performs synchronized
updates with the other threads using a six-step parallel-prefix
style process.

Our GPU test machine was an AMD A10-5800K APU with Radeon(tm) HD Graphics,
having a processor speed of 4.30 GHz and 32.0 GB of memory.


\begin{figure*}[tbh]
\begin{center}\small
\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry,
                    __local BitBlock *bubble, BitBlock *group_carry, const int carryno){
        BitBlock carry_mask;
        BitBlock bubble_mask;

        BitBlock partial_sum = a+b;
        BitBlock gen = a&b;
        BitBlock prop = a^b;
        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);

        barrier(CLK_LOCAL_MEM_FENCE);
        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
                carry[idx] = carry[idx]|carry[idx^offset];
                bubble[idx] = bubble[idx]|bubble[idx^offset];
                barrier(CLK_LOCAL_MEM_FENCE);
        }

        carry_mask = (carry[0]<<1)|group_carry[carryno];
        bubble_mask = bubble[0];

        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
        BitBlock inc = s | (s-carry_mask);
        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
        return rslt;
}
\end{verbatim}

\end{center}
\caption{OpenCL 4096-bit Addition}
\label{fig:GPUadd}
\end{figure*}

Figure \ref{fig:SSE-AVX-GPU} compares the performance of
our SSE2, AVX2 and GPU implementations.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URIorEmail,xquote},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
\addplot
file {data/ssetime.dat};
\addplot
file {data/avxtime.dat};
\addplot
file {data/gputime.dat};

\legend{SSE2,AVX2,GPU}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}




\input{conclusion}



%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.


\paragraph*{Acknowledgments.}
This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{splncs}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}


\end{document}