1%-----------------------------------------------------------------------------
2%
3%               Template for sigplanconf LaTeX Class
4%
5% Name:         sigplanconf-template.tex
6%
7% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
8%               file for SIGPLAN conference proceedings.
9%
10% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
11%               sigplanconf-guide.pdf
12%
13% Author:       Paul C. Anagnostopoulos
14%               Windfall Software
15%               978 371-2316
16%               paul@windfall.com
17%
18% Created:      15 February 2005
19%
20%-----------------------------------------------------------------------------
21
22
23\documentclass[preprint]{sigplanconf}
24
25% The following \documentclass options may be useful:
26
27% preprint      Remove this option only once the paper is in final form.
28% 10pt          To set in 10-point type instead of 9-point.
29% 11pt          To set in 11-point type instead of 9-point.
30% authoryear    To obtain author/year citation style instead of numeric.
31
32\usepackage{amsmath}
33\usepackage{pgfplots}
34
35\begin{document}
36
37\special{papersize=8.5in,11in}
38\setlength{\pdfpageheight}{\paperheight}
39\setlength{\pdfpagewidth}{\paperwidth}
40
41\conferenceinfo{PPoPP 2014}{February 15-19, 2014, Orlando, Florida, United States} 
42\copyrightyear{2013} 
43\copyrightdata{978-1-nnnn-nnnn-n/yy/mm} 
44\doi{nnnnnnn.nnnnnnn}
45
46% Uncomment one of the following two, if you are not going for the
47% traditional copyright transfer agreement.
48
49%\exclusivelicense                % ACM gets exclusive license to publish,
50                                  % you retain copyright
51
52%\permissiontopublish             % ACM gets nonexclusive license to publish
53                                  % (paid open-access papers,
54                                  % short abstracts)
55
56\titlebanner{}        % These are ignored unless
57\preprintfooter{Bitwise Data Parallel Grep}   % 'preprint' option specified.
58
59\title{Bitwise Data Parallelism in Regular Expression Matching}
60%\subtitle{Subtitle Text, if any}
61
62\authorinfo{Anonymous Authors}{Institutions}{emails}
63%\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
64%          {Simon Fraser University}
65%           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}
66
67\maketitle
68
69\begin{abstract}
70\input{abstract}
71\end{abstract}
72\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
73\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}
74
75% general terms are not compulsory anymore,
76% you may leave them out
77%\terms
78%term1, term2
79
80\keywords
81regular expression matching, grep, parallel bit stream technology
82
83\section{Introduction}
84
85The use of regular expressions to search texts for occurrences
86of string patterns has a long history and
87remains a pervasive technique throughout computing applications today.
88% {\em a brief history}
89The origins of regular expression matching date back to automata theory
90developed by Kleene in the 1950s \cite{kleene1951}.
91Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
92to nondeterministic finite automata (NFA).
93Following Thompson's approach, a regular expression of length $m$ is converted
94to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
95NFA in worst case $O(mn)$ time. Often, a more efficient choice
96is to convert an NFA into a DFA. A DFA has a single active state at any time
97throughout the matching process and
hence it is possible to search a text of length $n$ in $O(n)$ time
99\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result
100in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
101
102A significant portion of the research in fast regular expression matching can be
103regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
104In \cite{baeza1992new}, Baeza-Yates and Gonnet
105presented a new approach to string search based on bit-parallelism.
106This technique takes advantage of the intrinsic parallelism of bitwise operations
107within a computer word.
Given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
110Using this fact, the Shift-Or algorithm simulates an NFA using
111bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
112A disadvantage of the bit-parallel Shift-Or pattern matching approach
113is an inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast, horspool1980practical}, skip input characters
to achieve sublinear times in the average case.
117Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text} 
118based on suffix automata are able to skip characters.
119The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
120combines the bit-parallel advantages of the Shift-Or approach
with the character skipping property of BDM algorithms. The nrgrep pattern matching tool
is built over the BNDM algorithm. Prior to the data-parallel approach to simultaneous
123processing of data stream elements as discussed herein, nrgrep
124was by far the fastest grep tool
125for matching complex patterns and achieved similar performance
126to that of the fastest existing string
127matching tools for simple patterns \cite{navarro2000}.
128
129There has been considerable
130interest in accelerating regular expression matching
131on parallel hardware
132such as multi-core processors (CPUs), graphics processing units (GPUs),
133field-programmable gate arrays (FPGAs), and specialized architectures such as
134the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
135%CPU
136Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
137text processing algorithms that exhibit irregular memory access patterns
138can be efficiently executed on multicore hardware.
139In related work, Pasetto et al. presented a flexible tool that
140performs small-ruleset regular expression matching at a rate of
1412.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
142Naghmouchi et al. \cite{scarpazza2011top, naghmouchi2010} demonstrated that the Aho-Corasick (AC)
143string matching algorithm \cite{aho1975} is well suited for parallel
144implementation on multi-core CPUs, GPUs and the Cell BE.
On each platform, both thread-level parallelism (cores) and data-level parallelism
(SIMD units) were leveraged for performance.
Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multi-core processors to achieve a speedup of greater than 1.8 over a
range of data sizes of interest.
151%Cell
152In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
153that delivered 1.00--1.78 Gbps on a single
154Cell BE chip and extended this approach for emulation on the Intel Larrabee
155instruction set \cite{scarpazza2009larrabee}.
156On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
157implementation that delivered a throughput of 40
158Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
159Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
160presented a string matching implementation for automata that achieved
1614 Gbps on the Cell BE.
162% GPU
163In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
164implementation of the AC algorithm for
accelerating string matching on GPUs. Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps throughput, 14.74 times faster
than the AC algorithm running on a four-core
multi-core processor using OpenMP \cite{lin2013accelerating}.
171
172Whereas the existing approaches to parallelization have been
173based on adapting traditional sequential algorithms to emergent
174parallel architectures, we introduce both a new algorithmic
175approach and its implementation on SIMD and GPU architectures.
176This approach relies on a bitwise data parallel view of text
177streams as well as a surprising use of addition to match
178runs of characters in a single step.  The closest previous
179work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
180technology together with a parallel scanning primitive also
181based on addition \cite{cameron2011parallel}.   
182However, in contrast to the deterministic, longest-match
183scanning associated with the ScanThru primitive of that
184work, we introduce here a new primitive MatchStar
185that can be used in full generality for nondeterministic
186regular expression matching.   We also introduce a long-stream
187addition technique involving a further application of MatchStar
188that enables us to scale the technique to $n$-bit addition
189in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
190for example, to perform
191synchronized 4096-bit addition on GPU warps of 64 threads.
192
There is also a strong terminological connection between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
198However those algorithms use bit-parallelism in a fundamentally
199different way: representing all possible current NFA states
200as a bit vector and performing parallel transitions to a new
201set of states using table lookups and bitwise logic.    Whereas
202our approach can match multiple characters per step, bit-parallel
203NFA algorithms proceed through the input one byte at a time.
204Nevertheless, the agrep \cite{wu1992agrep} and
205nrgrep \cite{navarro2000} programs implemented using these techniques remain
206among the strongest competitors in regular expression matching
207performance, so we include them in our comparative evaluation.
208
209The remainder of this paper is organized as follows.
210Section \ref{sec:grep} briefly describes regular expression
211notation and the grep problem.
212Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
213using a model of arbitrary-length bit-parallel data streams.
214Section \ref{sec:blockwise} discusses the block-by-block
215implementation of our techniques including the long stream
216addition techniques for 256-bit addition with AVX2 and
2174096-bit additions with GPGPU SIMT.
218Section \ref{sec:SSE2} describes our overall SSE2 implementation
219and carries out a performance study in comparison with
220existing grep implementations.
221Given the dramatic variation in grep performance across
222different implementation techniques, expressions and data sets,
223Section \ref{sec:analysis} considers a comparison between
224the bit-stream and NFA approaches from a theoretical perspective.
225Section \ref{sec:AVX2} then examines and demonstrates
226the scalability of our
227bitwise data-parallel approach in moving from
228128-bit to 256-bit SIMD on Intel Haswell architecture.
229To further investigate scalability, Section \ref{sec:GPU} 
230addresses the implementation of our matcher using
231groups of 64 threads working together SIMT-style on a GPGPU system.
232Section \ref{sec:Concl} concludes the paper with a discussion of results and
233areas for future work.
234
235\section{Regular Expression Notation and Grep}\label{sec:grep}
236
We follow common POSIX notation for regular expressions.
238A regular expression specifies a set of strings through
239a pattern notation.   Individual characters normally
240stand for themselves, unless they are one of the
241special characters \verb:*+?[{\(|^$.: that serve as metacharacters
242of the notation system.  Thus the regular expression \verb:cat:
243is a pattern for the set consisting of the single 3-character
244string ``\verb:cat:''.   The special characters must be escaped
with a backslash to prevent interpretation as a metacharacter; thus
\verb:\$: represents the dollar-sign and \verb:\\\\: represents
the string consisting of two backslash characters.
248Character class bracket expressions are pattern elements
249that allow any character in a given class to be used in a particular
250context.  For example, \verb:[@#%]: is a regular expression
251that stands for any of the three given symbols.  Contiguous
252ranges of characters may be specified using hyphens;
253for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
254for any alphanumeric character or underscore.  If the
255caret character immediately follows the opening bracket,
256the class is negated, thus \verb:[^0-9]: stands for
257any character except a digit.  The period metacharacter
258\verb:.: stands for the class of all characters.
259
260Consecutive pattern elements stand for strings formed by
261concatenation, thus \verb:[cd][ao][tg]: stands for the
262set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms; thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
267used to change this, thus \verb:(ab|cd)[0-9]: stands for any
268digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
269
270Repetition operators may be appended to a pattern to specify
271a variable number of occurrences of that pattern. 
272The Kleene \verb:*: specifies zero-or-more occurrences
matching the previous pattern, while \verb:+: specifies one-or-more
occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
275both specify the same set: strings of at least one lower-case
276letter.  The postfix operator \verb:?: specifies an optional
277component, i.e., zero-or-one occurrence of strings matching
278the element.  Specific bounds may be given within braces:
279\verb:(ab){3}: specifies the string ``\verb:ababab:'',
280\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
281or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
282strings of at least 4 consecutive capital letters.
283
284The grep program searches a file for lines containing matches
285to a regular expression using any of the above notations.
286In addition, the pattern elements \verb:^: and \verb:$:
287may be used to match respectively the beginning or the
288end of a line.  In line-based tools such as grep, \verb:.:
289matches any character except newlines; matches cannot extend
290over lines.
291Normally, grep prints all matching
292lines to its output.  However, grep programs typically
293allow a command line flag such as \verb:-c: to specify
294that only a count of matching lines be produced; we use
295this option in our experimental evaluation to focus
296our comparisons on the performance of the underlying matching
297algorithms.
298
299\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
300
301Whereas the traditional approaches to regular expression matching
302using NFAs, DFAs or backtracking all rely on a byte-at-a-time
303processing model, the approach  we introduce in this paper is based
304on quite a different concept:  a data-parallel approach to simultaneous
305processing of data stream elements.  Indeed, our most abstract model
306is that of unbounded data parallelism: processing all elements of
307the input data stream simultaneously.   In essence, data streams are viewed
308as (very large) integers.   The fundamental operations are bitwise
309logic, stream shifting and long-stream addition.
310
311Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.   Within each block, all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.   On commodity
316Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
317we typically process input streams 128 bytes at a time.   In this
318case, we rely on the Parabix tool chain \cite{lin2012parabix}
319to handle the details of compilation to block-by-block processing.
320For our GPGPU implementation, we have developed a long-stream
321addition technique that allows us to perform 4096-bit additions
322using 64 threads working in lock-step SIMT fashion.  Using scripts
323to modify the output of the Parabix tools, we effectively divide
324the input into blocks of 4K bytes processed in a fully data-parallel
325manner.
326
327\begin{figure}[tbh]
328\begin{center}
329\begin{tabular}{cr}\\
330input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
331$B_7$ & \verb`.................................`\\
332$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
333$B_5$ & \verb`111111111111111111111111111111111`\\
334$B_4$ & \verb`.11111...111....1....11.....111..`\\
335$B_3$ & \verb`......11..11111.1111...11.....111`\\
336$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
337$B_1$ & \verb`...1....11.1....1........11.111..`\\
338$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
339\verb:[a]: & \verb`1..............1....1......1.....`\\
340\verb:[z]: & \verb`...........1....1.............1..`\\
341\verb:[0-9]: & \verb`.1111....11..........1......11...`
342\end{tabular}
343
344\end{center}
345\caption{Basis and Character Class Streams}
346\label{fig:streams}
347\end{figure}
348
349A key concept in this streaming approach is the derivation of bit streams
350that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
stream.   Typically, the input stream is a byte stream comprising
353the 8-bit character code units of a particular encoding such
354as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
355easily be used with wider code units such as the 16-bit code units of
356UTF-16.   In the case of a byte stream, the first step is to transpose
357the byte stream into eight parallel bit streams, such that bit stream
358$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
359a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams, in which each bit $j$ of the stream specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
364input byte stream in ASCII, the eight basis bit streams of the
365transposed representation, and the character class bit streams
366\verb:[a]:,
367\verb:[z]:, and
368\verb:[0-9]:
369that may be computed from the basis bit streams using bitwise logic.
370Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
371Transposition and character class construction are straightforward
372using the Parabix tool chain \cite{lin2012parabix}.
373
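To make the transposition and character class steps concrete, the following
scalar C++ sketch (ours, for illustration only; the Parabix tool chain performs
transposition with SIMD operations rather than a bit-at-a-time loop, and the
function names here are not Parabix APIs) handles a single 64-byte block.
\begin{verbatim}
#include <cstdint>
#include <cstring>

// Transpose one 64-byte block into eight basis bit
// streams: bit j of basis[i] holds bit i of data[j].
void transpose64(const uint8_t data[64],
                 uint64_t basis[8]) {
  std::memset(basis, 0, 8 * sizeof(uint64_t));
  for (int j = 0; j < 64; j++)
    for (int i = 0; i < 8; i++)
      basis[i] |= (uint64_t)((data[j] >> i) & 1) << j;
}

// [0-9] comprises bytes 0x30-0x39: high nibble 0011 and
// a low nibble that is not in the range 1010-1111.
uint64_t cc_digit(const uint64_t b[8]) {
  uint64_t hi = ~b[7] & ~b[6] & b[5] & b[4];
  uint64_t lo = ~(b[3] & (b[2] | b[1]));
  return hi & lo;
}
\end{verbatim}
In the Parabix tool chain, equivalent logic is generated automatically from
character class definitions rather than written by hand.
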
374\begin{figure}[tbh]
375\begin{center}
376\begin{tabular}{cr}\\
377input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
378$M_1$ & \verb`.1..............1....1......1....`\\
379$M_2$ & \verb`.11111..........1....11.....111..`\\
380$M_3$ & \verb`.................1.............1.`
381\end{tabular}
382
383\end{center}
384\caption{Marker Streams in Matching {\tt a[0-9]*z}}
385\label{fig:streams2}
386\end{figure}
387
388\paragraph*{Marker Streams.}  Now consider how bit-parallel data
389streams can be used in regular expression matching.   Consider
390the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
392the regular expression \verb:a[0-9]*z:.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
395overall process.  In this case there are three marker streams computed
396during the match process, namely,
397$M_1$ representing match positions after an initial \verb:a:
398character has been found, $M_2$ representing positions
399reachable from positions marked by $M_1$ by further matching zero or
400more digits (\verb:[0-9]*:) and finally $M_3$ the stream
401marking positions after a final \verb:z: has been found.
Leaving aside for the moment the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
404of these streams should be for our example matching problem.
405Note our convention that a marker stream contains a 1 bit
406at the next character position to be matched, that is,
407immediately past the last position that was matched.
408
409
410\paragraph*{MatchStar.}
411MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
412
413\begin{figure}[tbh]
414\begin{center}
415\begin{tabular}{cr}\\
416input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
417$M_1$ & \verb`.1..............1....1......1....`\\
418$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
419$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
420$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
421$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
422$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
423\end{tabular}
424
425\end{center}
426\caption{$M_2 = \text{MatchStar}(M_1, D)$}
427\label{fig:matchstar}
428\end{figure}
429
430
431Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
432it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
433conventional presentation of our character data input.   However, this reverses the normal
434order of presentation when considering bitstreams as numeric values.  The key point here is
435that when we perform bitstream addition, we will show bit movement from left-to-right.
436For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
437
438The first row of the figure is the input data,
439the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.
441
In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not also character class positions.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions: immediately following each run of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its run of character class positions.  Character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output; these positions are recovered, and the undesired 1 bits removed, by XORing with the character class stream.  $T_2$ is now missing only the marker positions that were removed in the first step as well as the marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.
445
446In general, given a marker stream $M$ and a character class stream $C$,
447the operation of MatchStar is defined by the following equation. 
\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.
452
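On a single 64-bit block, MatchStar reduces to three bitwise operations and
one addition.  The following sketch is ours for illustration (a full
implementation applies the same operations to arbitrary-length streams, with
carries propagated across blocks as described in Section \ref{sec:blockwise});
the \verb:advance: helper, which shifts a marker stream forward by one
position, is a name introduced here for exposition.  The sketch also shows one
way the marker streams of Figure \ref{fig:streams2} may be composed for
\verb:a[0-9]*z:.
\begin{verbatim}
#include <cstdint>

// MatchStar(M, C) = (((M & C) + C) ^ C) | M per block.
inline uint64_t MatchStar(uint64_t M, uint64_t C) {
  return (((M & C) + C) ^ C) | M;
}

// Shift a marker stream forward one position (illustrative;
// block boundaries need a carried-in bit in practice).
inline uint64_t advance(uint64_t M) { return M << 1; }

// Marker streams for a[0-9]*z, given the character class
// streams for [a], [0-9] and [z] of the example above.
uint64_t match_a_digits_z(uint64_t cc_a,
                          uint64_t cc_digit,
                          uint64_t cc_z) {
  uint64_t M1 = advance(cc_a);             // after an 'a'
  uint64_t M2 = MatchStar(M1, cc_digit);   // through [0-9]*
  uint64_t M3 = advance(M2 & cc_z);        // after final 'z'
  return M3;
}
\end{verbatim}
Bit $j$ of a word corresponds to position $j$ of the input, so the
left-to-right bit movement shown in the figures corresponds to ordinary
addition and left shifts on these words.
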
453\input{compilation}
454
455\input{re-Unicode}
456
457\section{Block-at-a-Time Processing}\label{sec:blockwise}
458
The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  For this, we primarily rely on the Pablo
462compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
463statements expressed as arbitrary-length bitstream equations, Pablo
464produces block-at-a-time C++ code that initializes and maintains all the necessary
465carry bits for each of the additions and shifts involved in the
466bitstream calculations.   
467
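To illustrate the block-at-a-time model, the following hand-written sketch
(ours; it is not the code Pablo generates, which operates on SIMD registers
and manages its carry variables automatically) shows the two kinds of state
that must be carried from one 64-bit block to the next: the bit shifted across
the block boundary by an advance, and the carry out of each addition.
\begin{verbatim}
#include <cstdint>

// State carried across 64-bit block boundaries for one
// advance and one addition.
struct Carries {
  uint64_t adv = 0;  // bit shifted out of previous block
  uint64_t add = 0;  // carry out of previous addition
};

// Advance by one position, bringing in the boundary bit.
inline uint64_t advance(uint64_t M, Carries &s) {
  uint64_t r = (M << 1) | s.adv;
  s.adv = M >> 63;
  return r;
}

// 64-bit addition with carry-in and carry-out.
inline uint64_t add_ci_co(uint64_t X, uint64_t Y,
                          Carries &s) {
  uint64_t R = X + Y + s.add;
  s.add = ((X & Y) | ((X | Y) & ~R)) >> 63;
  return R;
}

// MatchStar across blocks uses the carried addition.
inline uint64_t MatchStar(uint64_t M, uint64_t C,
                          Carries &s) {
  return (add_ci_co(M & C, C, s) ^ C) | M;
}
\end{verbatim}
Pablo emits code of this general shape for every shift and addition appearing
in the input equations, so the programmer writes only the unbounded-stream
equations.
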
468In the present work, our principal contribution to the block-at-a-time
469model is the technique of long-stream addition described below.
470Otherwise, we were able to use Pablo directly in compiling our
471SSE2 and AVX2 implementations.   Our GPU implementation required
472some scripting to modify the output of the Pablo compiler for our
473purpose.
474
475\paragraph*{Long-Stream Addition.}  The maximum word size for
476addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 bits or larger,
478a method for propagating carries through the individual stages of
47964-bit addition is required.  However, the normal technique of
480sequential addition using add-with-carry instructions, for example,
481is far from ideal.
482
483We have developed a technique using SIMD or SIMT methods for constant-time
484long-stream addition up to 4096 bits.   
485We assume the availability of the following SIMD/SIMT operations
486operating on vectors of $f$ 64-bit fields.
487\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields;
\item  \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value;
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value;
\item normal bitwise logic operations on $f$-bit masks; and
\item  \verb#simd<64>::spread(x)#: distributing the bits of
an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
498\end{itemize}
499
Given these operations, our method for long stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following
(a simple scalar model of these steps is sketched after the enumeration).
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
504\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
505
506\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
507\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
508\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
509\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
510
511\item Compute an $f$-bit mask of carries generated for each of the
51264-bit additions of \verb:X: and \verb:Y:.
513\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
514
515\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
516an incoming carry bit.  This is the {\em bubble mask}.
517\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
518
519\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
520incremented to produce the final sum.  Here we find a new application of
521MatchStar!
522\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
523
524This is the key step.  The mask {\tt c} of outgoing carries must be
525shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
526with the next digit.  At the incoming position, the carry will
527increment the 64-bit digit.   However, if this digit is all ones (as
signalled by the corresponding bit of bubble mask {\tt b}), then the addition
529will generate another carry.  In fact, if there is a sequence of
530digits that are all ones, then the carry must bubble through
531each of them.   This is just MatchStar!
532
533\item Compute the final result {\tt Z}.
534\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
535
536\end{enumerate}
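A scalar model of these six steps for $f = 8$ fields is sketched below; the
helper names and array-based formulation are ours, whereas an actual
implementation applies the SIMD/SIMT operations listed above to full vector
registers.
\begin{verbatim}
#include <cstdint>

const int f = 8;   // 64-bit fields (512 bits total)

inline uint64_t MatchStar(uint64_t M, uint64_t C) {
  return (((M & C) + C) ^ C) | M;
}

// Z = X + Y over f x 64 bits, following steps 1-6.
void long_add(const uint64_t X[f], const uint64_t Y[f],
              uint64_t Z[f]) {
  uint64_t R[f];
  uint64_t x = 0, y = 0, r = 0, b = 0;   // f-bit masks
  for (int k = 0; k < f; k++) {
    R[k] = X[k] + Y[k];                  // 1. 64-bit sums
    x |= (X[k] >> 63) << k;              // 2. high-bit masks
    y |= (Y[k] >> 63) << k;
    r |= (R[k] >> 63) << k;
    b |= (uint64_t)(R[k] == ~0ULL) << k; // 4. bubble mask
  }
  uint64_t c = (x & y) | ((x | y) & ~r); // 3. carries
  uint64_t i = MatchStar(c << 1, b);     // 5. increments
  for (int k = 0; k < f; k++)
    Z[k] = R[k] + ((i >> k) & 1);        // 6. final sum
}
\end{verbatim}
Field 0 is the least significant here; Figure \ref{fig:longadd} displays the
most significant field on the left, so the shift {\tt c*2} appears there as a
move of one field to the left.
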
537\begin{figure}
538\begin{center}
539\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
540{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
541{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
542{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
543{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
544{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
545{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
546{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
547{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
548{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
549{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
550{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
551\end{tabular}
552\end{center}
553\caption{Long Stream Addition}\label{fig:longadd}
554\end{figure}
555
Figure \ref{fig:longadd} illustrates the process using 8-bit fields
rather than 64-bit fields, with all field values shown in
hexadecimal notation.  Note that
559two of the individual 8-bit additions produce carries, while two
560others produce {\tt FF} values that generate bubble bits.  The
561net result is that four of the original 8-bit sums must be
562incremented to produce the long stream result.
563
564A slight extension to the process produces a long-stream full adder
565that can be used in chained addition.  In this case, the
566adder must take an additional carry-in bit
567{\tt p} and produce a carry-out bit {\tt q}.
568This may be accomplished by incorporating {\tt p}
569in calculating the increment mask in the low bit position,
570and then extracting the carry-out {\tt q} from the high bit position.
571\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
572\[\text{\tt q} = \text{\tt i >> f}\]
573
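In the same scalar model, the chained variant needs only the small change
below (again with our own helper names, reusing \verb:MatchStar: and $f$ from
the earlier sketch).
\begin{verbatim}
// Chained (full adder) variant: given the carry mask c and
// bubble mask b computed as in long_add, the incoming carry
// p (0 or 1) joins the shifted carries, and the outgoing
// carry q is read from bit position f of the increment mask.
uint64_t increments(uint64_t c, uint64_t b, uint64_t p,
                    uint64_t *q) {
  uint64_t i = MatchStar((c << 1) | p, b); // c*2 + p
  *q = i >> f;                             // q = i >> f
  return i;   // field k is incremented if bit k is set
}
\end{verbatim}
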
As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to additional levels.  Using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition;
for example, 256-bit and 4096-bit additions each require
$\lceil\log_{64}{256}\rceil = \lceil\log_{64}{4096}\rceil = 2$ levels.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT-style coordination at this level is an open question.
582
583Using the methods outlined, it is quite conceivable that instruction
584set extensions to support long-stream addition could be added for
585future SIMD and GPU processors.   Given the fundamental nature
586of addition as a primitive and its novel application to regular
587expression matching as shown herein, it seems reasonable to expect
588such instructions to become available.
589
590
591
592\input{sse2}
593
594\input{analysis}
595
596\input{avx2}
597
598
599
600\section{GPU Implementation}\label{sec:GPU}
601
602To further assess the scalability of our regular expression matching
603using bit-parallel data streams, we implemented a GPGPU version
604in OpenCL.   
605We arranged for 64 work groups each having 64
threads.  Input files are divided in a data-parallel fashion among
607the 64 work groups.  Each work group carries out the regular
608expression matching operations 4096 bytes at a time using SIMT
609processing.  Figure \ref{fig:GPUadd} shows our implementation
610of long-stream addition on the GPU.  Each thread maintains
611its own carry and bubble values and performs synchronized
612updates with the other threads using a six-step parallel-prefix
613style process.
614
615Our GPU test machine was an AMD A10-5800K APU with Radeon(tm) HD Graphics
having a processor speed of 4.30 GHz and 32.0 GB of memory.
617
618
619\begin{figure*}[tbh]
620\begin{center}
621\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry,
                    __local BitBlock *bubble, BitBlock *group_carry, const int carryno){
        BitBlock carry_mask;
        BitBlock bubble_mask;

        /* Each thread adds one 64-bit field of the 4096-bit operands. */
        BitBlock partial_sum = a+b;
        BitBlock gen = a&b;
        BitBlock prop = a^b;
        /* Position this thread's carry-out and bubble (all-ones) bits at bit idx. */
        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);

        /* OR-reduce the carry and bubble bits across the 64 threads. */
        barrier(CLK_LOCAL_MEM_FENCE);
        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
                carry[idx] = carry[idx]|carry[idx^offset];
                bubble[idx] = bubble[idx]|bubble[idx^offset];
                barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* Shift carries to the next field and add the group carry-in (c*2 + p). */
        carry_mask = (carry[0]<<1)|group_carry[carryno];
        bubble_mask = bubble[0];

        /* MatchStar-style bubbling yields the fields to increment. */
        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
        BitBlock inc = s | (s-carry_mask);
        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
        /* Carry out of the 4096-bit sum, saved for the next block. */
        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
        return rslt;
}
649\end{verbatim}
650
651   
652
653\end{center}
654\caption{OpenCL 4096-bit Addition}
655\label{fig:GPUadd}
656\end{figure*}
657
658Figure \ref{fig:SSE-AVX-GPU} compares the performance of
659our SSE2, AVX and GPU implementations.
660
661\begin{figure}
662\begin{center}
663\begin{tikzpicture}
664\begin{axis}[
665xtick=data,
666ylabel=Running Time (ms per megabyte),
667xticklabels={@,Date,Email,URIorEmail,xquote},
668tick label style={font=\tiny},
669enlarge x limits=0.15,
670enlarge y limits={0.15, upper},
671ymin=0,
672legend style={at={(0.5,-0.15)},
673anchor=north,legend columns=-1},
674ybar,
675bar width=7pt,
676]
677\addplot
678file {data/ssetime.dat};
679\addplot
680file {data/avxtime.dat};
681\addplot
682file {data/gputime.dat};
683
\legend{SSE2,AVX2,GPU}
685\end{axis}
686\end{tikzpicture}
687\end{center}
688\caption{Running Time}\label{fig:SSE-AVX-GPU}
689\end{figure}
690
691
692
693
694
695
696
697
698\input{conclusion}
699
700
701
702%\appendix
703%\section{Appendix Title}
704
705%This is the text of the appendix, if you need one.
706
707\acks
708
709This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
710MITACS, Inc.
711
712% We recommend abbrvnat bibliography style.
713
714\bibliographystyle{abbrvnat}
715
716% The bibliography should be embedded for final submission.
717 
718\bibliography{reference}
719
720%\begin{thebibliography}{}
721%\softraggedright
722
723%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
724%P. Q. Smith, and X. Y. Jones. ...reference text...
725%
726%\end{thebibliography}
727
728
729\end{document}
730
731