%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
% http://www.springer.com/lncs       Springer Heidelberg 2006/05/04
%
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
%
% NB: the document class 'llncs' has its own and detailed documentation, see
% ftp://ftp.springer.de/data/pubftp/pub/tex/latex/llncs/latex2e/llncsdoc.pdf
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\documentclass[runningheads,a4paper]{llncs}

\usepackage{amssymb}
\usepackage{amsmath}
\setcounter{tocdepth}{3}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{pgfplots}

\usepackage{url}
\urldef{\mails}\path|{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram}@cs.sfu.ca|
\newcommand{\keywords}[1]{\par\addvspace\baselineskip
\noindent\keywordname\enspace\ignorespaces#1}

\begin{document}

\mainmatter  % start of an individual contribution

% first the title is needed
\title{Bitwise Data Parallelism in Regular Expression Matching}

% a short form should be given in case it is too long for the running head
\titlerunning{Bitwise Data Parallel Grep}

% the name(s) of the author(s) follow(s) next
%
% NB: Chinese authors should write their first names(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
%
\author{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
%          {Simon Fraser University}
%           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

%
\authorrunning{Cameron et al}
% (feature abused for this document to repeat the title also on left hand pages)

% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
\institute{Simon Fraser University\\
\mails\\
}

%
% NB: a more complex sample for affiliations and the mapping to the
% corresponding authors can be found in the file "llncs.dem"
% (search for the string "\mainmatter" where a contribution starts).
% "llncs.dem" accompanies the document class "llncs.cls".
%

\toctitle{Bitwise Data Parallelism in Regular Expression Matching}
\tocauthor{Cameron, et al}
\maketitle

\begin{abstract}
\input{abstract}
\end{abstract}

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFAs).
Following Thompson's approach, a regular expression of length $m$ is converted
to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
NFA in worst-case $O(mn)$ time. Often, a more efficient choice
is to convert an NFA into a DFA. A DFA has a single active state at any time
throughout the matching process and
hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion, that is, the number of resulting DFA states may increase exponentially.}.

A significant portion of the research in fast regular expression matching can be
regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet
presented a new approach to string search based on bit-parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word.
Given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
Using this fact, the Shift-Or algorithm simulates an NFA using
bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
A disadvantage of the bit-parallel Shift-Or pattern matching approach
is its inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
to achieve sublinear times in the average case.
Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
based on suffix automata are also able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of the Shift-Or approach
with the character-skipping property of BDM algorithms. The nrgrep pattern matching tool
is built over the BNDM algorithm. Prior to the data-parallel approach to simultaneous
processing of data stream elements as discussed herein, nrgrep
was by far the fastest grep tool
for matching complex patterns and achieved performance similar
to that of the fastest existing string
matching tools for simple patterns \cite{navarro2000}.
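As a concrete illustration of this style of bit-parallelism, the following Python sketch of the classical Shift-Or algorithm (written for this presentation; it is not code from agrep, nrgrep or any other tool discussed here) keeps one state bit per pattern position and uses 0 bits to mark viable partial matches:

```python
def shift_or_search(pattern, text):
    # Precompute, for each pattern character, a mask with a 0 bit
    # at every pattern position where that character occurs.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, -1) & ~(1 << i)
    m = len(pattern)
    state = -1          # all 1s: no active partial matches
    ends = []
    for j, ch in enumerate(text):
        # Advance every partial match by one position in a single
        # bitwise step; a 0 in bit i means pattern[0:i+1] just matched.
        state = (state << 1) | masks.get(ch, -1)
        if (state >> (m - 1)) & 1 == 0:
            ends.append(j)  # full match ending at position j
    return ends
```

Each input character is processed with a constant number of word operations for patterns of up to $w$ positions, which is the source of the $O(nm/w)$ bound cited above.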

There has been considerable
interest in accelerating regular expression matching
on parallel hardware
such as multi-core processors (CPUs), graphics processing units (GPUs),
field-programmable gate arrays (FPGAs), and specialized architectures such as
the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed on multicore hardware.
In related work, Pasetto et al. presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited for parallel
implementation on multi-core CPUs, GPUs and the Cell BE.
On each platform, both thread-level parallelism (cores) and data-level parallelism
(SIMD units) were leveraged for performance.
Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multi-core processors to achieve a speedup of greater than 1.8 over a
range of data sizes of interest.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
that delivered 1.00--1.78 Gbps on a single
Cell BE chip and extended this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieved
4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching on GPUs. Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps throughput, 14.74 times faster
than the AC algorithm performed on a four-core
multi-core processor using OpenMP \cite{lin2013accelerating}.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data-parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive, MatchStar,
that can be used in full generality for nondeterministic
regular expression matching.  We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.  We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.

There is also a strong terminological connection between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.  Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques, including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit addition with GPGPU SIMT.
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on the Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPGPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expression Notation and Grep}\label{sec:grep}

We follow common POSIX notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.  Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.  The special characters must be escaped
with a backslash to prevent interpretation as metacharacters; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example, \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated; thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation; thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms; thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this; thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene star \verb:*: specifies zero or more occurrences
of strings matching the previous pattern, while \verb:+: specifies one or
more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero or one occurrence of strings matching
the element.  Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.
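The behaviour of these notations can be checked against any conventional regular expression engine; for instance, the following illustrative aside uses Python's re module (not one of the tools evaluated in this paper) to confirm the examples above:

```python
import re

# Each pattern below is drawn from the examples in the text.
assert re.fullmatch(r"(ab|cd)[0-9]", "cd7")          # alternation under concatenation
assert re.fullmatch(r"[a-z]+", "cat")                # one or more lower-case letters
assert re.fullmatch(r"(ab){3}", "ababab")            # exactly three repetitions
assert re.fullmatch(r"[0-9A-Fa-f]{2,4}", "beef")     # two to four hexadecimal digits
assert not re.fullmatch(r"[^0-9]", "5")              # negated class excludes digits
```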

The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match respectively the beginning or the
end of a line.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.

\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.  In essence, data streams are viewed
as (very large) integers.  The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.  Within each block, all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.  On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.
In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
On the
latest processors supporting the 256-bit AVX2 SIMD operations,
we also use the Parabix tool chain, but substitute a parallelized
long-stream addition technique to avoid the sequential chaining
of four 64-bit additions.
Our GPGPU implementation uses scripts to modify the output
of the Parabix tools, effectively dividing the input into blocks
of 4K bytes.
We have also adapted our long-stream addition technique
to perform 4096-bit additions using 64 threads working in lock-step
SIMT fashion.  A similar technique is known to the GPU programming
community.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$B_7$ & \verb`.................................`\\
$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
$B_5$ & \verb`111111111111111111111111111111111`\\
$B_4$ & \verb`.11111...111....1....11.....111..`\\
$B_3$ & \verb`......11..11111.1111...11.....111`\\
$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
$B_1$ & \verb`...1....11.1....1........11.111..`\\
$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
\verb:[a]: & \verb`1..............1....1......1.....`\\
\verb:[z]: & \verb`...........1....1.............1..`\\
\verb:[0-9]: & \verb`.1111....11..........1......11...`
\end{tabular}
\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.  Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.  However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.  In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.  These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams in which bit $j$ specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.
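As an illustration of the idea, modeling unbounded bit streams as Python integers with character position $j$ at bit $j$, transposition and one possible bitwise-logic formula for the class [0-9] can be sketched as follows (a model for exposition only, not the Parabix implementation):

```python
def transpose(data):
    # Transpose a byte stream into eight basis bit streams:
    # stream i holds bit i of each byte, with character j at bit j.
    B = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            B[i] |= ((byte >> i) & 1) << j
    return B

def digit_class(B):
    # [0-9] by bitwise logic alone: high nibble 0011 (0x30..0x3F),
    # excluding the low nibbles 1010 through 1111 (0x3A..0x3F).
    high_3 = ~B[7] & ~B[6] & B[5] & B[4]
    return high_3 & ~(B[3] & (B[2] | B[1]))
```

Applied to the input line of Figure \ref{fig:streams}, transpose reproduces the basis streams $B_7$ through $B_0$ and digit_class reproduces the \verb:[0-9]: row.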

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$M_1$ & \verb`.1..............1....1......1....`\\
$M_2$ & \verb`.11111..........1....11.....111..`\\
$M_3$ & \verb`.................1.............1.`
\end{tabular}
\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*z}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams.}  Now consider how bit-parallel data
streams can be used in regular expression matching.  Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*z:.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.  In this case there are three marker streams computed
during the match process, namely,
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:) and finally $M_3$ the stream
marking positions after a final \verb:z: has been found.
Deferring for the moment the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Note our convention that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.


\paragraph*{MatchStar.}
MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
$M_1$ & \verb`.1..............1....1......1....`\\
$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
\end{tabular}
\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, D)$}
\label{fig:matchstar}
\end{figure}


Figure \ref{fig:matchstar} illustrates the MatchStar method.  In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order, reflecting the
conventional presentation of our character data input.  However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that when we perform bitstream addition, we will show bit movement from left to right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions: immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained, and the undesired 1 bits removed, by XORing with the character class stream.  $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.
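Modeling unbounded streams as Python integers (bit $j$ = character position $j$, so a left shift advances markers by one position), the defining equation reproduces the marker streams of Figure \ref{fig:streams2} from the character class streams of Figure \ref{fig:streams}. The shift-based advance steps below are our illustrative reading of the marker convention, not the Parabix compilation scheme itself:

```python
def bits(s):
    # '.'/'1' notation from the figures; leftmost character is bit 0
    return sum(1 << i for i, ch in enumerate(s) if ch == '1')

def match_star(M, C):
    return (((M & C) + C) ^ C) | M

A = bits("1..............1....1......1.....")   # [a]
D = bits(".1111....11..........1......11...")   # [0-9]
Z = bits("...........1....1.............1..")   # [z]

M1 = A << 1                # markers just past each initial 'a'
M2 = match_star(M1, D)     # extend through zero or more digits
M3 = (M2 & Z) << 1         # markers just past a final 'z': matches found
```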

\input{compilation}

\input{re-Unicode}

\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  In this, we primarily rely on the Pablo
compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the Parabix tool
chain is to incorporate the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.  Our GPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition.}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 bits or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential addition using add-with-carry instructions, for example,
is far from ideal.

We have developed a technique using SIMD methods for constant-time
long-stream addition up to 4096 bits.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item  \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value,
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(x)#: distributing the bits of
an $f$-bit mask, one bit each, to the $f$ 64-bit fields of a vector.
\end{itemize}

Given these operations, our method for long-stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is the {\em bubble mask}.
\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar.
\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.  However, if this digit is all ones (as
signalled by the corresponding bit of bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.  This is just MatchStar.

\item Compute the final result {\tt Z}.
\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]

\end{enumerate}
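The six steps can be simulated directly. The following Python model (for exposition only) uses 8-bit fields in place of 64-bit fields so that the worked example of Figure \ref{fig:longadd} can be checked by hand; field index 0 is the least significant field, i.e., the rightmost column of that figure:

```python
F, W = 8, 8                 # f fields of W bits each (W = 64 on real hardware)
MASK = (1 << W) - 1

def match_star(m, c):
    return (((m & c) + c) ^ c) | m

def hmask(V):
    # hsimd::mask: gather the high bit of each field into an f-bit mask
    return sum(((v >> (W - 1)) & 1) << k for k, v in enumerate(V))

def long_add(X, Y):
    R = [(xf + yf) & MASK for xf, yf in zip(X, Y)]      # step 1: field sums
    x, y, r = hmask(X), hmask(Y), hmask(R)              # step 2: high-bit masks
    c = (x & y) | ((x | y) & ~r)                        # step 3: outgoing carries
    b = sum((v == MASK) << k for k, v in enumerate(R))  # step 4: bubble mask
    i = match_star(c * 2, b)                            # step 5: increment mask
    return [(v + ((i >> k) & 1)) & MASK                 # step 6: spread and add
            for k, v in enumerate(R)]
```

On the example of Figure \ref{fig:longadd}, with fields listed from low to high, this model reproduces the row {\tt Z}.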
\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long Stream Addition}\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process, using 8-bit fields rather than 64-bit fields
and showing all field values in hexadecimal notation.  Note that
two of the individual 8-bit additions produce carries, while two
others produce {\tt FF} values that generate bubble bits.  The
net result is that four of the original 8-bit sums must be
incremented to produce the long-stream result.

A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the
adder must take an additional carry-in bit
{\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
\[\text{\tt q} = \text{\tt i >> f}\]
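The full-adder extension can be modeled in the same way; here with 4 fields of 4 bits so the carry chain is easy to trace (again an expository sketch, with field index 0 as the low field):

```python
F, W = 4, 4                 # 4 fields of 4 bits each, small enough to check by hand
MASK = (1 << W) - 1

def match_star(m, c):
    return (((m & c) + c) ^ c) | m

def long_add_with_carry(X, Y, p):
    # Long-stream full adder: returns (Z, q) for carry-in p, carry-out q.
    R = [(xf + yf) & MASK for xf, yf in zip(X, Y)]
    hi = lambda V: sum(((v >> (W - 1)) & 1) << k for k, v in enumerate(V))
    x, y, r = hi(X), hi(Y), hi(R)
    c = (x & y) | ((x | y) & ~r)
    b = sum((v == MASK) << k for k, v in enumerate(R))
    i = match_star(c * 2 + p, b)        # carry-in enters at the low bit position
    Z = [(v + ((i >> k) & 1)) & MASK for k, v in enumerate(R)]
    return Z, (i >> F) & 1              # carry-out extracted from the high bit position
```

For example, adding a carry-in of 1 to the value {\tt 0xFFFF} held as four nibbles yields zero with a carry-out of 1, the bubble propagating through all four fields.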

As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to additional levels.  Using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT-style coordination at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction
set extensions to support long-stream addition could be added to
future SIMD and GPU processors.  Given the fundamental nature
of addition as a primitive and its particular application to regular
expression matching as shown herein, it seems reasonable to expect
such instructions to become available.


\input{sse2}

\input{analysis}

\input{avx2}

\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPGPU version
in OpenCL.
We arranged for 64 work groups, each having 64
threads.  Input files are divided in a data-parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.  Figure \ref{fig:GPUadd} shows our implementation
of long-stream addition on the GPU.  Each thread maintains
its own carry and bubble values and performs synchronized
updates with the other threads using a six-step parallel-prefix
style process.

Our GPU test machine was an AMD A10-5800K APU with Radeon(tm) HD Graphics,
having a processor speed of 4.30 GHz and 32.0 GB of memory.

\begin{figure*}[tbh]
\begin{center}\small
\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry,
                    __local BitBlock *bubble, BitBlock *group_carry,
                    const int carryno){
    BitBlock carry_mask;
    BitBlock bubble_mask;

    BitBlock partial_sum = a + b;
    BitBlock gen = a & b;
    BitBlock prop = a ^ b;
    carry[idx] = ((gen | (prop & ~partial_sum)) & CARRY_BIT_MASK)
                 >> (WORK_GROUP_SIZE - 1 - idx);
    bubble[idx] = (partial_sum + 1) ? 0 : (((BitBlock) 1) << idx);

    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = WORK_GROUP_SIZE/2; offset > 0; offset = offset >> 1) {
        carry[idx] = carry[idx] | carry[idx ^ offset];
        bubble[idx] = bubble[idx] | bubble[idx ^ offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    carry_mask = (carry[0] << 1) | group_carry[carryno];
    bubble_mask = bubble[0];

    BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
    BitBlock inc = s | (s - carry_mask);
    BitBlock rslt = partial_sum + ((inc >> idx) & 0x1);
    group_carry[carryno] = (carry[0] | (bubble_mask & inc)) >> 63;
    return rslt;
}
\end{verbatim}
\end{center}
\caption{OpenCL 4096-bit Addition}
\label{fig:GPUadd}
\end{figure*}

Figure \ref{fig:SSE-AVX-GPU} compares the performance of
our SSE2, AVX2 and GPU implementations.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URIorEmail,xquote},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
\addplot
file {data/ssetime.dat};
\addplot
file {data/avxtime.dat};
\addplot
file {data/gputime.dat};

\legend{SSE2,AVX2,GPU}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}


\input{conclusion}


%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.


This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{splncs}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}


\end{document}