source: docs/Working/re/ppopp-re.tex @ 3502

Last change on this file since 3502 was 3502, checked in by cameron, 6 years ago

Updates to section 1; analysis section

File size: 32.5 KB
RevLine 
[3459]1%-----------------------------------------------------------------------------
2%
3%               Template for sigplanconf LaTeX Class
4%
5% Name:         sigplanconf-template.tex
6%
7% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
8%               file for SIGPLAN conference proceedings.
9%
10% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
11%               sigplanconf-guide.pdf
12%
13% Author:       Paul C. Anagnostopoulos
14%               Windfall Software
15%               978 371-2316
16%               paul@windfall.com
17%
18% Created:      15 February 2005
19%
20%-----------------------------------------------------------------------------
21
22
[3490]23\documentclass[preprint]{sigplanconf}
[3459]24
25% The following \documentclass options may be useful:
26
27% preprint      Remove this option only once the paper is in final form.
28% 10pt          To set in 10-point type instead of 9-point.
29% 11pt          To set in 11-point type instead of 9-point.
30% authoryear    To obtain author/year citation style instead of numeric.
31
32\usepackage{amsmath}
[3482]33\usepackage{pgfplots}
[3459]34
35\begin{document}
36
37\special{papersize=8.5in,11in}
38\setlength{\pdfpageheight}{\paperheight}
39\setlength{\pdfpagewidth}{\paperwidth}
40
[3490]41\conferenceinfo{PPoPP 2014}{February 15-19, 2014, Orlando, Florida, United States} 
[3459]42\copyrightyear{2013} 
43\copyrightdata{978-1-nnnn-nnnn-n/yy/mm} 
44\doi{nnnnnnn.nnnnnnn}
45
46% Uncomment one of the following two, if you are not going for the
47% traditional copyright transfer agreement.
48
49%\exclusivelicense                % ACM gets exclusive license to publish,
50                                  % you retain copyright
51
52%\permissiontopublish             % ACM gets nonexclusive license to publish
53                                  % (paid open-access papers,
54                                  % short abstracts)
55
[3490]56\titlebanner{}        % These are ignored unless
57\preprintfooter{Bitwise Data Parallel Grep}   % 'preprint' option specified.
[3459]58
59\title{Bitwise Data Parallelism in Regular Expression Matching}
[3490]60%\subtitle{Subtitle Text, if any}
[3459]61
[3490]62\authorinfo{Anonymous Authors}{Institutions}{emails}
63%\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
64%          {Simon Fraser University}
65%           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}
[3459]66
67\maketitle
68
69\begin{abstract}
70\input{abstract}
71\end{abstract}
72\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
73\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}
74
75% general terms are not compulsory anymore,
76% you may leave them out
77%\terms
78%term1, term2
79
80\keywords
81regular expression matching, grep, parallel bit stream technology
82
83\section{Introduction}
84
[3464]85The use of regular expressions to search texts for occurrences
86of string patterns has a long history and
87remains a pervasive technique throughout computing applications today.
[3468]88% {\em a brief history}
89The origins of regular expression matching date back to automata theory
90developed by Kleene in the 1950s \cite{kleene1951}.
91Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
92to nondeterministic finite automata (NFA).
[3490]93Following Thompson's approach, a regular expression of length $m$ is first converted
94to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the
95NFA in worst case $O(mn)$ time. Often, a more efficient choice
[3468]96is to convert the NFA into a DFA. A DFA has only a single active state at any time
97in the matching process and
[3490]98hence it is possible to search a text at of length $n$ in $O(n)$ time.
[3468]99However, it is well known that the conversion of an NFA to an equivalent DFA may result
100in state explosion. That is, the number of resultant DFA states may increase exponentially.
101In \cite{Baeza-yates_anew} a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}.
102This technique takes advantage of the intrinsic parallelism of bitwise operations
[3490]103within a computer word. Given a $w$-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} algorithm uses the
[3468]104bit-parallel approach to
[3490]105simulate an NFA in $O(nm/w)$ worst-case time.
[3459]106
[3468]107A disadvantage of the bit-parallel Shift-Or pattern matching approach
108in comparison to simple string matching algorithms is an inability to skip input characters.
109For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skip input characters
110to achieve sublinear times in the average case. Backward Dawg Matching
111(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are able to skip characters.
112The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
113combines the bit-parallel advantages of Shift-Or and with the character skipping advantages of the BDM algorithm.
114The nrgrep pattern matching tool is built over the BNDM algorithm,
115and hence the name nrgrep \cite{navarro2000}.
116
[3464]117There has been considerable interest in using parallelization techniques
[3471]118to improve the performance of regular expression matching on parallel hardware
119such as multi-core processors (CPUs), graphics processing units (GPUs),
[3490]120field-programmable gate arrays (FPGAs), and specialized architectures such as
[3471]121the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
122%CPU
123Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
124text processing algorithms that exhibit irregular memory access patterns
125can be efficiently executed on multicore hardware.
126In related work, Pasetto et al. presented a flexible tool that
[3468]127performs small-ruleset regular expression matching at a rate of
1282.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
[3471]129Naghmouchi et al. demonstrated that the Aho-Corasick (AC)
130string matching algorithm \cite{aho1975} is well suited for parallel
131implementation on multi-core CPUs, GPUs and the Cell BE \cite{scarpazza2011top, naghmouchi2010}.
132On each hardware, both thread-level parallelism (additional cores) and data-level parallelism
133(wide SIMD units) are leveraged for performance.
134Salapura et. al., advocated the use of vector-style processing for regular expressions
135in business analytics applications and leveraged the SIMD hardware available
136on multi-core processors to acheive a speedup of better than 1.8 over a
137range of data sizes of interest \cite{salapura2012accelerating}.
138%Cell
139In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
140that delivered 1.00–1.78 Gbps on a single
141Cell BE chip and extended this approach for emulation on the Intel Larrabee
142instruction set \cite{scarpazza2009larrabee}.
143On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
144implementation that delivered a throughput of 40
145Gbps for a small dictionary of approximately 100 patterns, and a throughput of 3.3-3.4
146Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
147presented a string matching implementation for automata that achieves
1484 Gbps on the Cell BE.
149% GPU
150In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
151implementation of the AC algorithm for
152accelerating string matching on GPUs. Lin et al., proposed
153the Parallel Failureless Aho-Corasick (PFAC)
154algorithm to accelerate pattern matching on GPU hardware and
155achieved 143 Gbps throughput, 14.74 times faster
156than the AC algorithm performed on a four core
157multi-core processor using OpenMP \cite{lin2013accelerating}.
[3468]158
[3464]159Whereas the existing approaches to parallelization have been
160based on adapting traditional sequential algorithms to emergent
161parallel architectures, we introduce both a new algorithmic
162approach and its implementation on SIMD and GPU architectures.
163This approach relies on a bitwise data parallel view of text
[3466]164streams as well as a surprising use of addition to match
[3486]165runs of characters in a single step.  The closest previous
[3466]166work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
[3464]167technology together with a parallel scanning primitive also
168based on addition \cite{cameron2011parallel}.   
169However, in contrast to the deterministic, longest-match
170scanning associated with the ScanThru primitive of that
171work, we introduce here a new primitive MatchStar
172that can be used in full generality for nondeterministic
173regular expression matching.   We also introduce a long-stream
174addition technique involving a further application of MatchStar
175that enables us to scale the technique to $n$-bit addition
[3486]176in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
[3466]177for example, to perform
178synchronized 4096-bit addition on GPU warps of 64 threads.
[3459]179
[3464]180There is also a strong keyword match between the bit-parallel
[3466]181data streams used in our approach and the bit-parallelism
182used for NFA state transitions in the classical algorithms of
183Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new}
184and Navarro and Raffinot \cite{navarro1998bit}.
[3464]185However those algorithms use bit-parallelism in a fundamentally
[3466]186different way: representing all possible current NFA states
187as a bit vector and performing parallel transitions to a new
188set of states using table lookups and bitwise logic.    Whereas
189our approach can match multiple characters per step, bit-parallel
190NFA algorithms proceed through the input one byte at a time.
191Nevertheless, the agrep \cite{wu1992agrep} and
192nrgrep \cite{navarro2000} programs implemented using these techniques remain
193among the strongest competitors in regular expression matching
194performance, so we include them in our comparative evaluation.
[3459]195
196The remainder of this paper is organized as follows.
[3486]197Section \ref{sec:grep} briefly describes regular expression
198notation and the grep problem.
[3469]199Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
200using a model of arbitrary-length bit-parallel data streams.
201Section \ref{sec:blockwise} discusses the block-by-block
202implementation of our techniques including the long stream
203addition techniques for 256-bit addition with AVX2 and
2044096-bit additions with GPGPU SIMT.
[3502]205Section \ref{sec:SSE2} describes our overall SSE2 implementation
206and carries out a performance study in comparison with
207existing grep implementations.
208Given the dramatic variation in grep performance across
209different implementation techniques, expressions and data sets,
210Section \ref{sec:analysis} considers a comparison between
211the bit-stream and NFA approaches from a theoretical perspective.
212Section \ref{sec:AVX2} then examines and demonstrates
213the scalability of our
214bitwise data-parallel approach in moving from
215128-bit to 256-bit SIMD on Intel Haswell architecture.
216To further investigate scalability, Section \ref{sec:GPU} 
217addresses the implementation of our matcher using
218groups of 64 threads working together SIMT-style on a GPGPU system.
219Section \ref{sec:Concl} concludes the paper with a discussion of results and
220areas for future work.
[3459]221
[3486]222\section{Regular Expression Notation and Grep}\label{sec:grep}
[3459]223
[3486]224We follow common Posix notation for regular expressions.
225A regular expression specifies a set of strings through
226a pattern notation.   Individual characters normally
227stand for themselves, unless they are one of the
228special characters \verb:*+?[{\(|^$.: that serve as metacharacters
229of the notation system.  Thus the regular expression \verb:cat:
230is a pattern for the set consisting of the single 3-character
231string ``\verb:cat:''.   The special characters must be escaped
232with a backslash to prevent interpretation as metacharacter, thus
233\verb:\$: represents the dollar-sign and \verb:\\\\: represent
234the string consisting of two backslash characters.
235Character class bracket expressions are pattern elements
236that allow any character in a given class to be used in a particular
237context.  For example, \verb:[@#%]: is a regular expression
238that stands for any of the three given symbols.  Contiguous
239ranges of characters may be specified using hyphens;
240for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
241for any alphanumeric character or underscore.  If the
242caret character immediately follows the opening bracket,
243the class is negated, thus \verb:[^0-9]: stands for
244any character except a digit.  The period metacharacter
245\verb:.: stands for the class of all characters.
246
247Consecutive pattern elements stand for strings formed by
248concatenation, thus \verb:[cd][ao][tg]: stands for the
249set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
250The alternation operator \verb:|: allows a pattern to be
251defined to have to alternative forms, thus \verb:cat|dog:
252matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
253takes precedence over alternation, but parenthesis may be
254used to change this, thus \verb:(ab|cd)[0-9]: stands for any
255digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
256
257Repetition operators may be appended to a pattern to specify
258a variable number of occurrences of that pattern. 
259The Kleene \verb:*: specifies zero-or-more occurrences
260matching the previous pattern, while \verb:+: specifies one-or
261more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
262both specify the same set: strings of at least one lower-case
263letter.  The postfix operator \verb:?: specifies an optional
264component, i.e., zero-or-one occurrence of strings matching
265the element.  Specific bounds may be given within braces:
266\verb:(ab){3}: specifies the string ``\verb:ababab:'',
267\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
268or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
269strings of at least 4 consecutive capital letters.
270
271The grep program searches a file for lines containing matches
272to a regular expression using any of the above notations.
273In addition, the pattern elements \verb:^: and \verb:$:
274may be used to match respectively the beginning or the
275end of a line.  In line-based tools such as grep, \verb:.:
276matches any character except newlines; matches cannot extend
277over lines.
278Normally, grep prints all matching
279lines to its output.  However, grep programs typically
280allow a command line flag such as \verb:-c: to specify
281that only a count of matching lines be produced; we use
282this option in our experimental evaluation to focus
283our comparisons on the performance of the underlying matching
284algorithms.
285
[3469]286\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
[3459]287
[3469]288Whereas the traditional approaches to regular expression matching
289using NFAs, DFAs or backtracking all rely on a byte-at-a-time
290processing model, the approach  we introduce in this paper is based
291on quite a different concept:  a data-parallel approach to simultaneous
292processing of data stream elements.  Indeed, our most abstract model
293is that of unbounded data parallelism: processing all elements of
[3490]294the input data stream simultaneously.   In essence, data streams are viewed
295as (very large) integers.   The fundamental operations are bitwise
296logic, stream shifting and long-stream addition.
[3463]297
[3469]298Depending on the available parallel processing resources, an actual
299implementation may divide an input stream into blocks  and process
300the blocks sequentially.   Within each block  all elements of the
301input stream are processed together, relying the availability of
302bitwise logic and addition scaled to the block size.   On commodity
303Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
304we typically process input streams 128 bytes at a time.   In this
305case, we rely on the Parabix tool chain \cite{lin2012parabix}
306to handle the details of compilation to block-by-block processing.
[3490]307For our GPGPU implementation, we have developed a long-stream
308addition technique that allows us to perform 4096-bit additions
309using 64 threads working in lock-step SIMT fashion.  Using scripts
310to modify the output of the Parabix tools, we effectively divide
311the input into blocks of 4K bytes processed in a fully data-parallel
312manner.
[3469]313
[3491]314\begin{figure}[tbh]
315\begin{center}
316\begin{tabular}{cr}\\
317input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
318$B_7$ & \verb`.................................`\\
319$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
320$B_5$ & \verb`111111111111111111111111111111111`\\
321$B_4$ & \verb`.11111...111....1....11.....111..`\\
322$B_3$ & \verb`......11..11111.1111...11.....111`\\
323$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
324$B_1$ & \verb`...1....11.1....1........11.111..`\\
325$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
326\verb:[a]: & \verb`1..............1....1......1.....`\\
327\verb:[z]: & \verb`...........1....1.............1..`\\
328\verb:[0-9]: & \verb`.1111....11..........1......11...`
329\end{tabular}
330
331\end{center}
332\caption{Basis and Character Class Streams}
333\label{fig:streams}
334\end{figure}
335
[3469]336A key concept in this streaming approach is the derivation of bit streams
[3463]337that are parallel to the input data stream, i.e., in one-to-one
338correspondence with the data element positions of the input
339streams.   Typically, the input stream is a byte stream comprising
340the 8-bit character code units of a particular encoding such
341as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
342easily be used with wider code units such as the 16-bit code units of
343UTF-16.   In the case of a byte stream, the first step is to transpose
344the byte stream into eight parallel bit streams, such that bit stream
[3466]345$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
[3463]346a set of basis bit streams from which many other parallel bit
347streams can be calculated, such as character class bit
348streams such that each bit $j$ of the stream specifies
349whether character $j$ of the input stream is in the class
[3472]350or not.  Figure \ref{fig:streams} shows an example of an
351input byte stream in ASCII, the eight basis bit streams of the
[3490]352transposed representation, and the character class bit streams
353\verb:[a]:,
354\verb:[z]:, and
355\verb:[0-9]:
[3472]356that may be computed from the basis bit streams using bitwise logic.
[3491]357Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
[3472]358Transposition and character class construction are straightforward
359using the Parabix tool chain \cite{lin2012parabix}.
[3463]360
[3490]361\begin{figure}[tbh]
362\begin{center}
363\begin{tabular}{cr}\\
364input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
[3491]365$M_1$ & \verb`.1..............1....1......1....`\\
366$M_2$ & \verb`.11111..........1....11.....111..`\\
367$M_3$ & \verb`.................1.............1.`
[3490]368\end{tabular}
369
370\end{center}
[3491]371\caption{Marker Streams in Matching {\tt a[0-9]*z}}
372\label{fig:streams2}
[3490]373\end{figure}
374
[3472]375\paragraph*{Marker Streams.}  Now consider how bit-parallel data
376streams can be used in regular expression matching.   Consider
377the problem of searching the input stream of Figure \ref{fig:streams}
378to finding occurrence of strings matching
379the regular expression \verb:a[0-9]*z:.
380The matching process involves the concept of {\em marker streams}, that
381is streams that mark the positions of current matches during the
382overall process.  In this case there are three marker streams computed
383during the match process, namely,
384$M_1$ representing match positions after an initial \verb:a:
385character has been found, $M_2$ representing positions
386reachable from positions marked by $M_1$ by further matching zero or
387more digits (\verb:[0-9]*:) and finally $M_3$ the stream
388marking positions after a final \verb:z: has been found.
389Without describing the details of how these streams are computed
[3491]390for the time being, Figure \ref{fig:streams2} shows what each
[3472]391of these streams should be for our example matching problem.
392Note our convention that a marker stream contains a 1 bit
393at the next character position to be matched, that is,
394immediately past the last position that was matched.
[3463]395
[3459]396
[3472]397\paragraph*{MatchStar.}
398MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
[3459]399
[3491]400\begin{figure}[tbh]
[3472]401\begin{center}
402\begin{tabular}{cr}\\
[3491]403input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
404$M_1$ & \verb`.1..............1....1......1....`\\
405$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
406$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
407$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
408$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
409$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
[3472]410\end{tabular}
411
412\end{center}
[3491]413\caption{$M_2 = \text{MatchStar}(M_1, D)$}
414\label{fig:matchstar}
415\end{figure}
[3472]416
417
[3493]418Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
419it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
[3491]420conventional presentation of our character data input.   However, this reverses the normal
421order of presentation when considering bitstreams as numeric values.  The key point here is
422that when we perform bitstream addition, we will show bit movement from left-to-right.
[3493]423For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
[3491]424
[3493]425The first row of the figure is the input data,
426the second and third rows are the input bitstreams: the initial marker position bitstream and the
427character class bitstream for digits derived from input data. 
428
[3491]429In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
430The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
431output marker stream is obtained by ORing $T_2$ with the initial marker stream.
432
[3472]433In general, given a marker stream $M$ and a character class stream $C$,
434the operation of MatchStar is defined by the following equation. 
435\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
436Given a set of initial marker positions, the result stream marks
437all possible positions that can be reached by 0 or more occurrences
438of characters in class $C$ from each position in $M$
439
[3502]440\input{compilation}
[3472]441
442\section{Block-at-a-Time Processing}\label{sec:blockwise}
443
[3475]444The unbounded stream model of the previous section must of course
445be translated an implementation that proceeds block-at-a-time for
446realistic application.  In this, we primarily rely on the Pablo
447compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
448statements expressed as arbitrary-length bitstream equations, Pablo
449produces block-at-a-time C++ code that initializes and maintains all the necessary
450carry bits for each of the additions and shifts involved in the
451bitstream calculations.   
[3472]452
[3475]453In the present work, our principal contribution to the block-at-a-time
454model is the technique of long-stream addition described below.
455Otherwise, we were able to use Pablo directly in compiling our
456SSE2 and AVX2 implementations.   Our GPU implementation required
457some scripting to modify the output of the Pablo compiler for our
458purpose.
[3472]459
460\paragraph*{Long-Stream Addition.}  The maximum word size for
461addition on commodity processors is typically 64 bits.  In order
462to implement long-stream addition for block sizes of 256 or larger,
463a method for propagating carries through the individual stages of
46464-bit addition is required.  However, the normal technique of
465sequential addition using add-with-carry instructions, for example,
466is far from ideal.
467
468We have developed a technique using SIMD or SIMT methods for constant-time
469long-stream addition up to 4096 bits.   
470We assume the availability of the following SIMD/SIMT operations
471operating on vectors of $f$ 64-bit fields.
472\begin{itemize}
473\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
474in two vectors to produce a result vector of $f$ 64-bit fields.
475\item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields
476of \verb:x: each with the constant value -1 (all bits 1), producing
[3475]477an $f$-bit mask value,
[3472]478\item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit
479field into a single compressed $f$-bit mask value, and
[3490]480\item normal bitwise logic operations on $f$-bit masks, and
[3472]481\item  \verb#simd<64>::spread(x)# : distributing the bits of
[3490]482an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector.
[3472]483\end{itemize}
484
485Given these operations, our method for long stream addition of
486two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
487\begin{enumerate}
488\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
489\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
490
491\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
492\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
493\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
494\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
495
496\item Compute an $f$-bit mask of carries generated for each of the
49764-bit additions of \verb:X: and \verb:Y:.
498\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
499
500\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
[3475]501an incoming carry bit.  This is the {\em bubble mask}.
[3472]502\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
503
504\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
[3475]505incremented to produce the final sum.  Here we find a new application of
506MatchStar!
[3472]507\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
508
[3475]509This is the key step.  The mask {\tt c} of outgoing carries must be
510shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
511with the next digit.  At the incoming position, the carry will
512increment the 64-bit digit.   However, if this digit is all ones (as
513signalled by the corresponding bit of bubble mask {\tt b}, then the addition
514will generate another carry.  In fact, if there is a sequence of
515digits that are all ones, then the carry must bubble through
516each of them.   This is just MatchStar!
517
518\item Compute the final result {\tt Z}.
[3472]519\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
520
521\end{enumerate}
[3475]522\begin{figure}
523\begin{center}
524\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
525{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
526{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
527{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
528{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
529{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
530{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
531{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
532{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
533{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
534{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
535{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
536\end{tabular}
537\end{center}
538\caption{Long Stream Addition}\label{fig:longadd}
539\end{figure}
[3472]540
[3475]541Figure \ref{fig:longadd} illustrates the process.  In the figure,
542we illustrate the process with 8-bit fields rather than 64-bit fields
543and show all field values in hexadecimal notation.  Note that
544two of the individual 8-bit additions produce carries, while two
545others produce {\tt FF} values that generate bubble bits.  The
546net result is that four of the original 8-bit sums must be
547incremented to produce the long stream result.
[3472]548
[3475]549A slight extension to the process produces a long-stream full adder
550that can be used in chained addition.  In this case, the
551adder must take an additional carry-in bit
552{\tt p} and produce a carry-out bit {\tt q}.
553This may be accomplished by incorporating {\tt p}
554in calculating the increment mask in the low bit position,
555and then extracting the carry-out {\tt q} from the high bit position.
556\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
557\[\text{\tt q} = \text{\tt i >> f}\]
[3477]558
559As described subsequently, we use a two-level long-stream addition technique
560in both our AVX2 and GPU implementations.  In principle, one can extend
561the technique to additional levels.  Using 64-bit adders throughout,
[3486]562$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
[3477]563A three-level scheme could coordinate
56464 groups each performing 4096-bit long additions in a two-level structure.
565However, whether there are reasonable architectures that can support fine-grained
566SIMT style coordinate at this level is an open question.
567
568Using the methods outlined, it is quite conceivable that instruction
569set extensions to support long-stream addition could be added for
570future SIMD and GPU processors.   Given the fundamental nature
571of addition as a primitive and its novel application to regular
572expression matching as shown herein, it seems reasonable to expect
573such instructions to become available.
[3459]574
575
[3502]576\input{sse2}
[3459]577
[3502]578\input{analysis}
[3459]579
[3501]580\input{avx2}
[3486]581
582
[3482]583
[3459]584\section{GPU Implementation}\label{sec:GPU}
[3493]585
586To further assess the scalability of our regular expression matching
587using bit-parallel data streams, we implemented a GPGPU version
[3501]588in OpenCL.   
589We arranged for 64 work groups each having 64
[3493]590threads.  Input files are divided in data parallel fashion among
591the 64 work groups.  Each work group carries out the regular
592expression matching operations 4096 bytes at a time using SIMT
593processing.  Figure \ref{fig:GPUadd} shows our implementation
594of long-stream addition on the GPU.  Each thread maintains
595its own carry and bubble values and performs synchronized
596updates with the other threads using a six-step parallel-prefix
597style process.
598
[3501]599Our GPU test machine was an AMD A10-5800K APU with Radeon(tm) HD Graphics
600having a processor speed of 4.30 GHz and 32.0GB of memory.
[3493]601
602
603\begin{figure*}[tbh]
604\begin{center}
605\begin{verbatim}
606inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry, _
607                    _local BitBlock *bubble, BitBlock *group_carry, const int carryno){
608        BitBlock carry_mask;
609        BitBlock bubble_mask;
610
611        BitBlock partial_sum = a+b;
612        BitBlock gen = a&b;
613        BitBlock prop = a^b;
614        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
615        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);
616       
617        barrier(CLK_LOCAL_MEM_FENCE);
618        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
619                carry[idx] = carry[idx]|carry[idx^offset];
620                bubble[idx] = bubble[idx]|bubble[idx^offset];
621                barrier(CLK_LOCAL_MEM_FENCE);
622        }
623       
624        carry_mask = (carry[0]<<1)|group_carry[carryno];
625        bubble_mask = bubble[0];
626       
627        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
628        BitBlock inc = s | (s-carry_mask);
629        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
630        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
631        return rslt;
632}
633\end{verbatim}
634
635Figure \ref{fig:SSE-AVX-GPU} compares the performance of
636our SSE2, AVX and GPU implementations.   
637
638\end{center}
639\caption{OpenCL 4096-bit Addition}
640\label{fig:GPUadd}
641\end{figure*}
642
643
644
[3485]645\begin{figure}
646\begin{center}
647\begin{tikzpicture}
648\begin{axis}[
649xtick=data,
650ylabel=Running Time (ms per byte),
651xticklabels={@,Date,Email,URIorEmail,xquote},
652tick label style={font=\tiny},
653enlargelimits=0.15,
654legend style={at={(0.5,-0.15)},
655anchor=north,legend columns=-1},
656ybar,
657bar width=7pt,
658]
659\addplot
660file {data/ssetime.dat};
661\addplot
662file {data/avxtime.dat};
663\addplot
664file {data/gputime.dat};
[3459]665
[3485]666\legend{SSE2,AVX2,GPU,Annot}
667\end{axis}
668\end{tikzpicture}
669\end{center}
[3493]670\caption{Running Time}\label{fig:SSE-AVX-GPU}
[3485]671\end{figure}
672
673
674
675
[3459]676\section{Miscellaneous}
677\subsection{Skipping}
[3486]678\input{re-Unicode}
[3459]679
[3486]680\input{conclusion}
[3459]681
682
683
684%\appendix
685%\section{Appendix Title}
686
687%This is the text of the appendix, if you need one.
688
689\acks
690
691This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
692MITACS, Inc.
693
694% We recommend abbrvnat bibliography style.
695
696\bibliographystyle{abbrvnat}
697
698% The bibliography should be embedded for final submission.
699 
700\bibliography{reference}
701
702%\begin{thebibliography}{}
703%\softraggedright
704
705%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
706%P. Q. Smith, and X. Y. Jones. ...reference text...
707%
708%\end{thebibliography}
709
710
711\end{document}
712
713
Note: See TracBrowser for help on using the repository browser.