source: docs/Working/re/pact051-cameron.tex @ 3890

Last change on this file since 3890 was 3890, checked in by cameron, 5 years ago

Classification; IPC discussion

File size: 35.6 KB
Line 
1\documentclass{sig-alternate-2013}
2
3\newfont{\mycrnotice}{ptmr8t at 7pt}
4\newfont{\myconfname}{ptmri8t at 7pt}
5\let\crnotice\mycrnotice%
6\let\confname\myconfname%
7
8\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.}
9\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada. \\
10{\mycrnotice{Copyright is held by the owner/author(s). Publication rights licensed to ACM.}}}
11\copyrightetc{ACM \the\acmcopyr}
12\crdata{978-1-4503-2809-8/14/08\ ...\$15.00.\\
13http://dx.doi.org/10.1145/2628071.2628079}
14
15\clubpenalty=10000
16\widowpenalty = 10000
17\usepackage[normalem]{ulem}
18\usepackage{amssymb}
19\usepackage{amsmath}
20\usepackage{graphicx}
21\usepackage{tikz}
22\usepackage{multirow}
23\usepackage{pgfplots}
24
25\begin{document}
26
27\title{Bitwise Data Parallelism in Regular Expression Matching}
28
29\numberofauthors{1} %  in this sample file, there are a *total*
30% of EIGHT authors. SIX appear on the 'first-page' (for formatting
31% reasons) and the remaining two appear in the \additionalauthors section.
32%
33\author{
34% You can go ahead and credit any number of authors here,
35% e.g. one 'row of three' or two rows (consisting of one row of three
36% and a second row of one, two or three).
37%
38% The command \alignauthor (no curly braces needed) should
39% precede each author name, affiliation/snail-mail address and
40% e-mail address. Additionally, tag each line of
41% affiliation/address with \affaddr, and tag the
42% e-mail address with \email.
43%
44% 1st. author
45%\alignauthor
46Rob Cameron, Tom Shermer, Arrvindh Shriraman, Ken Herdy, Dan Lin, Ben Hull, Meng Lin\\
47       \affaddr{School of Computing Science}\\
48       \affaddr{Simon Fraser University}\\
49       \affaddr{Surrey, British Columbia}\\
50       \email{\{cameron,shermer,ashrira,ksherdy,lindanl,bhull,linmengl\}@sfu.ca}
51}
52
53
54\date{}
55\maketitle
56
57\thispagestyle{empty}
58
59
60\begin{abstract}
61\input{abstract}
62\end{abstract}
63\category{D.1.3}{Programming Techniques}{Concurrent Programming}[parallel programming]
64\category{F.2.2}{Theory of Computation}{Nonnumerical Algorithms and Problems}[pattern matching]
65\keywords{regular expression matching, parallel bit streams}
66\section{Introduction}
67
68The use of regular expressions to search texts
69for patterns has a long history and
70remains an important technique.
71% {\em a brief history}
72%The origins of regular expression matching date back to automata theory
73%studied by Kleene in the 1950s \cite{kleene1951}.
74Thompson \cite{thompson1968} is credited with the
75first construction to convert regular expressions
76to nondeterministic finite automata (NFA).
77Following Thompson's approach,
78a regular expression of length $m$ is converted
79to an NFA with $O(m)$ states. Using the
80NFA it is possible to search a text
81of length $n$ in $O(mn)$ time.
82Frequently, a more efficient choice
83is to convert an NFA into a deterministic finite automata (DFA).
84A DFA maintains a single active state
85throughout the matching process and
86hence, using a DFA it is possible to
87search a text of length $n$ in $O(n)$ time\footnote{It is well
88known that the conversion of an NFA to an equivalent DFA may result
89in \textit{state explosion}, i.e., the number of resultant DFA states may increase exponentially.}.
90
91A significant proportion of the research
92in fast regular expression matching can be
93regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
94In \cite{baeza1992new}, Baeza-Yates and Gonnet
95presented a new approach to string search based on bit-level parallelism.
96This technique takes advantage of the intrinsic parallelism of bitwise operations
97within a computer word. Thus,
98given a $w$-bit word, the number of operations that a string search algorithms
99performs can be reduced by a factor $w$.
100Building on this observation, the Shift-Or algorithm simulates an NFA using
101bitwise operations and achieves $O(\frac{nm}{w})$ time in the worst-case \cite{navarro2000}.
102A disadvantage of the Shift-Or approach
103is an inability to skip input characters.
104Simple string matching algorithms,
105such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical},
106skip input characters
107to achieve sublinear times in the average case.
108% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
109% based on suffix automata are able to skip characters.
110The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
111combines the advantages of the Shift-Or approach
112with the ability to skip characters. %character skipping property of BDM algorithms.
113The nrgrep tool is based on the BNDM algorithm. It is generally
114considered
115the fastest grep tool
116for matching complex patterns, and achieves similar performance
117to the fastest existing string
118matching tools for simple patterns \cite{navarro2000}.
119
120Recently, there has been considerable
121interest in the use of
122parallel hardware such as multicore processors (CPUs),
123 graphics processing units (GPUs),
124field-programmable gate arrays (FPGAs),
125or specialized architectures such as
126the Cell Broadband Engine (Cell BE) to
127accelerate regular expression matching. % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
128Generally, speedups are achieved by
129using parallel hardware features to improve the throughput of
130multiple instances of a matching problem at a time,
131i.e., by matching against sets of patterns or multiple input streams.
132In contrast, our approach uses parallelism to accelerate the throughput of
133a single problem instance, i.e., a single regular expression
134matched against a single input stream.
135
136%\paragraph{Multicore}
137In related work targeting multicore hardware,
138Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
139text processing algorithms that exhibit irregular memory access patterns
140can be efficiently executed.
141Pasetto et al \cite{pasetto2010} presented a flexible tool that
142performs small-ruleset regular expression matching at a rate of
1432.88 Gbps per chip on Intel Xeon E5472 hardware.
144Naghmouchi et al \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
145string matching algorithm \cite{aho1975} is well-suited for parallel
146implementation on multicore CPUs, GPUs and the Cell BE.
147% On each system, both thread-level parallelism (cores) and data-level parallelism
148% (SIMD units) were leveraged for performance.
149Salapura et al \cite{salapura2012accelerating} advocated
150the use of vector-style processing for regular expressions
151in business analytics applications and leveraged the SIMD hardware available
152on multicore processors to achieve a speedup of greater than 1.8 over a
153range of data sizes.
154%\paragraph{Cell Broadband Engine}
155On the Cell Broadband Engine, Scarpazza and Russell
156%presented a SIMD tokenizer
157%that delivered 1.00--1.78 Gbps on a single
158%chip \cite{scarpazza2008} and extended this approach for emulation on the Intel Larrabee
159%instruction set \cite{scarpazza2009larrabee}.
160%Furthermore,
161\cite{scarpazza2009cell}
162described a pattern matching
163implementation that delivered a throughput of 40
164Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
165Gbps for a larger dictionary containing thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
166presented a string matching implementation for automata that achieved
1674 Gbps on the Cell BE.
168On GPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
169implementation of the AC algorithm for
170accelerating string matching on GPUs. Lin et al., proposed
171the Parallel Failureless Aho-Corasick (PFAC)
172algorithm to accelerate pattern matching on GPU hardware and
173achieved 143 Gbps raw data throughput,
174although system throughput was limited to 15 Gbps
175\cite{lin2013accelerating}.
176Most recently, Mytkowicz et al \cite{DBLP:conf/asplos/MytkowiczMS14} 
177have developed a method for combining
178SIMD parallelism and data parallelism on multicore hardware.
179Of each of these related works, this approach stands
180out since it also focuses on the acceleration
181of matching against a single input stream.
182
183Whereas the existing approaches to parallelization have been
184based on adapting traditional sequential algorithms to emergent
185parallel architectures, we introduce both a new algorithmic
186approach and its implementation on SIMD and GPU architectures.
187This approach relies on a bitwise data parallel view of text
188streams as well as a surprising use of addition to match
189runs of characters in a single step.  The closest previous
190work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
191technology together with a parallel scanning primitive also
192based on addition \cite{cameron2011parallel}.   
193However, in contrast to the deterministic, longest-match
194scanning associated with the ScanThru primitive of that
195work, we introduce here a new primitive MatchStar
196that can be used in full generality for nondeterministic
197regular expression matching.   We also introduce a long-stream
198addition technique involving a further application of MatchStar
199that enables us to scale the technique to $n$-bit addition
200in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
201for example, to perform
202synchronized 4096-bit addition on GPU wavefronts of 64 threads.
203
204There is also a strong keyword match between the bit-parallel
205data streams used in our approach and the bit-parallelism
206used for NFA state transitions in the classical algorithms of
207Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new}
208and Navarro and Raffinot \cite{navarro1998bit}.
209However those algorithms use bit-parallelism in a fundamentally
210different way: representing all possible current NFA states
211as a bit vector and performing parallel transitions to a new
212set of states using table lookups and bitwise logic.    Whereas
213our approach can match multiple characters per step, bit-parallel
214NFA algorithms proceed through the input one byte at a time.
215Nevertheless, the agrep \cite{wu1992agrep} and
216nrgrep \cite{navarro2000} programs implemented using these techniques remain
217among the strongest competitors in regular expression matching
218performance, so we include them in our comparative evaluation.
219
220The remainder of this paper is organized as follows.
221Section \ref{sec:grep} briefly describes regular expression
222notation and the grep problem.
223Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
224primitive using a model of arbitrary-length bit-parallel data streams.
225Section \ref{sec:blockwise} discusses the block-by-block
226implementation of our techniques including the long stream
227addition techniques for 256-bit addition with AVX2 and
2284096-bit additions with GPU SIMT.
229Section \ref{sec:SSE2} describes our overall SSE2 implementation
230and carries out a performance study in comparison with
231existing grep implementations.
232Given the dramatic variation in grep performance across
233different implementation techniques, expressions and data sets,
234Section \ref{sec:analysis} considers a comparison between
235the bit-stream and NFA approaches from a theoretical perspective.
236Section \ref{sec:AVX2} then examines and demonstrates
237the scalability of our
238bitwise data-parallel approach in moving from
239128-bit to 256-bit SIMD on Intel Haswell architecture.
240To further investigate scalability, Section \ref{sec:GPU} 
241addresses the implementation of our matcher using
242groups of 64 threads working together SIMT-style on a GPU system.
243Section \ref{sec:Concl} concludes the paper with a discussion of results and
244areas for future work.
245
246\section{Regular Expressions and Grep}\label{sec:grep}
247
248We follow common POSIX notation for regular expressions.
249A regular expression specifies a set of strings through
250a pattern notation.   Individual characters normally
251stand for themselves, unless they are one of the
252special characters \verb:*+?[{\(|^$.: that serve as metacharacters
253of the notation system.  Thus the regular expression \verb:cat:
254is a pattern for the set consisting of the single 3-character
255string ``\verb:cat:''.   The special characters must be escaped
256with a backslash to prevent interpretation as metacharacter, thus
257\verb:\$: represents the dollar-sign and \verb:\\\\: represent
258the string consisting of two backslash characters.
259Character class bracket expressions are pattern elements
260that allow any character in a given class to be used in a particular
261context.  For example, \verb:[@#%]: is a regular expression
262that stands for any of the three given symbols.  Contiguous
263ranges of characters may be specified using hyphens;
264for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
265for any alphanumeric character or underscore.  If the
266caret character immediately follows the opening bracket,
267the class is negated, thus \verb:[^0-9]: stands for
268any character except a digit.  The period metacharacter
269\verb:.: stands for the class of all characters.
270
271Consecutive pattern elements stand for strings formed by
272concatenation, thus \verb:[cd][ao][tg]: stands for the
273set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
274The alternation operator \verb:|: allows a pattern to be
275defined to have two alternative forms, thus \verb:cat|dog:
276matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
277takes precedence over alternation, but parenthesis may be
278used to change this, thus \verb:(ab|cd)[0-9]: stands for any
279digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
280
281Repetition operators may be appended to a pattern to specify
282a variable number of occurrences of that pattern. 
283The Kleene star operator \verb:*: specifies zero-or-more occurrences
284matching the previous pattern, while Kleene plus \verb:+: specifies
285one-or-more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
286both specify the same set: strings of at least one lower-case
287letter.  The postfix operator \verb:?: specifies an optional
288component, i.e., zero-or-one occurrence of strings matching
289the element.  Specific bounds may be given within braces:
290\verb:(ab){3}: specifies the string ``\verb:ababab:'',
291\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
292or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
293strings of at least 4 consecutive capital letters.
294
295The grep program searches a file for lines containing matches
296to a regular expression using any of the above notations.
297In addition, the pattern elements \verb:^: and \verb:$:
298may be used to match respectively the beginning or the
299end of a line.  In line-based tools such as grep, \verb:.:
300matches any character except newlines; matches cannot extend
301over lines.
302Normally, grep prints all matching
303lines to its output.  However, grep programs typically
304allow a command line flag such as \verb:-c: to specify
305that only a count of matching lines be produced; we use
306this option in our experimental evaluation to focus
307our comparisons on the performance of the underlying matching
308algorithms.
309
310\section{Bit-Parallel Data Streams}\label{sec:bitwise}
311
312Whereas the traditional approaches to regular expression matching
313using NFAs, DFAs or backtracking all rely on a byte-at-a-time
314processing model, the approach  we introduce in this paper is based
315on quite a different concept:  a data-parallel approach to simultaneous
316processing of data stream elements.  Indeed, our most abstract model
317is that of unbounded data parallelism: processing all elements of
318the input data stream simultaneously.   In essence, data streams are viewed
319as (very large) integers.   The fundamental operations are bitwise
320logic, stream shifting and long-stream addition.
321
322Depending on the available parallel processing resources, an actual
323implementation may divide an input stream into blocks  and process
324the blocks sequentially.   Within each block  all elements of the
325input stream are processed together, relying on the availability of
326bitwise logic and addition scaled to the block size.   On commodity
327Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
328we typically process input streams 128 bytes at a time.   
329In this
330case, we rely on the Parabix tool chain \cite{lin2012parabix}
331to handle the details of compilation to block-by-block processing.
332On the
333latest processors supporting the 256-bit AVX2 SIMD operations,
334we also use the Parabix tool chain, but substitute a parallelized
335long-stream addition technique to avoid the sequential chaining
336of 4 64-bit additions.
337Our GPU implementation uses scripts to modify the output
338of the Parabix tools, effectively dividing the input into blocks
339of 4K bytes.   
340We also have adapted our long-stream addition technique
341to perform 4096-bit additions using 64 threads working in lock-step
342SIMT fashion. 
343 
344\begin{figure}[tbh]
345\begin{center}
346\begin{tabular}{cr}\\
347input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
348$B_7$ & \verb`.................................`\\
349$B_6$ & \verb`1...1..1.1..11..1.....1..11..1...`\\
350$B_5$ & \verb`111111111111111111111111111111111`\\
351$B_4$ & \verb`.1111...11...1...111111....1111..`\\
352$B_3$ & \verb`....111..111.111...1.1111....1.11`\\
353$B_2$ & \verb`.11..11...11..11....1..11.....111`\\
354$B_1$ & \verb`...11..111...1....1...1..1.1111..`\\
355$B_0$ & \verb`1.11.11.1.111.1111.1.1.1111...111`\\
356\verb:[a]: & \verb`1...........1...1.........1......`\\
357\verb:[z9]: & \verb`....1....1...1.....1.11......1...`\\
358\verb:[0-9]: & \verb`.111....1........11111.....11.1..`
359\end{tabular}
360
361\end{center}
362\caption{Basis and Character Class Streams}
363\label{fig:streams}
364\end{figure}
365
366A key concept in this streaming approach is the derivation of bit streams
367that are parallel to the input data stream, i.e., in one-to-one
368correspondence with the data element positions of the input
369streams.   Typically, the input stream is a byte stream comprising
370the 8-bit character code units of a particular encoding such
371as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
372easily be used with wider code units such as the 16-bit code units of
373UTF-16.   In the case of a byte stream, the first step is to transpose
374the byte stream into eight parallel bit streams, such that bit stream
375$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
376a set of basis bit streams from which many other parallel bit
377streams can be calculated, such as character class bit
378streams such that each bit $j$ of the stream specifies
379whether character $j$ of the input stream is in the class
380or not.  Figure \ref{fig:streams} shows an example of an
381input byte stream in ASCII, the eight basis bit streams of the
382transposed representation, and the character class bit streams
383\verb:[a]:,
384\verb:[z9]:, and
385\verb:[0-9]:
386that may be computed from the basis bit streams using bitwise logic.
387Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
388Transposition and character class construction are straightforward
389using the Parabix tool chain \cite{lin2012parabix}.
390
391\begin{figure}[tbh]
392\begin{center}
393\begin{tabular}{cr}\\
394input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
395$M_1$ & \verb`.1...........1...1.........1.....`\\
396$M_2$ & \verb`.1111........1...111111....111...`\\
397$M_3$ & \verb`.....1........1.....1.11......1..`
398\end{tabular}
399
400\end{center}
401\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
402\label{fig:streams2}
403\end{figure}
404
405\paragraph*{Marker Streams}  Now consider how bit-parallel data
406streams can be used in regular expression matching.   Consider
407the problem of searching the input stream of Figure \ref{fig:streams}
408to finding occurrence of strings matching
409the regular expression \verb:a[0-9]*[z9]:. 
410Note that this is an ambiguous regular expression, which could match
411texts such as \verb:a12949z: in multiple ways.
412The matching process involves the concept of {\em marker streams}, that
413is streams that mark the positions of current matches during the
414overall process.  In this case there are three marker streams computed
415during the match process, namely,
416$M_1$ representing match positions after an initial \verb:a:
417character has been found, $M_2$ representing positions
418reachable from positions marked by $M_1$ by further matching zero or
419more digits (\verb:[0-9]*:) and finally $M_3$ the stream
420marking positions after a final \verb:z: or \verb:9: has been found.
421Without describing the details of how these streams are computed
422for the time being, Figure \ref{fig:streams2} shows what each
423of these streams should be for our example matching problem.
424Our convention that a marker stream contains a 1 bit
425at the next character position to be matched, that is,
426immediately past the last position that was matched.
427Note that all three matches from the third occurrence of \verb:a:
428are correctly marked in $M_3$.
429
430
431\paragraph*{MatchStar}  MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
432
433\begin{figure}[tbh]
434\begin{center}
435\begin{tabular}{cr}\\
436input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
437$M_1$ & \verb`.1...........1...1.........1.....`\\
438$C = \text{\tt [0-9]}$ & \verb`.111....1........11111.....11.1..`\\
439$T_0 = M_1 \wedge C$ & \verb`.1...............1.........1.....`\\
440$T_1 = T_0 + C$ & \verb`....1...1.............1......11..`\\
441$T_2 = T_1 \oplus C$ & \verb`.1111............111111....111...`\\
442$M_2 = T_2 \vee M_1$ & \verb`.1111........1...111111....111...`
443\end{tabular}
444
445\end{center}
446\caption{$M_2 = \text{MatchStar}(M_1, C)$}
447\label{fig:matchstar}
448\end{figure}
449
450
451Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
452it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
453conventional presentation of our character data input.   However, this reverses the normal
454order of presentation when considering bitstreams as numeric values.  The key point here is
455that when we perform bitstream addition, we will show bit movement from left-to-right.
456For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
457
458The first row of the figure is the input data,
459the second and third rows are the input bitstreams: the initial marker position bitstream and the
460character class bitstream for digits derived from input data. 
461
462In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
463The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
464output marker stream is obtained by ORing $T_2$ with the initial marker stream.
465
466In general, given a marker stream $M$ and a character class stream $C$,
467the operation of MatchStar is defined by the following equation. 
468\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
469Given a set of initial marker positions, the result stream marks
470all possible positions that can be reached by 0 or more occurrences
471of characters in class $C$ from each position in $M$
472
473MatchStar differs from ScanThru of the Parabix tool chain in that it
474finds all matches, not just the longest match.  This is necessary
475for general matching involving possibly ambiguous regular
476expressions.   
477
478\input{compilation}
479
480\input{re-Unicode}
481
482\section{Block-at-a-Time Processing}\label{sec:blockwise}
483
484The unbounded stream model of the previous section must of course
485be translated an implementation that proceeds block-at-a-time for
486realistic application.  In this, we primarily rely on the Pablo
487compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
488statements expressed as arbitrary-length bitstream equations, Pablo
489produces block-at-a-time C++ code that initializes and maintains all the necessary
490carry bits for each of the additions and shifts involved in the
491bitstream calculations.   
492
493In the present work, our principal contribution to the Parabix tool
494chain is to incorporate the technique of long-stream addition described below.
495Otherwise, we were able to use Pablo directly in compiling our
496SSE2 and AVX2 implementations.   Our GPU implementation required
497some scripting to modify the output of the Pablo compiler for our
498purpose.
499
500\paragraph*{Long-Stream Addition}  The maximum word size for
501addition on commodity processors is typically 64 bits.  In order
502to implement long-stream addition for block sizes of 256 or larger,
503a method for propagating carries through the individual stages of
50464-bit addition is required.  However, the normal technique of
505sequential addition using add-with-carry instructions, for example,
506is far from ideal.
507
508We use the following general model using SIMD methods for constant-time
509long-stream addition up to 4096 bits.   Related GPU solutions have been
510independently developed\cite{Crovella2012},
511however our model is intended to be a more broadly applicable abstraction.
512We assume the availability of the following SIMD/SIMT operations
513operating on vectors of $f$ 64-bit fields.
514\begin{itemize}
515\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
516in two vectors to produce a result vector of $f$ 64-bit fields.
517\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
518of \verb:x: each with the constant value -1 (all bits 1), producing
519an $f$-bit mask value,
520\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
521field into a single compressed $f$-bit mask value, and
522\item normal bitwise logic operations on $f$-bit masks, and
523\item  \verb#simd<64>::spread(X)#: distributing the bits of
524an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector.
525\end{itemize}
526
527Here, the \verb#hsimd<64>::mask(X)# and
528\verb#simd<64>::spread(X)# model the minimum
529communication requirements between the parallel processing units
530(SIMD lanes or SIMT processors).    In essence, we just need
531the ability to quickly send and receive 1 bit of information
532per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
533gathers 1 bit from each of the processors to a central resource.
534After calculations on the gather bits are performed, we then
535just need an operation to invert the communication, i.e.,
536sending 1 bit each from the central processor to each of
537the parallel units.   There are a variety of ways in which
538these facilities may be implemented depending on the
539underlying architecture; details of our AVX2 and GPU implementations
540are presented later.   
541
542Given these operations, our method for long stream addition of
543two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
544\begin{enumerate}
545\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
546\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
547
548\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
549\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
550\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
551\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
552
553\item Compute an $f$-bit mask of carries generated for each of the
55464-bit additions of \verb:X: and \verb:Y:.
555\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
556
557\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
558an incoming carry bit.  This is called the {\em bubble mask}.
559\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
560
561\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
562incremented to produce the final sum.  Here we find a new application of
563MatchStar.
564\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
565
566This is the key step.  The mask {\tt c} of outgoing carries must be
567shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
568with the next digit.  At the incoming position, the carry will
569increment the 64-bit digit.   However, if this digit is all ones (as
570signaled by the corresponding bit of bubble mask {\tt b}, then the addition
571will generate another carry.  In fact, if there is a sequence of
572digits that are all ones, then the carry must bubble through
573each of them.   This is just MatchStar.
574
575\item Compute the final result {\tt Z}.
576\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
577
578\end{enumerate}
579\begin{figure}
580\begin{center}
581\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
582{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
583{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
584{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
585{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
586{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
587{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
588{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
589{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
590{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
591{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
592{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
593\end{tabular}
594\end{center}
595\caption{Long Stream Addition}\label{fig:longadd}
596\end{figure}
597
598Figure \ref{fig:longadd} illustrates the process.  In the figure,
599we illustrate the process with 8-bit fields rather than 64-bit fields
600and show all field values in hexadecimal notation.  Note that
601two of the individual 8-bit additions produce carries, while two
602others produce {\tt FF} values that generate bubble bits.  The
603net result is that four of the original 8-bit sums must be
604incremented to produce the long stream result.
605
606A slight extension to the process produces a long-stream full adder
607that can be used in chained addition.  In this case, the
608adder must take an additional carry-in bit
609{\tt p} and produce a carry-out bit {\tt q}.
610This may be accomplished by incorporating {\tt p}
611in calculating the increment mask in the low bit position,
612and then extracting the carry-out {\tt q} from the high bit position.
613\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
614\[\text{\tt q} = \text{\tt i >}\text{\tt > f}\]
615
616As described subsequently, we use a two-level long-stream addition technique
617in both our AVX2 and GPU implementations.  In principle, one can extend
618the technique to additional levels.  Using 64-bit adders throughout,
619$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
620A three-level scheme could coordinate
62164 groups each performing 4096-bit long additions in a two-level structure.
622However, whether there are reasonable architectures that can support fine-grained
623SIMT style at this level is an open question.
624
625Using the methods outlined, it is quite conceivable that instruction
626set extensions to support long-stream addition could be added for
627future SIMD and GPU processors.   Given the fundamental nature
628of addition as a primitive and its particular application to regular
629expression matching as shown herein, it seems reasonable to expect
630such instructions to become available.    Alternatively, it may
631be worthwhile to simply ensure that the \verb#hmask# and
632\verb#spread# operations are efficiently supported.
633
634
635\input{sse2}
636
637\input{analysis}
638
639\input{avx2}
640
641
642
643\section{GPU Implementation}\label{sec:GPU}
644
645To further assess the scalability of our regular expression matching
646using bit-parallel data streams, we implemented a GPU version
647in OpenCL.   
648We arranged for 64 work groups each having 64 threads.
649The size of work group and number of work groups is chosen
650to provide the best occupancy as calculated by the AMD App Profiler.
651Input files are divided in data parallel fashion among
652the 64 work groups.  Each work group carries out the regular
653expression matching operations 4096 bytes at a time using SIMT
654processing.   Although the GPU
655does not directly support the mask and spread operations required
656by our long-stream addition model,
657we are able to simulate them using shared memory.
658Each thread maintains
659its own carry and bubble values in shared memory and performs
660synchronized updates with the other threads using a six-step
661parallel-prefix style process.  Others have implemented
662long-stream addition on the GPU using similar techniques,
663as noted previously.
664
665We performed our test on an AMD Radeon HD A10-6800K APU machine.
666On the AMD Fusion systems, the input buffer is allocated in
667pinned memory to take advantage of the zero-copy memory regions
668where data can be read directly into this region by the CPU
669and also accessed by the GPU for further processing. Therefore,
670the expensive data transferring time that is needed by traditional
671discrete GPUs is hidden and we compare only the kernel execution
672time with our SSE2 and AVX implementations as shown in Figure
673\ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
674improvement over SSE version and up to 40\% performance
675improvement over AVX version.   However, because of
676implementation complexities of the triply-nested while loop for
677the StarHeight expression, it has been omitted.
678
679Although we intended to process
68064 work groups with 4096 bytes each at a time rather than 128 bytes
681at a time on SSE or 256 bytes at a time on AVX, the performance
682improvement is less than 60\%. The first reason is hardware
683limitations. Our kernel occupancy is limited by register usage
684and not all the work groups can be scheduled at the same time.
685The second reason is that the long-stream addition implemented
686on GPU is more expensive than the implementations on SSE or AVX.
687Another important reason is the control flow. When a possible
688match is found in one thread, the rest of the threads in the
689same work group have to execute the same instructions for
690further processing rather than jump to the next block with a
691simple IF test. Therefore, the performance of different
692regular expressions is dependent on the number of
693long-stream addition operations and the total number of matches
694of a given input.   Perhaps surprisingly, the overhead of the Parabix
695transformation was not a dominant factor, coming in at 0.08 ms/MB.
696
697
698\begin{figure}
699\begin{center}
700\begin{tikzpicture}
701\begin{axis}[
702xtick=data,
703ylabel=Running Time (ms per megabyte),
704xticklabels={@,Date,Email,URI,Hex,StarHeight},
705tick label style={font=\tiny},
706enlarge x limits=0.15,
707%enlarge y limits={0.15, upper},
708ymin=0,
709legend style={at={(0.5,-0.15)},
710anchor=north,legend columns=-1},
711ybar,
712bar width=7pt,
713]
714\addplot
715file {data/ssetime.dat};
716\addplot
717file {data/avxtime.dat};
718\addplot
719file {data/gputime.dat};
720
721\legend{SSE2,AVX2,GPU,Annot}
722\end{axis}
723\end{tikzpicture}
724\end{center}
725\caption{Running Time}\label{fig:SSE-AVX-GPU}
726\end{figure}
727
728
729
730
731
732
733
734
735\input{conclusion}
736
737
738
739%\appendix
740%\section{Appendix Title}
741
742%This is the text of the appendix, if you need one.
743
744
745This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
746MITACS, Inc.
747
748\bibliographystyle{IEEEtranS}
749\bibliography{reference}
750
751\end{document}
752
753
Note: See TracBrowser for help on using the repository browser.