source: docs/Working/re/pact051-cameron.tex @ 3891

Last change on this file since 3891 was 3891, checked in by cameron, 5 years ago

compiler category; ashriram

\documentclass{sig-alternate-2013}

\newfont{\mycrnotice}{ptmr8t at 7pt}
\newfont{\myconfname}{ptmri8t at 7pt}
\let\crnotice\mycrnotice%
\let\confname\myconfname%

\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.}
\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada. \\
{\mycrnotice{Copyright is held by the owner/author(s). Publication rights licensed to ACM.}}}
\copyrightetc{ACM \the\acmcopyr}
\crdata{978-1-4503-2809-8/14/08\ ...\$15.00.\\
http://dx.doi.org/10.1145/2628071.2628079}

\clubpenalty=10000
\widowpenalty = 10000
\usepackage[normalem]{ulem}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{multirow}
\usepackage{pgfplots}

\begin{document}

\title{Bitwise Data Parallelism in Regular Expression Matching}

\numberofauthors{1} %  in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
%\alignauthor
Rob Cameron, Tom Shermer, Arrvindh Shriraman, Ken Herdy, Dan Lin, Ben Hull, Meng Lin\\
       \affaddr{School of Computing Science}\\
       \affaddr{Simon Fraser University}\\
       \affaddr{Surrey, British Columbia}\\
       \email{\{cameron,shermer,ashriram,ksherdy,lindanl,bhull,linmengl\}@sfu.ca}
}


\date{}
\maketitle

\thispagestyle{empty}

\begin{abstract}
\input{abstract}
\end{abstract}
\category{D.1.3}{Programming Techniques}{Concurrent Programming}[parallel programming]
\category{D.3.3}{Programming Languages}{Processors}[compilers]
\category{F.2.2}{Analysis of Algorithms and Problem Complexity}{Nonnumerical Algorithms and Problems}[pattern matching]
\keywords{regular expression matching, parallel bit streams}
\section{Introduction}

The use of regular expressions to search texts
for patterns has a long history and
remains an important technique.
% {\em a brief history}
%The origins of regular expression matching date back to automata theory
%studied by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the
first construction to convert regular expressions
to nondeterministic finite automata (NFAs).
Following Thompson's approach,
a regular expression of length $m$ is converted
to an NFA with $O(m)$ states.  Using the
NFA it is possible to search a text
of length $n$ in $O(mn)$ time.
Frequently, a more efficient choice
is to convert an NFA into a deterministic finite automaton (DFA).
A DFA maintains a single active state
throughout the matching process and,
hence, using a DFA it is possible to
search a text of length $n$ in $O(n)$ time\footnote{It is well
known that the conversion of an NFA to an equivalent DFA may result
in \textit{state explosion}, i.e., the number of resultant DFA states may increase exponentially.}.

A significant proportion of the research
in fast regular expression matching can be
regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet
presented a new approach to string search based on bit-level parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word.  Thus,
given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
Building on this observation, the Shift-Or algorithm simulates an NFA using
bitwise operations and achieves $O(\frac{nm}{w})$ time in the worst case \cite{navarro2000}.
A disadvantage of the Shift-Or approach
is an inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical},
skip input characters
to achieve sublinear times in the average case.
% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
% based on suffix automata are able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the advantages of the Shift-Or approach
with the ability to skip characters. %character skipping property of BDM algorithms.
The nrgrep tool is based on the BNDM algorithm.  It is generally
considered
the fastest grep tool
for matching complex patterns, and achieves similar performance
to the fastest existing string
matching tools for simple patterns \cite{navarro2000}.
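For concreteness, the Shift-Or simulation just described can be sketched in a few lines of Python (an illustrative model only, not the authors' implementation; Python's unbounded integers stand in for $w$-bit machine words, and the function name {\tt shift\_or\_search} is ours):

```python
def shift_or_search(pattern, text):
    """Shift-Or: simulate an NFA for exact matching of `pattern`,
    one bitwise step per text character (0 bit = active state)."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c] has a 0 bit at position i wherever pattern[i] == c.
    B = {c: all_ones for c in set(text) | set(pattern)}
    for i, p in enumerate(pattern):
        B[p] &= ~(1 << i)
    D = all_ones                      # no partial matches yet
    matches = []
    for j, c in enumerate(text):
        D = ((D << 1) | B[c]) & all_ones
        if D & (1 << (m - 1)) == 0:   # final state active: a match ends at j
            matches.append(j - m + 1) # record its start position
    return matches

print(shift_or_search("ab", "xabab"))  # → [1, 3]
```

Each text character costs a constant number of word operations, which is the source of the $O(\frac{nm}{w})$ bound when $m > w$ and words must be chained.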

Recently, there has been considerable
interest in the use of
parallel hardware such as multicore processors (CPUs),
graphics processing units (GPUs),
field-programmable gate arrays (FPGAs),
or specialized architectures such as
the Cell Broadband Engine (Cell BE) to
accelerate regular expression matching. % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
Generally, speedups are achieved by
using parallel hardware features to improve the throughput of
multiple instances of a matching problem at a time,
i.e., by matching against sets of patterns or multiple input streams.
In contrast, our approach uses parallelism to accelerate the throughput of
a single problem instance, i.e., a single regular expression
matched against a single input stream.

%\paragraph{Multicore}
In related work targeting multicore hardware,
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed.
Pasetto et al.\ \cite{pasetto2010} presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware.
Naghmouchi et al.\ \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well-suited for parallel
implementation on multicore CPUs, GPUs and the Cell BE.
% On each system, both thread-level parallelism (cores) and data-level parallelism
% (SIMD units) were leveraged for performance.
Salapura et al.\ \cite{salapura2012accelerating} advocated
the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multicore processors to achieve a speedup of greater than 1.8 over a
range of data sizes.
%\paragraph{Cell Broadband Engine}
On the Cell Broadband Engine, Scarpazza and Russell
%presented a SIMD tokenizer
%that delivered 1.00--1.78 Gbps on a single
%chip \cite{scarpazza2008} and extended this approach for emulation on the Intel Larrabee
%instruction set \cite{scarpazza2009larrabee}.
%Furthermore,
\cite{scarpazza2009cell}
described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
Gbps for a larger dictionary containing thousands of patterns.  Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieved
4 Gbps on the Cell BE.
On GPUs, Tumeo et al.\ \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching.  Lin et al.\ proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps raw data throughput,
although system throughput was limited to 15 Gbps
\cite{lin2013accelerating}.
Most recently, Mytkowicz et al.\ \cite{DBLP:conf/asplos/MytkowiczMS14}
have developed a method for combining
SIMD parallelism and data parallelism on multicore hardware.
Among these related works, this last approach stands
out, since it also focuses on the acceleration
of matching against a single input stream.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.  We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.  We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU wavefronts of 64 threads.

There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.  Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
primitive using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques, including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit additions with GPU SIMT.
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on the Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expressions and Grep}\label{sec:grep}

We follow common POSIX notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.  Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.  The special characters must be escaped
with a backslash to prevent interpretation as a metacharacter; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example, \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated; thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation; thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms; thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this; thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene star operator \verb:*: specifies zero or more occurrences
of the preceding pattern element, while the Kleene plus \verb:+: specifies
one or more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero or one occurrence of strings matching
the element.  Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.
295
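The operators above can be exercised directly in practical engines; as a quick illustration, Python's {\tt re} module (whose syntax for these particular constructs coincides with the POSIX forms shown here) accepts each of the examples:

```python
import re

# Character classes, concatenation, alternation and grouping:
assert re.fullmatch(r"[cd][ao][tg]", "dog")
assert re.fullmatch(r"cat|dog", "cat")
assert re.fullmatch(r"(ab|cd)[0-9]", "cd7")
assert not re.fullmatch(r"[^0-9]", "5")           # negated class

# Repetition operators:
assert re.fullmatch(r"[a-z]+", "hello")           # one or more
assert re.fullmatch(r"(ab){3}", "ababab")         # exactly three
assert re.fullmatch(r"[0-9A-Fa-f]{2,4}", "BEEF")  # two to four hex digits
assert not re.fullmatch(r"[A-Z]{4,}", "ABC")      # needs at least four
```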
The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match respectively the beginning or the
end of a line.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.

\section{Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.  In essence, data streams are viewed
as (very large) integers.  The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.  Within each block, all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.  On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.
In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
On the
latest processors supporting the 256-bit AVX2 SIMD operations,
we also use the Parabix tool chain, but substitute a parallelized
long-stream addition technique to avoid the sequential chaining
of four 64-bit additions.
Our GPU implementation uses scripts to modify the output
of the Parabix tools, effectively dividing the input into blocks
of 4K bytes.
We also have adapted our long-stream addition technique
to perform 4096-bit additions using 64 threads working in lock-step
SIMT fashion.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
$B_7$ & \verb`.................................`\\
$B_6$ & \verb`1...1..1.1..11..1.....1..11..1...`\\
$B_5$ & \verb`111111111111111111111111111111111`\\
$B_4$ & \verb`.1111...11...1...111111....1111..`\\
$B_3$ & \verb`....111..111.111...1.1111....1.11`\\
$B_2$ & \verb`.11..11...11..11....1..11.....111`\\
$B_1$ & \verb`...11..111...1....1...1..1.1111..`\\
$B_0$ & \verb`1.11.11.1.111.1111.1.1.1111...111`\\
\verb:[a]: & \verb`1...........1...1.........1......`\\
\verb:[z9]: & \verb`....1....1...1.....1.11......1...`\\
\verb:[0-9]: & \verb`.111....1........11111.....11.1..`
\end{tabular}

\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.  Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.  However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.  In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.  These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams, in which bit $j$ of the stream specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z9]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.

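The transposition and character class steps can be modelled concisely in Python (a model sketch only, not Parabix code; streams are unbounded integers with bit $j$ standing for character position $j$, and the digit-class formula below is one possible formulation for ASCII):

```python
def transpose(data: bytes):
    """Transpose a byte stream into eight basis bit streams B[0]..B[7];
    bit j of B[i] is bit i of byte j."""
    B = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if (byte >> i) & 1:
                B[i] |= 1 << j
    return B

def digit_class(B):
    """[0-9] stream from the basis streams using bitwise logic only:
    high nibble 0x3 (b7 b6 b5 b4 = 0 0 1 1) and low nibble <= 9."""
    return B[5] & B[4] & ~B[7] & ~B[6] & ~(B[3] & (B[2] | B[1]))

data = b"a453z--b3z--az--a12949z--ca22z7--"
digits = digit_class(transpose(data))
assert digits == sum(1 << j for j, c in enumerate(data)
                     if c in b"0123456789")
```

The resulting integer has its one bits exactly at the digit positions of the input, matching the \verb:[0-9]: row of the figure.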
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
$M_1$ & \verb`.1...........1...1.........1.....`\\
$M_2$ & \verb`.1111........1...111111....111...`\\
$M_3$ & \verb`.....1........1.....1.11......1..`
\end{tabular}

\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams}  Now consider how bit-parallel data
streams can be used in regular expression matching.  Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*[z9]:.
Note that this is an ambiguous regular expression, which could match
texts such as \verb:a12949z: in multiple ways.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.  In this case there are three marker streams computed
during the match process, namely
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:) and finally $M_3$ the stream
marking positions after a final \verb:z: or \verb:9: has been found.
Without describing for the time being the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Our convention is that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.
Note that all three matches from the third occurrence of \verb:a:
are correctly marked in $M_3$.


\paragraph*{MatchStar}  MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
$M_1$ & \verb`.1...........1...1.........1.....`\\
$C = \text{\tt [0-9]}$ & \verb`.111....1........11111.....11.1..`\\
$T_0 = M_1 \wedge C$ & \verb`.1...............1.........1.....`\\
$T_1 = T_0 + C$ & \verb`....1...1.............1......11..`\\
$T_2 = T_1 \oplus C$ & \verb`.1111............111111....111...`\\
$M_2 = T_2 \vee M_1$ & \verb`.1111........1...111111....111...`
\end{tabular}

\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, C)$}
\label{fig:matchstar}
\end{figure}


Figure \ref{fig:matchstar} illustrates the MatchStar method.  In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order, reflecting the
conventional presentation of our character data input.  However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that, when we perform bitstream addition, we will show bit movement from left to right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions: immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained, and the undesired 1 bits are removed, by XORing with the character class stream.  $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.

MatchStar differs from ScanThru of the Parabix tool chain in that it
finds all matches, not just the longest match.  This is necessary
for general matching involving possibly ambiguous regular
expressions.

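Since the model treats bitstreams as unbounded integers, MatchStar and the marker streams of Figures \ref{fig:streams2} and \ref{fig:matchstar} can be checked directly in Python (an illustrative model, not the authors' code; bit $j$ represents character position $j$, so the figures' left-to-right order is low-to-high bit order and a left shift advances markers by one position):

```python
def match_star(M: int, C: int) -> int:
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M."""
    return (((M & C) + C) ^ C) | M

def to_stream(s: str) -> int:
    """Bitstream from the paper's notation: leftmost character is bit 0."""
    return sum(1 << j for j, c in enumerate(s) if c == '1')

# Character class streams for input a453z--b3z--az--a12949z--ca22z7--
a_cc     = to_stream("1...........1...1.........1......")
digit_cc = to_stream(".111....1........11111.....11.1..")
z9_cc    = to_stream("....1....1...1.....1.11......1...")

# Matching a[0-9]*[z9]: each marker sits one past the last matched position.
M1 = a_cc << 1                  # after an initial 'a'
M2 = match_star(M1, digit_cc)   # after zero or more digits
M3 = (M2 & z9_cc) << 1          # after a final 'z' or '9'

assert M2 == to_stream(".1111........1...111111....111...")
assert M3 == to_stream(".....1........1.....1.11......1..")
```

Note how all matches, including the three alternative matches from the third \verb:a:, are produced in a constant number of bitwise operations, independent of how many digit positions each marker traverses.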
\input{compilation}

\input{re-Unicode}

\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  For this, we primarily rely on the Pablo
compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the Parabix tool
chain is to incorporate the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.  Our GPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 bits or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential addition using add-with-carry instructions, for example,
is far from ideal.

We use the following general model using SIMD methods for constant-time
long-stream addition up to 4096 bits.  Related GPU solutions have been
independently developed \cite{Crovella2012};
however, our model is intended to be a more broadly applicable abstraction.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value $-1$ (all bits 1), producing
an $f$-bit mask value,
\item \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item \verb#simd<64>::spread(X)#: distributing the bits of
an $f$-bit mask, one bit each, to the $f$ 64-bit fields of a vector.
\end{itemize}

Here, the \verb#hsimd<64>::mask(X)# and
\verb#simd<64>::spread(X)# operations model the minimum
communication requirements between the parallel processing units
(SIMD lanes or SIMT processors).  In essence, we just need
the ability to quickly send and receive 1 bit of information
per parallel unit.  The \verb#hsimd<64>::mask(X)# operation
gathers 1 bit from each of the processors to a central resource.
After calculations on the gathered bits are performed, we then
just need an operation to invert the communication, i.e.,
sending 1 bit each from the central processor to each of
the parallel units.  There are a variety of ways in which
these facilities may be implemented depending on the
underlying architecture; details of our AVX2 and GPU implementations
are presented later.

Given these operations, our method for long-stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is called the {\em bubble mask}.
\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar.
\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.  However, if this digit is all ones (as
signaled by the corresponding bit of the bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.  This is just MatchStar.

\item Compute the final result {\tt Z}.
\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]

\end{enumerate}
580\begin{figure}
581\begin{center}
582\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
583{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
584{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
585{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
586{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
587{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
588{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
589{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
590{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
591{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
592{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 00} & {\tt 00} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
594\end{tabular}
595\end{center}
596\caption{Long Stream Addition}\label{fig:longadd}
597\end{figure}
598
Figure \ref{fig:longadd} illustrates the process, using 8-bit
fields rather than 64-bit fields and showing all field values
in hexadecimal notation.  Note that
602two of the individual 8-bit additions produce carries, while two
603others produce {\tt FF} values that generate bubble bits.  The
604net result is that four of the original 8-bit sums must be
605incremented to produce the long stream result.
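The four steps can also be traced in executable form.  The following
Python sketch simulates long-stream addition on $f$ fields of $w$ bits
packed into a single integer (the function names are illustrative, not
part of any implementation) and reproduces the values of
Figure \ref{fig:longadd}:

```python
def match_star(m, c):
    # MatchStar(M, C) = (((M & C) + C) ^ C) | M
    return (((m & c) + c) ^ c) | m

def long_add(x, y, f=8, w=8):
    """Add two (f*w)-bit values held as f independent w-bit fields,
    then fix up the inter-field carries using MatchStar."""
    fmask = (1 << w) - 1
    R = xh = yh = rh = b = 0
    for k in range(f):                      # step 1: f independent w-bit adds
        xf = (x >> (k * w)) & fmask
        yf = (y >> (k * w)) & fmask
        rf = (xf + yf) & fmask
        R |= rf << (k * w)
        xh |= (xf >> (w - 1)) << k          # hmask(X): high bit of each field
        yh |= (yf >> (w - 1)) << k          # hmask(Y)
        rh |= (rf >> (w - 1)) << k          # hmask(R)
        b |= (rf == fmask) << k             # step 2: bubble mask b
    # outgoing-carry mask: c = (x & y) | ((x | y) & ~r)
    c = (xh & yh) | ((xh | yh) & ~rh)
    # step 3: increment mask i = MatchStar(c*2, b)
    i = match_star((c * 2) & ((1 << f) - 1), b)
    # step 4: Z = R + spread(i), i.e. add 1 to each field flagged in i
    Z = R
    for k in range(f):
        if (i >> k) & 1:
            zf = ((R >> (k * w)) & fmask) + 1
            Z = (Z & ~(fmask << (k * w))) | ((zf & fmask) << (k * w))
    return Z
```

With $f = 8$ and $w = 8$ as in the figure, adding the {\tt X} and {\tt Y}
rows (0x1931BA4C3D4521F1 and 0x221245B3E2161736, most significant field
leftmost) yields 0x3B4400001F5B3927, the {\tt Z} row of the figure.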
606
607A slight extension to the process produces a long-stream full adder
608that can be used in chained addition.  In this case, the
609adder must take an additional carry-in bit
610{\tt p} and produce a carry-out bit {\tt q}.
611This may be accomplished by incorporating {\tt p}
612in calculating the increment mask in the low bit position,
613and then extracting the carry-out {\tt q} from the high bit position.
614\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
615\[\text{\tt q} = \text{\tt i >}\text{\tt > f}\]
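A self-contained Python sketch of this full adder (again with
illustrative names, and repeating the MatchStar helper so the fragment
stands alone) shows how two chained calls combine 64-bit long additions
into a 128-bit sum:

```python
def match_star(m, c):
    # MatchStar(M, C) = (((M & C) + C) ^ C) | M
    return (((m & c) + c) ^ c) | m

def long_full_add(x, y, p, f=8, w=8):
    """Long-stream FULL adder on f w-bit fields: takes a carry-in p
    and returns (sum, carry-out q)."""
    fmask = (1 << w) - 1
    R = xh = yh = rh = b = 0
    for k in range(f):
        xf = (x >> (k * w)) & fmask
        yf = (y >> (k * w)) & fmask
        rf = (xf + yf) & fmask
        R |= rf << (k * w)
        xh |= (xf >> (w - 1)) << k
        yh |= (yf >> (w - 1)) << k
        rh |= (rf >> (w - 1)) << k
        b |= (rf == fmask) << k
    c = (xh & yh) | ((xh | yh) & ~rh)
    i = match_star(c * 2 + p, b)     # carry-in p enters at the low bit
    q = (i >> f) & 1                 # carry-out q from bit position f
    Z = R
    for k in range(f):
        if (i >> k) & 1:
            zf = ((R >> (k * w)) & fmask) + 1
            Z = (Z & ~(fmask << (k * w))) | ((zf & fmask) << (k * w))
    return Z, q
```

Chaining two calls, with the carry-out of the low half fed in as the
carry-in of the high half, performs a 128-bit addition.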
616
617As described subsequently, we use a two-level long-stream addition technique
618in both our AVX2 and GPU implementations.  In principle, one can extend
619the technique to additional levels.  Using 64-bit adders throughout,
620$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
621A three-level scheme could coordinate
62264 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT-style coordination at this scale is an open question.
625
626Using the methods outlined, it is quite conceivable that instruction
627set extensions to support long-stream addition could be added for
628future SIMD and GPU processors.   Given the fundamental nature
629of addition as a primitive and its particular application to regular
630expression matching as shown herein, it seems reasonable to expect
631such instructions to become available.    Alternatively, it may
632be worthwhile to simply ensure that the \verb#hmask# and
633\verb#spread# operations are efficiently supported.
634
635
636\input{sse2}
637
638\input{analysis}
639
640\input{avx2}
641
642
643
644\section{GPU Implementation}\label{sec:GPU}
645
646To further assess the scalability of our regular expression matching
647using bit-parallel data streams, we implemented a GPU version
648in OpenCL.   
We arranged for 64 work groups, each comprising 64 threads.
The work-group size and the number of work groups were chosen
to provide the best occupancy as calculated by the AMD APP Profiler.
Input files are divided in a data-parallel fashion among
653the 64 work groups.  Each work group carries out the regular
654expression matching operations 4096 bytes at a time using SIMT
655processing.   Although the GPU
656does not directly support the mask and spread operations required
657by our long-stream addition model,
658we are able to simulate them using shared memory.
659Each thread maintains
660its own carry and bubble values in shared memory and performs
661synchronized updates with the other threads using a six-step
662parallel-prefix style process.  Others have implemented
663long-stream addition on the GPU using similar techniques,
664as noted previously.
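As an illustration of the idea (a sketch of ours, not the OpenCL kernel
itself), the following Python fragment treats each thread's bubble bit
as a carry-propagate flag and its carry bit as a carry-generate flag,
and resolves all incoming carries in $\log_2 n$ combining steps, six for
$n = 64$ threads:

```python
def resolve_carries(carry, bubble):
    """Given per-thread outgoing-carry (generate) and bubble
    (propagate) bits, return each thread's incoming carry after a
    log2(n)-step parallel-prefix combine -- 6 steps for n = 64."""
    n = len(carry)
    g, p = list(carry), list(bubble)
    d = 1
    while d < n:
        # each "thread" t combines with the partial result d places left;
        # both lists are rebuilt from the old values, modeling the
        # synchronized shared-memory update at each step
        g, p = ([g[t] | (p[t] & g[t - d]) if t >= d else g[t] for t in range(n)],
                [p[t] & p[t - d] if t >= d else p[t] for t in range(n)])
        d *= 2
    return [0] + g[:-1]   # thread t's carry-in is thread t-1's carry-out
```

Each iteration of the while loop corresponds to one synchronized
shared-memory update step in the kernel.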
665
We performed our tests on an AMD A10-6800K APU machine.
On AMD Fusion systems, the input buffer is allocated in
pinned memory to take advantage of zero-copy memory regions:
data can be read directly into this region by the CPU
and accessed in place by the GPU for further processing.  The
expensive data transfer required by traditional discrete GPUs
is thereby avoided, and we compare only the kernel execution
time with our SSE2 and AVX2 implementations, as shown in Figure
\ref{fig:SSE-AVX-GPU}. The GPU version gives up to a 55\%
performance improvement over the SSE2 version and up to a 40\%
performance improvement over the AVX2 version.   However, because
of the implementation complexity of the triply-nested while loop
for the StarHeight expression, that expression has been omitted
from the GPU measurements.
679
Although the GPU processes 4096 bytes at a time in each of its
64 work groups, compared with 128 bytes at a time under SSE2 and
256 bytes at a time under AVX2, the performance
improvement is less than 60\%. The first reason is hardware
limitations: our kernel occupancy is limited by register usage,
so not all of the work groups can be scheduled at the same time.
The second reason is that long-stream addition is more expensive
to implement on the GPU than on SSE2 or AVX2.
Another important reason is control flow. When a possible
match is found in one thread, the remaining threads in the
same work group must execute the same instructions for
further processing, rather than jumping to the next block with a
simple if test. Therefore, the performance on different
regular expressions depends on the number of
long-stream addition operations and on the total number of matches
in a given input.   Perhaps surprisingly, the overhead of the Parabix
transformation was not a dominant factor, coming in at 0.08 ms/MB.
697
698
699\begin{figure}
700\begin{center}
701\begin{tikzpicture}
702\begin{axis}[
703xtick=data,
704ylabel=Running Time (ms per megabyte),
705xticklabels={@,Date,Email,URI,Hex,StarHeight},
706tick label style={font=\tiny},
707enlarge x limits=0.15,
708%enlarge y limits={0.15, upper},
709ymin=0,
710legend style={at={(0.5,-0.15)},
711anchor=north,legend columns=-1},
712ybar,
713bar width=7pt,
714]
715\addplot
716file {data/ssetime.dat};
717\addplot
718file {data/avxtime.dat};
719\addplot
720file {data/gputime.dat};
721
722\legend{SSE2,AVX2,GPU,Annot}
723\end{axis}
724\end{tikzpicture}
725\end{center}
726\caption{Running Time}\label{fig:SSE-AVX-GPU}
727\end{figure}
728
729
730
731
732
733
734
735
736\input{conclusion}
737
738
739
740%\appendix
741%\section{Appendix Title}
742
743%This is the text of the appendix, if you need one.
744
745
\section{Acknowledgments}
This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.
748
749\bibliographystyle{IEEEtranS}
750\bibliography{reference}
751
752\end{document}