source: docs/Working/re/pact051-cameron.tex @ 3876

Last change on this file since 3876 was 3876, checked in by cameron, 5 years ago

Delete extra period for paragraph{} items

File size: 36.5 KB
3\newfont{\mycrnotice}{ptmr8t at 7pt}
4\newfont{\myconfname}{ptmri8t at 7pt}
8\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from}
9\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada. \\
10{\mycrnotice{Copyright is held by the owner/author(s). Publication rights licensed to ACM.}}}
11\copyrightetc{ACM \the\acmcopyr}
12\crdata{978-1-4503-2809-8/14/08\ ...\$15.00.\\
16\widowpenalty = 10000
27\title{Bitwise Data Parallelism in Regular Expression Matching}
29\numberofauthors{7} %  in this sample file, there are a *total*
30% of EIGHT authors. SIX appear on the 'first-page' (for formatting
31% reasons) and the remaining two appear in the \additionalauthors section.
34% You can go ahead and credit any number of authors here,
35% e.g. one 'row of three' or two rows (consisting of one row of three
36% and a second row of one, two or three).
38% The command \alignauthor (no curly braces needed) should
39% precede each author name, affiliation/snail-mail address and
40% e-mail address. Additionally, tag each line of
41% affiliation/address with \affaddr, and tag the
42% e-mail address with \email.
44% 1st. author
46Rob Cameron\\
47       \affaddr{School of Computing Science}\\
48       \affaddr{Simon Fraser University}\\
49       \affaddr{Surrey, British Columbia}\\
50       \email{}
51% 2nd. author
53Tom Shermer\\
54       \affaddr{School of Computing Science}\\
55       \affaddr{Simon Fraser University}\\
56       \affaddr{Surrey, British Columbia}\\
57       \email{}
58% 3rd. author
59\alignauthor Arrvindh Shriraman\\
60       \affaddr{School of Computing Science}\\
61       \affaddr{Simon Fraser University}\\
62       \affaddr{Surrey, British Columbia}\\
63       \email{}
64\and  % use '\and' if you need 'another row' of author names
65% 4th. author
66\alignauthor Ken Herdy\\
67       \affaddr{School of Computing Science}\\
68       \affaddr{Simon Fraser University}\\
69       \affaddr{Surrey, British Columbia}\\
70       \email{}
71% 5th. author
72\alignauthor Dan Lin\\
73       \affaddr{School of Computing Science}\\
74       \affaddr{Simon Fraser University}\\
75       \affaddr{Surrey, British Columbia}\\
76       \email{}
77% 6th. author
78\alignauthor Ben Hull\\
79       \affaddr{School of Computing Science}\\
80       \affaddr{Simon Fraser University}\\
81       \affaddr{Surrey, British Columbia}\\
82       \email{}
83% 7th. author
84\alignauthor Meng Lin\\
85       \affaddr{School of Computing Science}\\
86       \affaddr{Simon Fraser University}\\
87       \affaddr{Surrey, British Columbia}\\
88       \email{}
90% There's nothing stopping you putting the seventh, eighth, etc.
91% author on the opening page (as the 'third row') but we ask,
92% for aesthetic reasons that you place these 'additional authors'
93% in the \additional authors block, viz.
94%\additionalauthors{Additional authors: John Smith (The Th{\o}rv{\"a}ld Group,
95%email: {\texttt{}}) and Julius P.~Kumquat
96%(The Kumquat Consortium, email: {\texttt{}}).}
97%\date{30 July 1999}
98% Just remember to make sure that the TOTAL number of authors
99% is the number that will appear on the first page PLUS the
100% number that will appear in the \additionalauthors section.
115The use of regular expressions to search texts for occurrences
116of string patterns has a long history and
117remains a pervasive technique throughout computing applications today.
118% {\em a brief history}
119The origins of regular expression matching date back to automata theory
120developed by Kleene in the 1950s \cite{kleene1951}.
121Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
122to nondeterministic finite automata (NFA).
123Following Thompson's approach, a regular expression of length $m$ is converted
124to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
125NFA in worst case $O(mn)$ time. Often, a more efficient choice
126is to convert an NFA into a DFA. A DFA has a single active state at any time
127throughout the matching process and
128hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well
129known that the conversion of an NFA to an equivalent DFA may result
130in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
132A significant proportion of the research in fast regular expression matching can be
133regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
134In \cite{baeza1992new}, Baeza-Yates and Gonnet
135presented a new approach to string search based on bit-level parallelism.
136This technique takes advantage of the intrinsic parallelism of bitwise operations
137within a computer word. Thus,
138given a $w$-bit word, the number of operations that a string search algorithms
139performs can be reduced by a factor $w$.
140Building on this observation, the Shift-Or algorithm simulates an NFA using
141bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
142A disadvantage of the Shift-Or approach
143is an inability to skip input characters.
144Simple string matching algorithms,
145such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
146to achieve sublinear times in the average case.
147% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
148% based on suffix automata are able to skip characters.
149The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
150combines the advantages of the Shift-Or approach
151with the ability to skip characters. %character skipping property of BDM algorithms.
152The nrgrep tool is based on the BNDM algorithm. It is generally
154the fastest grep tool
155for matching complex patterns, and achieves similar performance
156to the fastest existing string
157matching tools for simple patterns \cite{navarro2000}.
159There has been considerable recent
160interest in accelerating regular expression matching
161on parallel hardware
162such as multicore processors (CPUs),
163general purpose graphics processing units (GPGPUs),
164field-programmable gate arrays (FPGAs),
165and specialized architectures such as
166the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
168Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
169text processing algorithms that exhibit irregular memory access patterns
170can be efficiently executed on multicore hardware.
171In related work, Pasetto et al presented a flexible tool that
172performs small-ruleset regular expression matching at a rate of
1732.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
174Naghmouchi et al \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
175string matching algorithm \cite{aho1975} is well-suited for parallel
176implementation on multicore CPUs, GPGPUs and the Cell BE.
177On each system, both thread-level parallelism (cores) and data-level parallelism
178(SIMD units) were leveraged for performance.
179Salapura et al \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
180in business analytics applications and leveraged the SIMD hardware available
181on multicore processors to achieve a speedup of greater than 1.8 over a
182range of data sizes of interest.
184In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
185that delivered 1.00--1.78 Gbps on a single
186Cell BE chip and extended this approach for emulation on the Intel Larrabee
187instruction set \cite{scarpazza2009larrabee}.
188On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
189implementation that delivered a throughput of 40
190Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
191Gbps for a larger dictionary containing thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
192presented a string matching implementation for automata that achieved
1934 Gbps on the Cell BE.
194% GPU
195On GPGPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
196implementation of the AC algorithm for
197accelerating string matching on GPGPUs. Lin et al., proposed
198the Parallel Failureless Aho-Corasick (PFAC)
199algorithm to accelerate pattern matching on GPGPU hardware and
200achieved 143 Gbps raw data throughput,
201although system throughput was limited to 15 Gbps
203Most recently, Mytkowicz et al have developed a method for combining
204SIMD parallellism and data parallelism
206Whereas the existing approaches to parallelization have been
207based on adapting traditional sequential algorithms to emergent
208parallel architectures, we introduce both a new algorithmic
209approach and its implementation on SIMD and GPGPU architectures.
210This approach relies on a bitwise data parallel view of text
211streams as well as a surprising use of addition to match
212runs of characters in a single step.  The closest previous
213work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
214technology together with a parallel scanning primitive also
215based on addition \cite{cameron2011parallel}.   
216However, in contrast to the deterministic, longest-match
217scanning associated with the ScanThru primitive of that
218work, we introduce here a new primitive MatchStar
219that can be used in full generality for nondeterministic
220regular expression matching.   We also introduce a long-stream
221addition technique involving a further application of MatchStar
222that enables us to scale the technique to $n$-bit addition
223in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
224for example, to perform
225synchronized 4096-bit addition on GPGPU wavefronts of 64 threads.
227There is also a strong keyword match between the bit-parallel
228data streams used in our approach and the bit-parallelism
229used for NFA state transitions in the classical algorithms of
230Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new}
231and Navarro and Raffinot \cite{navarro1998bit}.
232However those algorithms use bit-parallelism in a fundamentally
233different way: representing all possible current NFA states
234as a bit vector and performing parallel transitions to a new
235set of states using table lookups and bitwise logic.    Whereas
236our approach can match multiple characters per step, bit-parallel
237NFA algorithms proceed through the input one byte at a time.
238Nevertheless, the agrep \cite{wu1992agrep} and
239nrgrep \cite{navarro2000} programs implemented using these techniques remain
240among the strongest competitors in regular expression matching
241performance, so we include them in our comparative evaluation.
243The remainder of this paper is organized as follows.
244Section \ref{sec:grep} briefly describes regular expression
245notation and the grep problem.
246Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
247primitive using a model of arbitrary-length bit-parallel data streams.
248Section \ref{sec:blockwise} discusses the block-by-block
249implementation of our techniques including the long stream
250addition techniques for 256-bit addition with AVX2 and
2514096-bit additions with GPGPU SIMT.
252Section \ref{sec:SSE2} describes our overall SSE2 implementation
253and carries out a performance study in comparison with
254existing grep implementations.
255Given the dramatic variation in grep performance across
256different implementation techniques, expressions and data sets,
257Section \ref{sec:analysis} considers a comparison between
258the bit-stream and NFA approaches from a theoretical perspective.
259Section \ref{sec:AVX2} then examines and demonstrates
260the scalability of our
261bitwise data-parallel approach in moving from
262128-bit to 256-bit SIMD on Intel Haswell architecture.
263To further investigate scalability, Section \ref{sec:GPU} 
264addresses the implementation of our matcher using
265groups of 64 threads working together SIMT-style on a GPGPU system.
266Section \ref{sec:Concl} concludes the paper with a discussion of results and
267areas for future work.
269\section{Regular Expression Notation and Grep}\label{sec:grep}
271We follow common POSIX notation for regular expressions.
272A regular expression specifies a set of strings through
273a pattern notation.   Individual characters normally
274stand for themselves, unless they are one of the
275special characters \verb:*+?[{\(|^$.: that serve as metacharacters
276of the notation system.  Thus the regular expression \verb:cat:
277is a pattern for the set consisting of the single 3-character
278string ``\verb:cat:''.   The special characters must be escaped
279with a backslash to prevent interpretation as metacharacter, thus
280\verb:\$: represents the dollar-sign and \verb:\\\\: represent
281the string consisting of two backslash characters.
282Character class bracket expressions are pattern elements
283that allow any character in a given class to be used in a particular
284context.  For example, \verb:[@#%]: is a regular expression
285that stands for any of the three given symbols.  Contiguous
286ranges of characters may be specified using hyphens;
287for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
288for any alphanumeric character or underscore.  If the
289caret character immediately follows the opening bracket,
290the class is negated, thus \verb:[^0-9]: stands for
291any character except a digit.  The period metacharacter
292\verb:.: stands for the class of all characters.
294Consecutive pattern elements stand for strings formed by
295concatenation, thus \verb:[cd][ao][tg]: stands for the
296set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
297The alternation operator \verb:|: allows a pattern to be
298defined to have two alternative forms, thus \verb:cat|dog:
299matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
300takes precedence over alternation, but parenthesis may be
301used to change this, thus \verb:(ab|cd)[0-9]: stands for any
302digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
304Repetition operators may be appended to a pattern to specify
305a variable number of occurrences of that pattern. 
306The Kleene star operator \verb:*: specifies zero-or-more occurrences
307matching the previous pattern, while Kleene plus \verb:+: specifies one-or
308more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
309both specify the same set: strings of at least one lower-case
310letter.  The postfix operator \verb:?: specifies an optional
311component, i.e., zero-or-one occurrence of strings matching
312the element.  Specific bounds may be given within braces:
313\verb:(ab){3}: specifies the string ``\verb:ababab:'',
314\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
315or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
316strings of at least 4 consecutive capital letters.
318The grep program searches a file for lines containing matches
319to a regular expression using any of the above notations.
320In addition, the pattern elements \verb:^: and \verb:$:
321may be used to match respectively the beginning or the
322end of a line.  In line-based tools such as grep, \verb:.:
323matches any character except newlines; matches cannot extend
324over lines.
325Normally, grep prints all matching
326lines to its output.  However, grep programs typically
327allow a command line flag such as \verb:-c: to specify
328that only a count of matching lines be produced; we use
329this option in our experimental evaluation to focus
330our comparisons on the performance of the underlying matching
333\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
335Whereas the traditional approaches to regular expression matching
336using NFAs, DFAs or backtracking all rely on a byte-at-a-time
337processing model, the approach  we introduce in this paper is based
338on quite a different concept:  a data-parallel approach to simultaneous
339processing of data stream elements.  Indeed, our most abstract model
340is that of unbounded data parallelism: processing all elements of
341the input data stream simultaneously.   In essence, data streams are viewed
342as (very large) integers.   The fundamental operations are bitwise
343logic, stream shifting and long-stream addition.
345Depending on the available parallel processing resources, an actual
346implementation may divide an input stream into blocks  and process
347the blocks sequentially.   Within each block  all elements of the
348input stream are processed together, relying on the availability of
349bitwise logic and addition scaled to the block size.   On commodity
350Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
351we typically process input streams 128 bytes at a time.   
352In this
353case, we rely on the Parabix tool chain \cite{lin2012parabix}
354to handle the details of compilation to block-by-block processing.
355On the
356latest processors supporting the 256-bit AVX2 SIMD operations,
357we also use the Parabix tool chain, but substitute a parallelized
358long-stream addition technique to avoid the sequential chaining
359of 4 64-bit additions.
360Our GPGPU implementation uses scripts to modify the output
361of the Parabix tools, effectively dividing the input into blocks
362of 4K bytes.   
363We also have adapted our long-stream addition technique
364to perform 4096-bit additions using 64 threads working in lock-step
365SIMT fashion. 
370input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
371$B_7$ & \verb`.................................`\\
372$B_6$ & \verb`1...1..1.1..11..1.....1..11..1...`\\
373$B_5$ & \verb`111111111111111111111111111111111`\\
374$B_4$ & \verb`.1111...11...1...111111....1111..`\\
375$B_3$ & \verb`....111..111.111...1.1111....1.11`\\
376$B_2$ & \verb`.11..11...11..11....1..11.....111`\\
377$B_1$ & \verb`...11..111...1....1...1..1.1111..`\\
378$B_0$ & \verb``\\
379\verb:[a]: & \verb`1...........1...1.........1......`\\
380\verb:[z9]: & \verb`....1....1...1.....1.11......1...`\\
381\verb:[0-9]: & \verb`.111....1........11111.....11.1..`
385\caption{Basis and Character Class Streams}
389A key concept in this streaming approach is the derivation of bit streams
390that are parallel to the input data stream, i.e., in one-to-one
391correspondence with the data element positions of the input
392streams.   Typically, the input stream is a byte stream comprising
393the 8-bit character code units of a particular encoding such
394as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
395easily be used with wider code units such as the 16-bit code units of
396UTF-16.   In the case of a byte stream, the first step is to transpose
397the byte stream into eight parallel bit streams, such that bit stream
398$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
399a set of basis bit streams from which many other parallel bit
400streams can be calculated, such as character class bit
401streams such that each bit $j$ of the stream specifies
402whether character $j$ of the input stream is in the class
403or not.  Figure \ref{fig:streams} shows an example of an
404input byte stream in ASCII, the eight basis bit streams of the
405transposed representation, and the character class bit streams
407\verb:[z9]:, and
409that may be computed from the basis bit streams using bitwise logic.
410Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
411Transposition and character class construction are straightforward
412using the Parabix tool chain \cite{lin2012parabix}.
417input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
418$M_1$ & \verb`.1...........1...1.........1.....`\\
419$M_2$ & \verb`.1111........1...111111....111...`\\
420$M_3$ & \verb`.....1........1.....1.11......1..`
424\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
428\paragraph*{Marker Streams}  Now consider how bit-parallel data
429streams can be used in regular expression matching.   Consider
430the problem of searching the input stream of Figure \ref{fig:streams}
431to finding occurrence of strings matching
432the regular expression \verb:a[0-9]*[z9]:. 
433Note that this is an ambiguous regular expression, which could match
434texts such as \verb:a12949z: in multiple ways.
435The matching process involves the concept of {\em marker streams}, that
436is streams that mark the positions of current matches during the
437overall process.  In this case there are three marker streams computed
438during the match process, namely,
439$M_1$ representing match positions after an initial \verb:a:
440character has been found, $M_2$ representing positions
441reachable from positions marked by $M_1$ by further matching zero or
442more digits (\verb:[0-9]*:) and finally $M_3$ the stream
443marking positions after a final \verb:z: or \verb:9: has been found.
444Without describing the details of how these streams are computed
445for the time being, Figure \ref{fig:streams2} shows what each
446of these streams should be for our example matching problem.
447Our convention that a marker stream contains a 1 bit
448at the next character position to be matched, that is,
449immediately past the last position that was matched.
450Note that all three matches from the third occurrence of \verb:a:
451are correctly marked in $M_3$.
455MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
460input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
461$M_1$ & \verb`.1...........1...1.........1.....`\\
462$C = \text{\tt [0-9]}$ & \verb`.111....1........11111.....11.1..`\\
463$T_0 = M_1 \wedge C$ & \verb`.1...............1.........1.....`\\
464$T_1 = T_0 + C$ & \verb`....1...1.............1......11..`\\
465$T_2 = T_1 \oplus C$ & \verb`.1111............111111....111...`\\
466$M_2 = T_2 \vee M_1$ & \verb`.1111........1...111111....111...`
470\caption{$M_2 = \text{MatchStar}(M_1, C)$}
475Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
476it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
477conventional presentation of our character data input.   However, this reverses the normal
478order of presentation when considering bitstreams as numeric values.  The key point here is
479that when we perform bitstream addition, we will show bit movement from left-to-right.
480For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
482The first row of the figure is the input data,
483the second and third rows are the input bitstreams: the initial marker position bitstream and the
484character class bitstream for digits derived from input data. 
486In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
487The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
488output marker stream is obtained by ORing $T_2$ with the initial marker stream.
490In general, given a marker stream $M$ and a character class stream $C$,
491the operation of MatchStar is defined by the following equation. 
492\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
493Given a set of initial marker positions, the result stream marks
494all possible positions that can be reached by 0 or more occurrences
495of characters in class $C$ from each position in $M$
497MatchStar differs from ScanThru of the Parabix tool chain in that it
498finds all matches, not just the longest match.  This is necessary
499for general matching involving possibly ambiguous regular
506\section{Block-at-a-Time Processing}\label{sec:blockwise}
508The unbounded stream model of the previous section must of course
509be translated an implementation that proceeds block-at-a-time for
510realistic application.  In this, we primarily rely on the Pablo
511compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
512statements expressed as arbitrary-length bitstream equations, Pablo
513produces block-at-a-time C++ code that initializes and maintains all the necessary
514carry bits for each of the additions and shifts involved in the
515bitstream calculations.   
517In the present work, our principal contribution to the Parabix tool
518chain is to incorporate the technique of long-stream addition described below.
519Otherwise, we were able to use Pablo directly in compiling our
520SSE2 and AVX2 implementations.   Our GPGPU implementation required
521some scripting to modify the output of the Pablo compiler for our
524\paragraph*{Long-Stream Addition}  The maximum word size for
525addition on commodity processors is typically 64 bits.  In order
526to implement long-stream addition for block sizes of 256 or larger,
527a method for propagating carries through the individual stages of
52864-bit addition is required.  However, the normal technique of
529sequential addition using add-with-carry instructions, for example,
530is far from ideal.
532We use the following general model using SIMD methods for constant-time
533long-stream addition up to 4096 bits.   Related GPGPU solutions have been
534independently developed\cite{Crovella2012},
535however our model is intended to be a more broadly applicable abstraction.
536We assume the availability of the following SIMD/SIMT operations
537operating on vectors of $f$ 64-bit fields.
539\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
540in two vectors to produce a result vector of $f$ 64-bit fields.
541\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
542of \verb:x: each with the constant value -1 (all bits 1), producing
543an $f$-bit mask value,
544\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
545field into a single compressed $f$-bit mask value, and
546\item normal bitwise logic operations on $f$-bit masks, and
547\item  \verb#simd<64>::spread(X)#: distributing the bits of
548an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector.
551In this model, the \verb#hsimd<64>::mask(X)# and
552\verb#simd<64>::spread(X)# model the minimum
553communication requirements between the parallel processing units
554(SIMD lanes or SIMT processors).    In essence, we just need
555the ability to quickly send and receive 1 bit of information
556per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
557gathers 1 bit from each of the processors to a central resource.
558After calculations on the gather bits are performed, we then
559just need an operation to invert the communication, i.e.,
560sending 1 bit each from the central processor to each of
561the parallel units.   There are a variety of ways in which
562these facilities may be implemented depending on the
563underlying architecture; details of our AVX2 and GPGPU implementations
564are presented later.   
566Given these operations, our method for long stream addition of
567two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
569\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
570\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
572\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
573\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
574\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
575\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
577\item Compute an $f$-bit mask of carries generated for each of the
57864-bit additions of \verb:X: and \verb:Y:.
579\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
581\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
582an incoming carry bit.  This is called the {\em bubble mask}.
583\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
585\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
586incremented to produce the final sum.  Here we find a new application of
588\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
590This is the key step.  The mask {\tt c} of outgoing carries must be
591shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
592with the next digit.  At the incoming position, the carry will
593increment the 64-bit digit.   However, if this digit is all ones (as
594signaled by the corresponding bit of bubble mask {\tt b}, then the addition
595will generate another carry.  In fact, if there is a sequence of
596digits that are all ones, then the carry must bubble through
597each of them.   This is just MatchStar.
599\item Compute the final result {\tt Z}.
600\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
606{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
607{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
608{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
609{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
610{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
611{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
612{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
613{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
614{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
615{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
616{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
619\caption{Long Stream Addition}\label{fig:longadd}
622Figure \ref{fig:longadd} illustrates the process.  In the figure,
623we illustrate the process with 8-bit fields rather than 64-bit fields
624and show all field values in hexadecimal notation.  Note that
625two of the individual 8-bit additions produce carries, while two
626others produce {\tt FF} values that generate bubble bits.  The
627net result is that four of the original 8-bit sums must be
628incremented to produce the long stream result.
630A slight extension to the process produces a long-stream full adder
631that can be used in chained addition.  In this case, the
632adder must take an additional carry-in bit
633{\tt p} and produce a carry-out bit {\tt q}.
634This may be accomplished by incorporating {\tt p}
635in calculating the increment mask in the low bit position,
636and then extracting the carry-out {\tt q} from the high bit position.
637\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
638\[\text{\tt q} = \text{\tt i >> f}\]
640As described subsequently, we use a two-level long-stream addition technique
641in both our AVX2 and GPGPU implementations.  In principle, one can extend
642the technique to additional levels.  Using 64-bit adders throughout,
643$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
644A three-level scheme could coordinate
64564 groups each performing 4096-bit long additions in a two-level structure.
646However, whether there are reasonable architectures that can support fine-grained
647SIMT style at this level is an open question.
649Using the methods outlined, it is quite conceivable that instruction
650set extensions to support long-stream addition could be added for
651future SIMD and GPGPU processors.   Given the fundamental nature
652of addition as a primitive and its particular application to regular
653expression matching as shown herein, it seems reasonable to expect
654such instructions to become available.    Alternatively, it may
655be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and
656\verb#simd<64>::spread(X)# operations are efficiently supported.
667\section{GPGPU Implementation}\label{sec:GPU}
669To further assess the scalability of our regular expression matching
670using bit-parallel data streams, we implemented a GPGPU version
671in OpenCL.   
672We arranged for 64 work groups each having 64 threads.
673The size of work group and number of work groups is chosen
674to provide the best occupancy as calculated by the AMD App Profiler.
675Input files are divided in data parallel fashion among
676the 64 work groups.  Each work group carries out the regular
677expression matching operations 4096 bytes at a time using SIMT
678processing.   Although the GPGPU
679does not directly support the mask and spread operations required
680by our long-stream addition model,
681we are able to simulate them using shared memory.
682Each thread maintains
683its own carry and bubble values in shared memory and performs
684synchronized updates with the other threads using a six-step
685parallel-prefix style process.  Others have implemented
686long-stream addition on the GPGPU using similar techniques,
687as noted previously.
689We performed our test on an AMD Radeon HD A10-6800K APU machine.
690On the AMD Fusion systems, the input buffer is allocated in
691pinned memory to take advantage of the zero-copy memory regions
692where data can be read directly into this region by the CPU
693and also accessed by the GPGPU for further processing. Therefore,
694the expensive data transferring time that is needed by traditional
695discrete GPGPUs is hidden and we compare only the kernel execution
696time with our SSE2 and AVX implementations as shown in Figure
697\ref{fig:SSE-AVX-GPU}. The GPGPU version gives 30\% to 60\% performance
698improvement over SSE version and 10\% to 40\% performance
699improvement over AVX version. Although we intended to process
70064 work groups with 4096 bytes each at a time rather than 128 bytes
701at a time on SSE or 256 bytes at a time on AVX, the performance
702improvement is less than 60\%. The first reason is hardware
703limitations. Our kernel occupancy is limited by register usage
704and not all the work groups can be scheduled at the same time.
705The second reason is that the long-stream addition implemented
706on GPGPU is more expensive than the implementations on SSE or AVX.
707Another important reason is the control flow. When a possible
708match is found in one thread, the rest of the threads in the
709same work group have to execute the same instructions for
710further processing rather than jump to the next block with a
711simple IF test. Therefore, the performance of different
712regular expressions is dependent on the number of
713long-stream addition operations and the total number of matches
714of a given input.   Perhaps surprisingly, the overhead of the Parabix
715transformation was not a dominant factor, coming in at 0.08 ms/MB.
723ylabel=Running Time (ms per megabyte),
725tick label style={font=\tiny},
726enlarge x limits=0.15,
727%enlarge y limits={0.15, upper},
729legend style={at={(0.5,-0.15)},
730anchor=north,legend columns=-1},
732bar width=7pt,
735file {data/ssetime.dat};
737file {data/avxtime.dat};
739file {data/gputime.dat};
745\caption{Running Time}\label{fig:SSE-AVX-GPU}
760%\section{Appendix Title}
762%This is the text of the appendix, if you need one.
765This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
766MITACS, Inc.
Note: See TracBrowser for help on using the repository browser.