3\newfont{\mycrnotice}{ptmr8t at 7pt}
4\newfont{\myconfname}{ptmri8t at 7pt}
8\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from}
9\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada. \\
10{\mycrnotice{Copyright is held by the owner/author(s). Publication rights licensed to ACM.}}}
11\copyrightetc{ACM \the\acmcopyr}
12\crdata{978-1-4503-2809-8/14/08\ ...\$15.00.\\
16\widowpenalty = 10000
27\title{Bitwise Data Parallelism in Regular Expression Matching}
29\numberofauthors{7} %  in this sample file, there are a *total*
30% of EIGHT authors. SIX appear on the 'first-page' (for formatting
31% reasons) and the remaining two appear in the \additionalauthors section.
34% You can go ahead and credit any number of authors here,
35% e.g. one 'row of three' or two rows (consisting of one row of three
36% and a second row of one, two or three).
38% The command \alignauthor (no curly braces needed) should
39% precede each author name, affiliation/snail-mail address and
40% e-mail address. Additionally, tag each line of
41% affiliation/address with \affaddr, and tag the
42% e-mail address with \email.
44% 1st. author
46Rob Cameron, Tom Shermer, Arrvindh Shriraman, Ken Herdy, Dan Lin, Ben Hull, Meng Lin\\
47       \affaddr{School of Computing Science}\\
48       \affaddr{Simon Fraser University}\\
49       \affaddr{Surrey, British Columbia}\\
50       \email{\{cameron,shermer,ashrira,ksherdy,lindanl,bhull,linmengl\}}
66The use of regular expressions to search texts for occurrences
67of string patterns has a long history and
68remains a pervasive technique throughout computing applications today.
69% {\em a brief history}
70The origins of regular expression matching date back to automata theory
71developed by Kleene in the 1950s \cite{kleene1951}.
72Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
73to nondeterministic finite automata (NFA).
74Following Thompson's approach, a regular expression of length $m$ is converted
75to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
76NFA in worst case $O(mn)$ time. Often, a more efficient choice
77is to convert an NFA into a DFA. A DFA has a single active state at any time
78throughout the matching process and
79hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well
80known that the conversion of an NFA to an equivalent DFA may result
81in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
83A significant proportion of the research in fast regular expression matching can be
84regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
85In \cite{baeza1992new}, Baeza-Yates and Gonnet
86presented a new approach to string search based on bit-level parallelism.
87This technique takes advantage of the intrinsic parallelism of bitwise operations
88within a computer word. Thus,
given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
91Building on this observation, the Shift-Or algorithm simulates an NFA using
92bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
93A disadvantage of the Shift-Or approach
94is an inability to skip input characters.
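The bit-parallel NFA simulation underlying Shift-Or can be sketched in a few lines; this is an illustrative Python rendering of the classic recurrence (for exact string matching), not code from any of the cited tools:

```python
def shift_or_search(pattern, text):
    """Classic Shift-Or (Baeza-Yates/Gonnet): simulate an NFA for exact
    matching using one shift and one OR per input character."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c] has a 0 bit at position i wherever pattern[i] == c, 1 elsewhere.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D = all_ones                  # state register: 0 bits mark live NFA states
    matches = []
    for j, c in enumerate(text):
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if D & (1 << (m - 1)) == 0:       # pattern fully matched
            matches.append(j - m + 1)     # record the match's start position
    return matches

assert shift_or_search("ab", "xabab") == [1, 3]
```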
95Simple string matching algorithms,
96such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
97to achieve sublinear times in the average case.
98% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
99% based on suffix automata are able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{navarro1998bit}
101combines the advantages of the Shift-Or approach
102with the ability to skip characters. %character skipping property of BDM algorithms.
103The nrgrep tool is based on the BNDM algorithm. It is generally
105the fastest grep tool
106for matching complex patterns, and achieves similar performance
107to the fastest existing string
108matching tools for simple patterns \cite{navarro2000}.
110There has been considerable recent
111interest in accelerating regular expression matching
112on parallel hardware
113such as multicore processors (CPUs),
114general purpose graphics processing units (GPGPUs),
115field-programmable gate arrays (FPGAs),
116and specialized architectures such as
117the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
119Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
120text processing algorithms that exhibit irregular memory access patterns
121can be efficiently executed on multicore hardware.
In related work, Pasetto et al. presented a flexible tool that
123performs small-ruleset regular expression matching at a rate of
1242.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
126string matching algorithm \cite{aho1975} is well-suited for parallel
127implementation on multicore CPUs, GPGPUs and the Cell BE.
128On each system, both thread-level parallelism (cores) and data-level parallelism
129(SIMD units) were leveraged for performance.
Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
131in business analytics applications and leveraged the SIMD hardware available
132on multicore processors to achieve a speedup of greater than 1.8 over a
133range of data sizes of interest.
135In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
136that delivered 1.00--1.78 Gbps on a single
137Cell BE chip and extended this approach for emulation on the Intel Larrabee
138instruction set \cite{scarpazza2009larrabee}.
139On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
140implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
142Gbps for a larger dictionary containing thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 
143presented a string matching implementation for automata that achieved
1444 Gbps on the Cell BE.
145% GPU
146On GPGPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
147implementation of the AC algorithm for
accelerating string matching on GPGPUs. Lin et al. proposed
149the Parallel Failureless Aho-Corasick (PFAC)
150algorithm to accelerate pattern matching on GPGPU hardware and
151achieved 143 Gbps raw data throughput,
although system throughput was limited to 15 Gbps.
Most recently, Mytkowicz et al. have developed a method for combining
SIMD parallelism and data parallelism.
157Whereas the existing approaches to parallelization have been
158based on adapting traditional sequential algorithms to emergent
159parallel architectures, we introduce both a new algorithmic
160approach and its implementation on SIMD and GPGPU architectures.
161This approach relies on a bitwise data parallel view of text
162streams as well as a surprising use of addition to match
163runs of characters in a single step.  The closest previous
164work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
165technology together with a parallel scanning primitive also
166based on addition \cite{cameron2011parallel}.   
167However, in contrast to the deterministic, longest-match
168scanning associated with the ScanThru primitive of that
169work, we introduce here a new primitive MatchStar
170that can be used in full generality for nondeterministic
171regular expression matching.   We also introduce a long-stream
172addition technique involving a further application of MatchStar
173that enables us to scale the technique to $n$-bit addition
174in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
175for example, to perform
176synchronized 4096-bit addition on GPGPU wavefronts of 64 threads.
178There is also a strong keyword match between the bit-parallel
179data streams used in our approach and the bit-parallelism
180used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
182and Navarro and Raffinot \cite{navarro1998bit}.
183However those algorithms use bit-parallelism in a fundamentally
184different way: representing all possible current NFA states
185as a bit vector and performing parallel transitions to a new
186set of states using table lookups and bitwise logic.    Whereas
187our approach can match multiple characters per step, bit-parallel
188NFA algorithms proceed through the input one byte at a time.
189Nevertheless, the agrep \cite{wu1992agrep} and
190nrgrep \cite{navarro2000} programs implemented using these techniques remain
191among the strongest competitors in regular expression matching
192performance, so we include them in our comparative evaluation.
194The remainder of this paper is organized as follows.
195Section \ref{sec:grep} briefly describes regular expression
196notation and the grep problem.
197Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
198primitive using a model of arbitrary-length bit-parallel data streams.
199Section \ref{sec:blockwise} discusses the block-by-block
200implementation of our techniques including the long stream
201addition techniques for 256-bit addition with AVX2 and
2024096-bit additions with GPGPU SIMT.
203Section \ref{sec:SSE2} describes our overall SSE2 implementation
204and carries out a performance study in comparison with
205existing grep implementations.
206Given the dramatic variation in grep performance across
207different implementation techniques, expressions and data sets,
208Section \ref{sec:analysis} considers a comparison between
209the bit-stream and NFA approaches from a theoretical perspective.
210Section \ref{sec:AVX2} then examines and demonstrates
211the scalability of our
212bitwise data-parallel approach in moving from
213128-bit to 256-bit SIMD on Intel Haswell architecture.
214To further investigate scalability, Section \ref{sec:GPU} 
215addresses the implementation of our matcher using
216groups of 64 threads working together SIMT-style on a GPGPU system.
217Section \ref{sec:Concl} concludes the paper with a discussion of results and
218areas for future work.
220\section{Regular Expression Notation and Grep}\label{sec:grep}
222We follow common POSIX notation for regular expressions.
223A regular expression specifies a set of strings through
224a pattern notation.   Individual characters normally
225stand for themselves, unless they are one of the
226special characters \verb:*+?[{\(|^$.: that serve as metacharacters
227of the notation system.  Thus the regular expression \verb:cat:
228is a pattern for the set consisting of the single 3-character
229string ``\verb:cat:''.   The special characters must be escaped
with a backslash to prevent interpretation as a metacharacter; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
233Character class bracket expressions are pattern elements
234that allow any character in a given class to be used in a particular
235context.  For example, \verb:[@#%]: is a regular expression
236that stands for any of the three given symbols.  Contiguous
237ranges of characters may be specified using hyphens;
238for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
239for any alphanumeric character or underscore.  If the
240caret character immediately follows the opening bracket,
241the class is negated, thus \verb:[^0-9]: stands for
242any character except a digit.  The period metacharacter
243\verb:.: stands for the class of all characters.
245Consecutive pattern elements stand for strings formed by
246concatenation, thus \verb:[cd][ao][tg]: stands for the
247set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
248The alternation operator \verb:|: allows a pattern to be
249defined to have two alternative forms, thus \verb:cat|dog:
250matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
252used to change this, thus \verb:(ab|cd)[0-9]: stands for any
253digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
255Repetition operators may be appended to a pattern to specify
256a variable number of occurrences of that pattern. 
257The Kleene star operator \verb:*: specifies zero-or-more occurrences
258matching the previous pattern, while Kleene plus \verb:+: specifies
259one-or-more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
260both specify the same set: strings of at least one lower-case
261letter.  The postfix operator \verb:?: specifies an optional
262component, i.e., zero-or-one occurrence of strings matching
263the element.  Specific bounds may be given within braces:
264\verb:(ab){3}: specifies the string ``\verb:ababab:'',
265\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
266or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
267strings of at least 4 consecutive capital letters.
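The constructs above can be exercised directly with Python's \verb:re: module, whose syntax coincides with the notation described for these basic constructs (an illustrative aside, not part of our evaluation):

```python
import re

# Concatenation of character classes: the 8 strings "cat" through "dog".
assert re.fullmatch(r"[cd][ao][tg]", "cat")
assert re.fullmatch(r"[cd][ao][tg]", "dog")
assert re.fullmatch(r"[cd][ao][tg]", "cow") is None

# Kleene star vs. plus: both patterns denote the same language.
assert re.fullmatch(r"[a-z][a-z]*", "abc")
assert re.fullmatch(r"[a-z]+", "abc")

# Bounded repetition: two to four hexadecimal digits.
assert re.fullmatch(r"[0-9A-Fa-f]{2,4}", "BEEF")
assert re.fullmatch(r"[0-9A-Fa-f]{2,4}", "F") is None
```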
269The grep program searches a file for lines containing matches
270to a regular expression using any of the above notations.
271In addition, the pattern elements \verb:^: and \verb:$:
272may be used to match respectively the beginning or the
273end of a line.  In line-based tools such as grep, \verb:.:
274matches any character except newlines; matches cannot extend
275over lines.
276Normally, grep prints all matching
277lines to its output.  However, grep programs typically
278allow a command line flag such as \verb:-c: to specify
279that only a count of matching lines be produced; we use
280this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.
284\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
286Whereas the traditional approaches to regular expression matching
287using NFAs, DFAs or backtracking all rely on a byte-at-a-time
288processing model, the approach  we introduce in this paper is based
289on quite a different concept:  a data-parallel approach to simultaneous
290processing of data stream elements.  Indeed, our most abstract model
291is that of unbounded data parallelism: processing all elements of
292the input data stream simultaneously.   In essence, data streams are viewed
293as (very large) integers.   The fundamental operations are bitwise
294logic, stream shifting and long-stream addition.
296Depending on the available parallel processing resources, an actual
297implementation may divide an input stream into blocks  and process
298the blocks sequentially.   Within each block  all elements of the
299input stream are processed together, relying on the availability of
300bitwise logic and addition scaled to the block size.   On commodity
301Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
302we typically process input streams 128 bytes at a time.   
303In this
304case, we rely on the Parabix tool chain \cite{lin2012parabix}
305to handle the details of compilation to block-by-block processing.
306On the
307latest processors supporting the 256-bit AVX2 SIMD operations,
308we also use the Parabix tool chain, but substitute a parallelized
309long-stream addition technique to avoid the sequential chaining
310of 4 64-bit additions.
311Our GPGPU implementation uses scripts to modify the output
312of the Parabix tools, effectively dividing the input into blocks
313of 4K bytes.   
314We also have adapted our long-stream addition technique
315to perform 4096-bit additions using 64 threads working in lock-step
316SIMT fashion. 
321input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
322$B_7$ & \verb`.................................`\\
323$B_6$ & \verb`1...1..1.1..11..1.....1..11..1...`\\
324$B_5$ & \verb`111111111111111111111111111111111`\\
325$B_4$ & \verb`.1111...11...1...111111....1111..`\\
326$B_3$ & \verb`....111..111.111...1.1111....1.11`\\
327$B_2$ & \verb`.11..11...11..11....1..11.....111`\\
328$B_1$ & \verb`...11..111...1....1...1..1.1111..`\\
$B_0$ & \verb`1.11.11.1.111.1111.1.1.1111...111`\\
330\verb:[a]: & \verb`1...........1...1.........1......`\\
331\verb:[z9]: & \verb`....1....1...1.....1.11......1...`\\
332\verb:[0-9]: & \verb`.111....1........11111.....11.1..`
336\caption{Basis and Character Class Streams}
340A key concept in this streaming approach is the derivation of bit streams
341that are parallel to the input data stream, i.e., in one-to-one
342correspondence with the data element positions of the input
343streams.   Typically, the input stream is a byte stream comprising
344the 8-bit character code units of a particular encoding such
345as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
346easily be used with wider code units such as the 16-bit code units of
347UTF-16.   In the case of a byte stream, the first step is to transpose
348the byte stream into eight parallel bit streams, such that bit stream
349$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
350a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams, in which each bit $j$ of the stream specifies
whether character $j$ of the input stream is in the class
354or not.  Figure \ref{fig:streams} shows an example of an
355input byte stream in ASCII, the eight basis bit streams of the
356transposed representation, and the character class bit streams
\verb:[a]:, \verb:[z9]:, and \verb:[0-9]:
360that may be computed from the basis bit streams using bitwise logic.
361Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
362Transposition and character class construction are straightforward
363using the Parabix tool chain \cite{lin2012parabix}.
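The transposition and one character-class computation can be sketched in a few lines of Python; this is an illustrative rendering of the idea, not Parabix-generated code, and the digit-class formula is one straightforward choice of bitwise logic on the basis streams:

```python
def basis_bits(data: bytes):
    """Transpose a byte stream into eight parallel bit streams:
    bit j of stream i is bit i of byte j (streams held as Python ints)."""
    streams = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if byte >> i & 1:
                streams[i] |= 1 << j
    return streams

def show(stream, n):
    """Render positions left to right, with '.' for 0, as in the figures."""
    return ''.join('1' if stream >> j & 1 else '.' for j in range(n))

data = b"a453z--b3z--az--a12949z--ca22z7--"
B = basis_bits(data)
# Character class [0-9] by bitwise logic: ASCII digits are 0x30..0x39,
# i.e. high nybble 0011 and low nybble at most 9.
digits = (~B[7] & ~B[6] & B[5] & B[4]) & ~(B[3] & (B[2] | B[1]))
assert show(digits, len(data)) == '.111....1........11111.....11.1..'
```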
368input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
369$M_1$ & \verb`.1...........1...1.........1.....`\\
370$M_2$ & \verb`.1111........1...111111....111...`\\
371$M_3$ & \verb`.....1........1.....1.11......1..`
375\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
379\paragraph*{Marker Streams}  Now consider how bit-parallel data
380streams can be used in regular expression matching.   Consider
381the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
383the regular expression \verb:a[0-9]*[z9]:. 
384Note that this is an ambiguous regular expression, which could match
385texts such as \verb:a12949z: in multiple ways.
The matching process involves the concept of {\em marker streams}, that is,
streams that mark the positions of current matches during the
388overall process.  In this case there are three marker streams computed
389during the match process, namely,
390$M_1$ representing match positions after an initial \verb:a:
391character has been found, $M_2$ representing positions
392reachable from positions marked by $M_1$ by further matching zero or
393more digits (\verb:[0-9]*:) and finally $M_3$ the stream
394marking positions after a final \verb:z: or \verb:9: has been found.
Deferring for now the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Our convention is that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.
401Note that all three matches from the third occurrence of \verb:a:
402are correctly marked in $M_3$.
405\paragraph*{MatchStar}  MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
410input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
411$M_1$ & \verb`.1...........1...1.........1.....`\\
412$C = \text{\tt [0-9]}$ & \verb`.111....1........11111.....11.1..`\\
413$T_0 = M_1 \wedge C$ & \verb`.1...............1.........1.....`\\
414$T_1 = T_0 + C$ & \verb`....1...1.............1......11..`\\
415$T_2 = T_1 \oplus C$ & \verb`.1111............111111....111...`\\
416$M_2 = T_2 \vee M_1$ & \verb`.1111........1...111111....111...`
420\caption{$M_2 = \text{MatchStar}(M_1, C)$}
425Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
426it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
427conventional presentation of our character data input.   However, this reverses the normal
428order of presentation when considering bitstreams as numeric values.  The key point here is
429that when we perform bitstream addition, we will show bit movement from left-to-right.
430For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
432The first row of the figure is the input data,
433the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.
436In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
437The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
438output marker stream is obtained by ORing $T_2$ with the initial marker stream.
440In general, given a marker stream $M$ and a character class stream $C$,
441the operation of MatchStar is defined by the following equation. 
442\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
443Given a set of initial marker positions, the result stream marks
444all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.
MatchStar differs from ScanThru of the Parabix tool chain in that it
finds all matches, not just the longest match.  This is necessary
for general matching involving possibly ambiguous regular expressions.
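The MatchStar equation can be checked directly by modeling streams as unbounded integers, with bit $j$ of an integer holding character position $j$ so that addition carries bits toward higher positions (rightward in the figures); a minimal Python sketch reproducing the example of the MatchStar figure:

```python
def match_star(M: int, C: int) -> int:
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M on bit-parallel streams."""
    return (((M & C) + C) ^ C) | M

def bits(s: str) -> int:
    """Parse a stream drawn left to right with '.' for 0."""
    return sum(1 << j for j, ch in enumerate(s) if ch == '1')

def draw(v: int, n: int) -> str:
    return ''.join('1' if v >> j & 1 else '.' for j in range(n))

# The example of the figure: M2 = MatchStar(M1, [0-9]).
M1 = bits('.1...........1...1.........1.....')
C  = bits('.111....1........11111.....11.1..')
M2 = match_star(M1, C)
assert draw(M2, 33) == '.1111........1...111111....111...'
```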
456\section{Block-at-a-Time Processing}\label{sec:blockwise}
458The unbounded stream model of the previous section must of course
459be translated an implementation that proceeds block-at-a-time for
460realistic application.  In this, we primarily rely on the Pablo
461compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
462statements expressed as arbitrary-length bitstream equations, Pablo
463produces block-at-a-time C++ code that initializes and maintains all the necessary
464carry bits for each of the additions and shifts involved in the
465bitstream calculations.   
467In the present work, our principal contribution to the Parabix tool
468chain is to incorporate the technique of long-stream addition described below.
469Otherwise, we were able to use Pablo directly in compiling our
470SSE2 and AVX2 implementations.   Our GPGPU implementation required
some scripting to modify the output of the Pablo compiler for our
purposes.
474\paragraph*{Long-Stream Addition}  The maximum word size for
475addition on commodity processors is typically 64 bits.  In order
476to implement long-stream addition for block sizes of 256 or larger,
477a method for propagating carries through the individual stages of
47864-bit addition is required.  However, the normal technique of
479sequential addition using add-with-carry instructions, for example,
480is far from ideal.
482We use the following general model using SIMD methods for constant-time
483long-stream addition up to 4096 bits.   Related GPGPU solutions have been
independently developed \cite{Crovella2012};
however, our model is intended to be a more broadly applicable abstraction.
486We assume the availability of the following SIMD/SIMT operations
487operating on vectors of $f$ 64-bit fields.
489\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
490in two vectors to produce a result vector of $f$ 64-bit fields.
\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
of \verb:X: each with the constant value $-1$ (all bits 1), producing
an $f$-bit mask value,
494\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
495field into a single compressed $f$-bit mask value, and
496\item normal bitwise logic operations on $f$-bit masks, and
497\item  \verb#simd<64>::spread(X)#: distributing the bits of
498an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector.
In this model, the \verb#hsimd<64>::mask(X)# and
\verb#simd<64>::spread(X)# operations model the minimum
503communication requirements between the parallel processing units
504(SIMD lanes or SIMT processors).    In essence, we just need
505the ability to quickly send and receive 1 bit of information
506per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
507gathers 1 bit from each of the processors to a central resource.
After calculations on the gathered bits are performed, we then
509just need an operation to invert the communication, i.e.,
510sending 1 bit each from the central processor to each of
511the parallel units.   There are a variety of ways in which
512these facilities may be implemented depending on the
513underlying architecture; details of our AVX2 and GPGPU implementations
514are presented later.   
516Given these operations, our method for long stream addition of
517two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
520\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]
522\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
523\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
524\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
525\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]
527\item Compute an $f$-bit mask of carries generated for each of the
52864-bit additions of \verb:X: and \verb:Y:.
529\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]
531\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
532an incoming carry bit.  This is called the {\em bubble mask}.
533\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
535\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar.
538\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]
540This is the key step.  The mask {\tt c} of outgoing carries must be
541shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
542with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.   However, if this digit is all ones (as
signaled by the corresponding bit of the bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
546digits that are all ones, then the carry must bubble through
547each of them.   This is just MatchStar.
549\item Compute the final result {\tt Z}.
550\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
556{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
557{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
558{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
559{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
560{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
561{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
562{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
563{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
564{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
565{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 00} & {\tt 00} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
569\caption{Long Stream Addition}\label{fig:longadd}
572Figure \ref{fig:longadd} illustrates the process.  In the figure,
573we illustrate the process with 8-bit fields rather than 64-bit fields
574and show all field values in hexadecimal notation.  Note that
575two of the individual 8-bit additions produce carries, while two
576others produce {\tt FF} values that generate bubble bits.  The
577net result is that four of the original 8-bit sums must be
578incremented to produce the long stream result.
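The six steps and the worked example of Figure \ref{fig:longadd} can be checked with a small Python simulation, using $f = 8$ fields of 8 bits as in the figure; the function names here (\verb:long_add:, \verb:mask:, \verb:spread:) are illustrative stand-ins for the SIMD operations of the model, with field 0 the least significant:

```python
F, W = 8, 8                      # field count, field width in bits
FIELD_MAX = (1 << W) - 1

def fields(vec):
    return [(vec >> (W * k)) & FIELD_MAX for k in range(F)]

def simd_add(X, Y):
    """Independent per-field additions; carries between fields discarded."""
    return sum(((x + y) & FIELD_MAX) << (W * k)
               for k, (x, y) in enumerate(zip(fields(X), fields(Y))))

def mask(vec):
    """hsimd-style gather: high bit of each field into an f-bit mask."""
    return sum(((vec >> (W * k + W - 1)) & 1) << k for k in range(F))

def spread(m):
    """simd-style spread: one mask bit out to each field's low bit."""
    return sum(((m >> k) & 1) << (W * k) for k in range(F))

def match_star(M, C):
    return (((M & C) + C) ^ C) | M

def long_add(X, Y):
    R = simd_add(X, Y)                       # 1. field-wise sums
    x, y, r = mask(X), mask(Y), mask(R)      # 2. high-bit masks
    c = (x & y) | ((x | y) & ~r)             # 3. per-field carry-outs
    b = sum(1 << k for k, f in enumerate(fields(R))
            if f == FIELD_MAX)               # 4. bubble mask (all-ones fields)
    i = match_star((c * 2) & ((1 << F) - 1), b)   # 5. increment mask
    return simd_add(R, spread(i))            # 6. final result

# The figure's example, fields written most-significant first.
assert long_add(0x1931BA4C3D4521F1, 0x221245B3E2161736) == 0x3B4400001F5B3927
```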
580A slight extension to the process produces a long-stream full adder
581that can be used in chained addition.  In this case, the
582adder must take an additional carry-in bit
583{\tt p} and produce a carry-out bit {\tt q}.
584This may be accomplished by incorporating {\tt p}
585in calculating the increment mask in the low bit position,
586and then extracting the carry-out {\tt q} from the high bit position.
587\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
\[\text{\tt q} = \text{\tt i >> f}\]
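The full-adder extension can likewise be simulated; the sketch below scales down to $f = 4$ fields of 4 bits so the chaining is easy to check by hand, and the names \verb:long_full_add:, \verb:msk:, and \verb:spread: are illustrative, not the authors' implementation:

```python
F, W = 4, 4                             # 4 fields of 4 bits for a small demo
FM = (1 << W) - 1                       # per-field mask

def msk(v):                             # high bit of each field -> f-bit mask
    return sum(((v >> (W * k + W - 1)) & 1) << k for k in range(F))

def spread(m):                          # one mask bit to each field's low bit
    return sum(((m >> k) & 1) << (W * k) for k in range(F))

def match_star(M, C):
    return (((M & C) + C) ^ C) | M

def field_add(X, Y):                    # field-wise add, carries discarded
    return sum(((((X >> (W * k)) & FM) + ((Y >> (W * k)) & FM)) & FM) << (W * k)
               for k in range(F))

def long_full_add(X, Y, p):
    """Return (q, Z): carry-out bit q and the f*W-bit sum Z = X + Y + p."""
    R = field_add(X, Y)
    c = (msk(X) & msk(Y)) | ((msk(X) | msk(Y)) & ~msk(R))
    b = sum(1 << k for k in range(F) if ((R >> (W * k)) & FM) == FM)
    i = match_star(c * 2 + p, b)        # carry-in p enters at the low position
    q = (i >> F) & 1                    # carry-out exits at the high position
    return q, field_add(R, spread(i))

assert long_full_add(0xFFFF, 0x0001, 0) == (1, 0x0000)
assert long_full_add(0x0FFF, 0x0000, 1) == (0, 0x1000)
```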
590As described subsequently, we use a two-level long-stream addition technique
591in both our AVX2 and GPGPU implementations.  In principle, one can extend
592the technique to additional levels.  Using 64-bit adders throughout,
593$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
594A three-level scheme could coordinate
59564 groups each performing 4096-bit long additions in a two-level structure.
596However, whether there are reasonable architectures that can support fine-grained
597SIMT style at this level is an open question.
599Using the methods outlined, it is quite conceivable that instruction
600set extensions to support long-stream addition could be added for
601future SIMD and GPGPU processors.   Given the fundamental nature
602of addition as a primitive and its particular application to regular
603expression matching as shown herein, it seems reasonable to expect
604such instructions to become available.    Alternatively, it may
605be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and
606\verb#simd<64>::spread(X)# operations are efficiently supported.
617\section{GPGPU Implementation}\label{sec:GPU}
619To further assess the scalability of our regular expression matching
620using bit-parallel data streams, we implemented a GPGPU version
621in OpenCL.   
622We arranged for 64 work groups each having 64 threads.
The work group size and the number of work groups were chosen
to provide the best occupancy as calculated by the AMD APP Profiler.
625Input files are divided in data parallel fashion among
626the 64 work groups.  Each work group carries out the regular
627expression matching operations 4096 bytes at a time using SIMT
628processing.   Although the GPGPU
629does not directly support the mask and spread operations required
630by our long-stream addition model,
631we are able to simulate them using shared memory.
632Each thread maintains
633its own carry and bubble values in shared memory and performs
634synchronized updates with the other threads using a six-step
635parallel-prefix style process.  Others have implemented
636long-stream addition on the GPGPU using similar techniques,
637as noted previously.
639We performed our test on an AMD Radeon HD A10-6800K APU machine.
640On the AMD Fusion systems, the input buffer is allocated in
641pinned memory to take advantage of the zero-copy memory regions
642where data can be read directly into this region by the CPU
and also accessed by the GPGPU for further processing. Therefore,
the expensive data transfer time required by traditional
discrete GPGPUs is avoided, and we compare only the kernel execution
646time with our SSE2 and AVX implementations as shown in Figure
\ref{fig:SSE-AVX-GPU}. The GPGPU version gives up to a 55\% performance
improvement over the SSE2 version and up to a 40\% performance
improvement over the AVX2 version. Although we process 4096 bytes
at a time in each of 64 work groups, rather than 128 bytes
at a time with SSE2 or 256 bytes at a time with AVX2, the performance
improvement is less than 60\%. The first reason is hardware
limitations: our kernel occupancy is limited by register usage,
so not all of the work groups can be scheduled at the same time.
655The second reason is that the long-stream addition implemented
656on GPGPU is more expensive than the implementations on SSE or AVX.
657Another important reason is the control flow. When a possible
658match is found in one thread, the rest of the threads in the
659same work group have to execute the same instructions for
660further processing rather than jump to the next block with a
661simple IF test. Therefore, the performance of different
662regular expressions is dependent on the number of
663long-stream addition operations and the total number of matches
664of a given input.   Perhaps surprisingly, the overhead of the Parabix
665transformation was not a dominant factor, coming in at 0.08 ms/MB.
673ylabel=Running Time (ms per megabyte),
675tick label style={font=\tiny},
676enlarge x limits=0.15,
677%enlarge y limits={0.15, upper},
679legend style={at={(0.5,-0.15)},
680anchor=north,legend columns=-1},
682bar width=7pt,
685file {data/ssetime.dat};
687file {data/avxtime.dat};
689file {data/gputime.dat};
695\caption{Running Time}\label{fig:SSE-AVX-GPU}
710%\section{Appendix Title}
712%This is the text of the appendix, if you need one.
715This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
716MITACS, Inc.