\documentclass{sig-alternate-2013}

\newfont{\mycrnotice}{ptmr8t at 7pt}
\newfont{\myconfname}{ptmri8t at 7pt}
\let\crnotice\mycrnotice%
\let\confname\myconfname%

\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.}
\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada.}
\crdata{978-1-4503-2809-8/14/08\ ...\$15.00.\\
http://dx.doi.org/10.1145/2628071.2628079}

\clubpenalty=10000
\widowpenalty = 10000
\usepackage[normalem]{ulem}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{multirow}
\usepackage{pgfplots}

\begin{document}

\title{Bitwise Data Parallelism in Regular Expression Matching}

\numberofauthors{1}
\author{
Robert D. Cameron, Thomas C. Shermer, Arrvindh Shriraman,\\
Kenneth S. Herdy, Dan Lin, Benjamin R. Hull, Meng Lin\\
       \affaddr{School of Computing Science}\\
       \affaddr{Simon Fraser University}\\
       \affaddr{Surrey, British Columbia}\\
       \email{\{cameron,shermer,ashriram,ksherdy,lindanl,bhull,linmengl\}@sfu.ca}
}

\date{}
\maketitle

\thispagestyle{empty}

\sloppy

\begin{abstract}
\input{abstract}
\end{abstract}
\category{D.1.3}{Programming Techniques}{Concurrent Programming}[parallel programming]
\category{D.3.3}{Programming Languages}{Processors}[compilers]
\category{F.2.2}{Analysis of Algorithms and Problem Complexity}{Nonnumerical Algorithms and Problems}[pattern matching]
\keywords{regular expression matching; parallel bit streams}
\section{Introduction}

The use of regular expressions to search texts
for patterns has a long history and
remains an important technique.
Thompson \cite{thompson1968} is credited with the
first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach,
a regular expression of length $m$ is converted
to an NFA with $O(m)$ states.  Using the
NFA it is possible to search a text
of length $n$ in $O(mn)$ time.
Frequently, a more efficient choice
is to convert an NFA into a deterministic finite automaton (DFA).
A DFA maintains a single active state
throughout the matching process and
hence, using a DFA, it is possible to
search a text of length $n$ in $O(n)$ time\footnote{It is well
known that the conversion of an NFA to an equivalent DFA may result
in \textit{state explosion}, i.e., the number of resultant DFA states may increase exponentially.}.

A significant proportion of the research
in fast regular expression matching can be
regarded as the quest for ``efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet
presented a new approach to string search based on bit-level parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word.  Thus,
given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
Building on this observation, the Shift-Or algorithm simulates an NFA using
bitwise operations and achieves $O(\frac{nm}{w})$ time in the worst case \cite{navarro2000}.
A disadvantage of the Shift-Or approach
is an inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical},
skip input characters
to achieve sublinear times in the average case.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the advantages of the Shift-Or approach
with the ability to skip characters.
The nrgrep tool is based on the BNDM algorithm.  It is generally
considered the fastest grep tool
for matching complex patterns, and achieves performance similar
to that of the fastest existing string
matching tools for simple patterns \cite{navarro2000}.

Recently, there has been considerable
interest in the use of
parallel hardware such as multicore processors (CPUs),
graphics processing units (GPUs),
field-programmable gate arrays (FPGAs),
and specialized architectures such as
the Cell Broadband Engine (Cell BE) to
accelerate regular expression matching.
Generally, speedups are achieved by
using parallel hardware features to improve the throughput of
multiple instances of a matching problem at a time,
i.e., by matching against sets of patterns or multiple input streams.
In contrast, our approach uses parallelism to accelerate the throughput of
a single problem instance: a single regular expression
matched against a single input stream.

In related work targeting multicore hardware,
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be executed efficiently.
Pasetto et al.\ \cite{pasetto2010} presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware.
Naghmouchi et al.\ \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited to parallel
implementation on multicore CPUs, GPUs and the Cell BE.
Salapura et al.\ \cite{salapura2012accelerating} advocated
the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multicore processors to achieve a speedup greater than 1.8 over a
range of data sizes.
On the Cell Broadband Engine, Scarpazza and Russell \cite{scarpazza2009cell}
described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
Gbps for a larger dictionary containing thousands of patterns.  Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieved
4 Gbps on the Cell BE.
On GPUs, Tumeo et al.\ \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching.  Lin et al.\ proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware,
achieving 143 Gbps raw data throughput,
although system throughput was limited to 15 Gbps
\cite{lin2013accelerating}.
Most recently, Mytkowicz et al.\ \cite{DBLP:conf/asplos/MytkowiczMS14}
have developed a method for combining
SIMD parallelism and data parallelism on multicore hardware.
Among these related works, the latter stands
out in that it, too, focuses on accelerating
matching against a single input stream.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive, MatchStar,
that can be used in full generality for nondeterministic
regular expression matching.  We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.  We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU wavefronts of 64 threads.

There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new},
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: they represent all possible current NFA states
as a bit vector and perform parallel transitions to a new
set of states using table lookups and bitwise logic.
Whereas our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, with the BNDM extensions, the
nrgrep \cite{navarro2000} program remains
among the strongest competitors in regular expression matching
performance, so we include it in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
primitive using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques, including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit addition with GPU SIMT.
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on the Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expressions and Grep}\label{sec:grep}

We follow common POSIX notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.  Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.  The special characters must be escaped
with a backslash to prevent interpretation as a metacharacter; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example, \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated; thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation; thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms; thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this; thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene star operator \verb:*: specifies zero or more occurrences
matching the previous pattern, while Kleene plus \verb:+: specifies
one or more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero or one occurrence of strings matching
the element.
Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.

The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match respectively the beginning or the
end of a line.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.

\section{Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.  In essence, data streams are viewed
as (very large) integers.  The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.  Within each block all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.  On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.
In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
On the
latest processors supporting the 256-bit AVX2 SIMD operations,
we also use the Parabix tool chain, but substitute a parallelized
long-stream addition technique to avoid the sequential chaining
of carries between 64-bit fields.
Our GPU implementation uses scripts to modify the output
of the Parabix tools, effectively dividing the input into blocks
of 4K bytes, with each block processed by a group of 64 threads working in
SIMT fashion.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a453z--b3z--az--a12949z--ca22z7--:\\
$B_7$ & \verb:.................................:\\
$B_6$ & \verb:1...1..1.1..11..1.....1..11..1...:\\
$B_5$ & \verb:111111111111111111111111111111111:\\
$B_4$ & \verb:.1111...11...1...111111....1111..:\\
$B_3$ & \verb:....111..111.111...1.1111....1.11:\\
$B_2$ & \verb:.11..11...11..11....1..11.....111:\\
$B_1$ & \verb:...11..111...1....1...1..1.1111..:\\
$B_0$ & \verb:1.11.11.1.111.1111.1.1.1111...111:\\
\verb:[a]: & \verb:1...........1...1.........1......:\\
\verb:[z9]: & \verb:....1....1...1.....1.11......1...:\\
\verb:[0-9]: & \verb:.111....1........11111.....11.1..:
\end{tabular}

\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.  Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.  However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.  In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.  These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams such that each bit $j$ of the stream specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z9]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.

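Transposition and character-class formation can be modeled in a few lines of Python, with an arbitrary-precision integer standing in for each unbounded bit stream (bit $j$ of the integer is stream position $j$).  This is an illustrative sketch, not the Parabix implementation, and the particular bitwise formula used for \verb:[0-9]: is one straightforward choice:

```python
def transpose(data: bytes):
    """Form the eight basis bit streams: bit j of stream B[i]
    is bit i of byte j of the input."""
    basis = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if (byte >> i) & 1:
                basis[i] |= 1 << j
    return basis

def show(stream: int, n: int) -> str:
    """Display positions 0..n-1 left to right, '.' for zero bits."""
    return ''.join('1' if (stream >> j) & 1 else '.' for j in range(n))

data = b"a453z--b3z--az--a12949z--ca22z7--"
B = transpose(data)
everything = (1 << len(data)) - 1    # confines ~ to the stream length

# [0-9] (bytes 0x30-0x39) by bitwise logic on the basis streams:
# the high nybble must be 0011 and the low nybble must not exceed 9.
digits = (~B[7] & ~B[6] & B[5] & B[4]
          & ~(B[3] & (B[2] | B[1]))) & everything

print(show(digits, len(data)))   # .111....1........11111.....11.1..
```

The printed stream reproduces the \verb:[0-9]: row of Figure \ref{fig:streams}.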
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a453z--b3z--az--a12949z--ca22z7--:\\
$M_1$ & \verb:.1...........1...1.........1.....:\\
$M_2$ & \verb:.1111........1...111111....111...:\\
$M_3$ & \verb:.....1........1.....1.11......1..:
\end{tabular}

\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams}  Now consider how bit-parallel data
streams can be used in regular expression matching.  Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*[z9]:.
Note that this is an ambiguous regular expression, which could match
texts such as \verb:a12949z: in multiple ways.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.  In this case there are three marker streams computed
during the match process, namely
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:), and finally $M_3$, the stream
marking positions after a final \verb:z: or \verb:9: has been found.
Deferring for the moment the details of how these streams are computed,
Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Our convention is that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.
Note that all three matches from the third occurrence of \verb:a:
are correctly marked in $M_3$.


\paragraph*{MatchStar}  MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a453z--b3z--az--a12949z--ca22z7--:\\
$M_1$ & \verb:.1...........1...1.........1.....:\\
$C = \text{\tt [0-9]}$ & \verb:.111....1........11111.....11.1..:\\
$T_0 = M_1 \wedge C$ & \verb:.1...............1.........1.....:\\
$T_1 = T_0 + C$ & \verb:....1...1.............1......11..:\\
$T_2 = T_1 \oplus C$ & \verb:.1111............111111....111...:\\
$M_2 = T_2 \vee M_1$ & \verb:.1111........1...111111....111...:
\end{tabular}

\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, C)$}
\label{fig:matchstar}
\end{figure}


Figure \ref{fig:matchstar} illustrates the MatchStar method.  In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order, reflecting the
conventional presentation of our character data input.  However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that when we perform bitstream addition, we will show bit movement from left to right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

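This left-to-right convention is easy to check with a small model that treats bit $j$ of a Python arbitrary-precision integer as stream position $j$ (an illustrative sketch only):

```python
def bits(picture: str) -> int:
    """Parse a left-to-right stream picture ('.' = 0) into an
    integer whose bit j is the picture's j-th position."""
    return sum(1 << j for j, ch in enumerate(picture) if ch == '1')

def show(v: int, n: int) -> str:
    """Inverse of bits(): display positions 0..n-1 left to right."""
    return ''.join('1' if (v >> j) & 1 else '.' for j in range(n))

# Carries move toward higher bit positions, i.e. rightward in the picture:
assert show(bits("111.") + bits("1..."), 4) == "...1"
```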
The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions, using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were unaffected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained, and the undesired 1 bits removed, by XORing with the character class stream.  $T_2$ is now only missing the marker positions that were removed in the first step as well as the marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
$\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.

MatchStar differs from ScanThru of the Parabix tool chain in that it
finds all matches, not just the longest match.  This is necessary
for general matching involving possibly ambiguous regular
expressions.

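The entire MatchStar computation can be illustrated with a few lines of Python, using arbitrary-precision integers as bit streams (bit $j$ of an integer is stream position $j$, so addition carries move rightward in the left-to-right stream pictures of the figures).  This is a sketch for illustration, not the Parabix implementation:

```python
def match_star(m: int, c: int) -> int:
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M: all positions
    reachable from markers in M by advancing through zero or more
    class-C positions."""
    return (((m & c) + c) ^ c) | m

def bits(picture: str) -> int:
    # Bit j of the integer is position j of the left-to-right picture.
    return sum(1 << j for j, ch in enumerate(picture) if ch == '1')

def show(v: int, n: int) -> str:
    return ''.join('1' if (v >> j) & 1 else '.' for j in range(n))

# The worked example from the text: markers after each 'a', class [0-9].
M1 = bits(".1...........1...1.........1.....")
C  = bits(".111....1........11111.....11.1..")
M2 = match_star(M1, C)
assert show(M2, 33) == ".1111........1...111111....111..."
```

Note that each marker advances through its entire run of digits in a single addition, independent of the run length.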
\input{compilation}


\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  In this, we primarily rely on the Pablo
compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the Parabix tool
chain is to incorporate the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.  Our GPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 bits or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential carry propagation from one 64-bit word to the next
is far from ideal.

We use the following general model using SIMD methods for constant-time
long-stream addition up to 4096 bits.  Related GPU solutions have been
independently developed \cite{Crovella2012};
however, our model is intended to be a more broadly applicable abstraction.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields:
\begin{itemize}
\item  \verb#simd<64>::add(X, Y)#:  vertical addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value,
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(X)#: distributing the bits of
an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
\end{itemize}

This model is designed to minimize the
communication requirements between the parallel processing units
(SIMD lanes or SIMT processors).  In essence, we just need
the ability to quickly send and receive 1 bit of information
per parallel unit.  The \verb#hsimd<64>::mask(X)# operation
gathers 1 bit from each of the processors to a central resource.
After calculations on the gathered bits are performed, we then
just need an operation to invert the communication, i.e.,
sending 1 bit each from the central processor to each of
the parallel units.  There are a variety of ways in which
these facilities may be implemented depending on the
underlying architecture; details of our AVX2 and GPU implementations
are presented later.

Given these operations, our method for long stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
$\text{\tt R} = \text{\tt simd<64>::add(X, Y)}$

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
$\text{\tt x} = \text{\tt hsimd<64>::mask(X)}$
$\text{\tt y} = \text{\tt hsimd<64>::mask(Y)}$
$\text{\tt r} = \text{\tt hsimd<64>::mask(R)}$

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
$\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})$

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is called the {\em bubble mask}.
$\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar.
$\text{\tt i} = \text{\tt MatchStar(c*2, b)}$

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.  However, if this digit is all ones (as
signaled by the corresponding bit of bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.  This is just MatchStar.

\item Compute the final result {\tt Z}.
$\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}$

\end{enumerate}
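As a concrete check, the six steps above can be simulated with ordinary Python integers; the sketch below uses $f = 8$ fields of 8 bits (matching the worked figure example) rather than 64-bit fields, and compares the result against plain big-integer addition.  It is an illustrative model, not SIMD code; the helper names merely echo the operations listed above:

```python
# Model of long-stream addition with f fields of W bits each.
F, W = 8, 8                     # the paper uses 64-bit fields; 8x8 matches the figure
FIELD = (1 << W) - 1
TOTAL = (1 << (F * W)) - 1

def add(x, y):
    """Field-wise modular addition, as in simd<64>::add(X, Y)."""
    z = 0
    for k in range(F):
        xs = (x >> (k * W)) & FIELD
        ys = (y >> (k * W)) & FIELD
        z |= ((xs + ys) & FIELD) << (k * W)
    return z

def mask(x):
    """Gather the high bit of each field, as in hsimd<64>::mask(X)."""
    return sum(((x >> (k * W + W - 1)) & 1) << k for k in range(F))

def eq_ones(x):
    """f-bit mask of all-ones fields, as in simd<64>::eq(X, -1)."""
    return sum((((x >> (k * W)) & FIELD) == FIELD) << k for k in range(F))

def spread(i):
    """Distribute one mask bit to each field, as in simd<64>::spread(X)."""
    return sum(((i >> k) & 1) << (k * W) for k in range(F))

def match_star(m, c):
    return (((m & c) + c) ^ c) | m

def long_add(x, y):
    r = add(x, y)                                  # 1. field-wise sums
    xm, ym, rm = mask(x), mask(y), mask(r)         # 2. high-bit masks
    c = (xm & ym) | ((xm | ym) & ~rm)              # 3. carry mask
    b = eq_ones(r)                                 # 4. bubble mask
    i = match_star(c * 2, b) & ((1 << F) - 1)      # 5. increment mask
    return add(r, spread(i))                       # 6. final result

# The worked example: X = 19 31 BA 4C 3D 45 21 F1, Y = 22 12 45 B3 E2 16 17 36.
X, Y = 0x1931BA4C3D4521F1, 0x221245B3E2161736
assert long_add(X, Y) == (X + Y) & TOTAL == 0x3B4400001F5B3927
```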
\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long-Stream Addition}
\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process using 8-bit fields
rather than 64-bit fields, with all field values shown in
hexadecimal notation.  Note that
two of the individual 8-bit additions produce carries, while two
others produce {\tt FF} values that generate bubble bits.  The
net result is that four of the original 8-bit sums must be
incremented to produce the long-stream result.

A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the adder
must also accept a carry-in bit
{\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
$\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}$
$\text{\tt q} = \text{\tt i >}\text{\tt > f}$

As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to further levels; in general,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT style processing at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction
set extensions supporting long-stream addition could be added to
future SIMD and GPU processors.   Given the fundamental nature
of addition as a primitive and its particular application to regular
expression matching as shown herein, it seems reasonable to expect
such instructions to become available.    Alternatively, it may
be worthwhile to simply ensure that the \verb#hmask# and
related mask and spread operations are supported efficiently.

\input{sse2}

\input{analysis}

\input{avx2}


\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPU version
in OpenCL.
We arranged for 64 work groups each having 64 threads.
The work group size and the number of work groups were chosen
to provide the best occupancy as calculated by the AMD App Profiler.
Input files are divided in data parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.   Although the GPU does not directly support the
mask and spread operations of our long-stream addition model,
we are able to simulate them using shared memory.  Each thread maintains
its own carry and bubble values in shared memory and performs
the mask gathering and increment spreading in a
parallel-prefix style process.  Others have implemented
long-stream addition on the GPU using similar techniques,
as noted previously.

We performed our tests on an AMD A10-6800K APU machine.
On AMD Fusion systems, the input buffer is allocated in
pinned memory to take advantage of the zero-copy memory regions:
data can be read directly into this region by the CPU
and also accessed by the GPU for further processing. Therefore,
the expensive data transfer time required by traditional
discrete GPUs is avoided, and we compare only the kernel execution
time with our SSE2 and AVX2 implementations, as shown in Figure
\ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
improvement over the SSE2 version and up to 40\% performance
improvement over the AVX2 version.   However, because of the
implementation complexities of the triply-nested while loop for
the StarHeight expression, that expression has been omitted from the
GPU implementation.

Although we process
64 work groups with 4096 bytes each at a time, rather than 128 bytes
at a time with SSE2 or 256 bytes at a time with AVX2, the performance
improvement is less than 60\%. The first reason is hardware
limitations: our kernel occupancy is limited by register usage,
so not all the work groups can be scheduled at the same time.
The second reason is that the long-stream addition implemented
on the GPU is more expensive than the implementations on SSE2 or AVX2.
Another important reason is control flow. When a possible
match is found in one thread, the rest of the threads in the
same work group have to execute the same instructions rather
than skipping ahead with a
simple IF test. Therefore, the performance on different
regular expressions depends on the number of
long-stream addition operations and the total number of matches
in a given input.   Perhaps surprisingly, the overhead of the Parabix
transformation was not a dominant factor, coming in at 0.08 ms/MB.


\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URI,Hex,StarHeight},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
cycle list = {black,black!70,black!40,black!10}
]
\addplot+[] file {data/ssetime.dat};
\addplot+[] file {data/avxtime.dat};
\addplot+[] file {data/gputime.dat};

\legend{SSE2,AVX2,GPU,Annot}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}
\input{conclusion}



%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

\section{Acknowledgments}
This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

\bibliographystyle{abbrv}
\bibliography{reference}

\end{document}