% source: docs/Working/re/pact051-cameron.tex@3889
\documentclass{sig-alternate-2013}

\newfont{\mycrnotice}{ptmr8t at 7pt}
\newfont{\myconfname}{ptmri8t at 7pt}
\let\crnotice\mycrnotice%
\let\confname\myconfname%

\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.}
\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada.}
\crdata{978-1-4503-2809-8/14/08\ ...\$15.00.\\
http://dx.doi.org/10.1145/2628071.2628079}

\clubpenalty=10000
\widowpenalty = 10000
\usepackage[normalem]{ulem}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{multirow}
\usepackage{pgfplots}

\begin{document}

\title{Bitwise Data Parallelism in Regular Expression Matching}

\numberofauthors{1}
\author{
%\alignauthor
Rob Cameron, Tom Shermer, Arrvindh Shriraman, Ken Herdy, Dan Lin, Ben Hull, Meng Lin\\
\affaddr{School of Computing Science}\\
\affaddr{Simon Fraser University}\\
\affaddr{Surrey, British Columbia}\\
\email{\{cameron,shermer,ashrira,ksherdy,lindanl,bhull,linmengl\}@sfu.ca}
}

\date{}
\maketitle

\thispagestyle{empty}

\begin{abstract}
\input{abstract}
\end{abstract}

\section{Introduction}

The use of regular expressions to search texts
for patterns has a long history and
remains an important technique.
% {\em a brief history}
%The origins of regular expression matching date back to automata theory
%studied by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the
first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach,
a regular expression of length $m$ is converted
to an NFA with $O(m)$ states.  Using the
NFA it is possible to search a text
of length $n$ in $O(mn)$ time.
Frequently, a more efficient choice
is to convert an NFA into a deterministic finite automaton (DFA).
A DFA maintains a single active state
throughout the matching process and
hence, using a DFA, it is possible to
search a text of length $n$ in $O(n)$ time\footnote{It is well
known that the conversion of an NFA to an equivalent DFA may result
in \textit{state explosion}, i.e., the number of resultant DFA states may increase exponentially.}.

A significant proportion of the research
in fast regular expression matching can be
regarded as the quest for ``efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet
presented a new approach to string search based on bit-level parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word.  Thus,
given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
Building on this observation, the Shift-Or algorithm simulates an NFA using
bitwise operations and achieves $O(\frac{nm}{w})$ time in the worst case \cite{navarro2000}.
A disadvantage of the Shift-Or approach
is an inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical},
skip input characters
to achieve sublinear times in the average case.
% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
% based on suffix automata are able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the advantages of the Shift-Or approach
with the ability to skip characters. %character skipping property of BDM algorithms.
The nrgrep tool is based on the BNDM algorithm.  It is generally
considered the fastest grep tool
for matching complex patterns, and achieves performance similar
to the fastest existing string
matching tools for simple patterns \cite{navarro2000}.

Recently, there has been considerable
interest in the use of
parallel hardware such as multicore processors (CPUs),
graphics processing units (GPUs),
field-programmable gate arrays (FPGAs),
or specialized architectures such as
the Cell Broadband Engine (Cell BE) to
accelerate regular expression matching.
% FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
Generally, speedups are achieved by
using parallel hardware features to improve the throughput of
multiple instances of a matching problem at a time,
i.e., by matching against sets of patterns or multiple input streams.
In contrast, our approach uses parallelism to accelerate the throughput of
a single problem instance, i.e., a single regular expression
matched against a single input stream.

%\paragraph{Multicore}
In related work targeting multicore hardware,
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be executed efficiently.
Pasetto et al \cite{pasetto2010} presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware.
Naghmouchi et al \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well-suited for parallel
implementation on multicore CPUs, GPUs and the Cell BE.
% On each system, both thread-level parallelism (cores) and data-level parallelism
% (SIMD units) were leveraged for performance.
Salapura et al \cite{salapura2012accelerating} advocated
the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multicore processors to achieve a speedup of greater than 1.8 over a
range of data sizes.
%\paragraph{Cell Broadband Engine}
On the Cell Broadband Engine, Scarpazza and Russell
%presented a SIMD tokenizer
%that delivered 1.00--1.78 Gbps on a single
%chip \cite{scarpazza2008} and extended this approach for emulation on the Intel Larrabee
%instruction set \cite{scarpazza2009larrabee}.
%Furthermore,
\cite{scarpazza2009cell}
described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
Gbps for a larger dictionary containing thousands of patterns.  Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieved
4 Gbps on the Cell BE.
On GPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching.
Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps raw data throughput,
although system throughput was limited to 15 Gbps
\cite{lin2013accelerating}.
Most recently, Mytkowicz et al \cite{DBLP:conf/asplos/MytkowiczMS14}
have developed a method for combining
SIMD parallelism and data parallelism on multicore hardware.
Among these related works, this last approach stands
out in that it also focuses on accelerating
matching against a single input stream.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.  We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.  We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU wavefronts of 64 threads.

There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.  Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
primitive using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques including the long stream
addition techniques for 256-bit addition with AVX2 and
4096-bit additions with GPU SIMT.
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on the Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expressions and Grep}\label{sec:grep}

We follow common POSIX notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.  Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.  The special characters must be escaped
with a backslash to prevent interpretation as a metacharacter; thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated, thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation, thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms, thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this, thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene star operator \verb:*: specifies zero-or-more occurrences
matching the previous pattern, while Kleene plus \verb:+: specifies
one-or-more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero-or-one occurrence of strings matching
the element.
Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.

The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match the beginning or the
end of a line, respectively.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.
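As a concrete illustration of these counting semantics, the following Python sketch emulates the behaviour of \verb:grep -c: using Python's \verb:re: module (the sample lines are hypothetical; this merely stands in for the grep implementations evaluated later):

```python
import re

# Illustrative sketch only: Python's re module stands in for a grep
# implementation.  The pattern is the running example used later in
# the paper; the input lines are hypothetical.
pattern = re.compile(r"a[0-9]*[z9]")
lines = ["a453z--b3z--az--", "no digits here", "ca22z7--"]

# grep -c semantics: count matching lines, not individual matches.
count = sum(1 for line in lines if pattern.search(line))
print(count)
```

Note that a line with several matches still contributes exactly one to the count.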

\section{Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.  In essence, data streams are viewed
as (very large) integers.  The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.  Within each block all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.  On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.
In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
On the
latest processors supporting the 256-bit AVX2 SIMD operations,
we also use the Parabix tool chain, but substitute a parallelized
long-stream addition technique to avoid the sequential chaining
of 64-bit additions.
Our GPU implementation uses scripts to modify the output
of the Parabix tools, effectively dividing the input into blocks
of 4K bytes.  Each block is processed by a group of 64 threads
working together in
SIMT fashion.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
$B_7$ & \verb`.................................`\\
$B_6$ & \verb`1...1..1.1..11..1.....1..11..1...`\\
$B_5$ & \verb`111111111111111111111111111111111`\\
$B_4$ & \verb`.1111...11...1...111111....1111..`\\
$B_3$ & \verb`....111..111.111...1.1111....1.11`\\
$B_2$ & \verb`.11..11...11..11....1..11.....111`\\
$B_1$ & \verb`...11..111...1....1...1..1.1111..`\\
$B_0$ & \verb`1.11.11.1.111.1111.1.1.1111...111`\\
\verb:[a]: & \verb`1...........1...1.........1......`\\
\verb:[z9]: & \verb`....1....1...1.....1.11......1...`\\
\verb:[0-9]: & \verb`.111....1........11111.....11.1..`
\end{tabular}

\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.  Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.  However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.  In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.  These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams, in which bit $j$ of the stream specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z9]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.
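To make the transposition concrete, the following Python sketch (a simple bit-at-a-time model, not the SIMD transposition actually performed by the Parabix tool chain) computes the eight basis bit streams of a byte stream as arbitrary-precision integers, with bit $j$ of each stream corresponding to character position $j$, and then forms the \verb:[0-9]: class stream by bitwise logic on the basis streams:

```python
def basis_bits(data: bytes):
    # basis[i] holds bit i of every byte: bit j of basis[i] is set
    # exactly when bit i of byte j is set.  (A bit-at-a-time model of
    # the transposition performed by the Parabix tool chain.)
    basis = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if (byte >> i) & 1:
                basis[i] |= 1 << j
    return basis

def digit_class(data: bytes):
    # [0-9] is ASCII 0x30-0x39: high nibble 0011, low nibble 0-9.
    mask = (1 << len(data)) - 1
    B = basis_bits(data)
    high = ~B[7] & ~B[6] & B[5] & B[4] & mask
    low = (~B[3] | (~B[2] & ~B[1])) & mask  # low nibble 0-7, 8 or 9
    return high & low

data = b"a453z--b3z--az--a12949z--ca22z7--"
marks = digit_class(data)
# Render with '.' for 0 bits, as in the basis stream figure.
print("".join("1" if (marks >> j) & 1 else "." for j in range(len(data))))
```

The printed stream matches the \verb:[0-9]: row of Figure \ref{fig:streams}.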

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
$M_1$ & \verb`.1...........1...1.........1.....`\\
$M_2$ & \verb`.1111........1...111111....111...`\\
$M_3$ & \verb`.....1........1.....1.11......1..`
\end{tabular}

\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams}  Now consider how bit-parallel data
streams can be used in regular expression matching.  Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*[z9]:.
Note that this is an ambiguous regular expression, which could match
texts such as \verb:a12949z: in multiple ways.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.  In this case there are three marker streams computed
during the match process, namely
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:), and finally $M_3$ the stream
marking positions after a final \verb:z: or \verb:9: has been found.
Without describing for the time being the details of how these streams
are computed, Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Our convention is that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.
Note that all three matches from the third occurrence of \verb:a:
are correctly marked in $M_3$.


\paragraph*{MatchStar}  MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb`a453z--b3z--az--a12949z--ca22z7--`\\
$M_1$ & \verb`.1...........1...1.........1.....`\\
$C = \text{\tt [0-9]}$ & \verb`.111....1........11111.....11.1..`\\
$T_0 = M_1 \wedge C$ & \verb`.1...............1.........1.....`\\
$T_1 = T_0 + C$ & \verb`....1...1.............1......11..`\\
$T_2 = T_1 \oplus C$ & \verb`.1111............111111....111...`\\
$M_2 = T_2 \vee M_1$ & \verb`.1111........1...111111....111...`
\end{tabular}

\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, C)$}
\label{fig:matchstar}
\end{figure}

Figure \ref{fig:matchstar} illustrates the MatchStar method.  In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
conventional presentation of our character data input.  However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that when we perform bitstream addition, we will show bit movement from left to right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from the input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream.  $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
$\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.

MatchStar differs from ScanThru of the Parabix tool chain in that it
finds all matches, not just the longest match.  This is necessary
for general matching involving possibly ambiguous regular
expressions.
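Since unbounded bitstreams can be modeled directly as arbitrary-precision integers, MatchStar can be prototyped in a few lines.  The following Python sketch (an illustrative model, not the Parabix implementation) reproduces the streams of Figures \ref{fig:streams2} and \ref{fig:matchstar}; the helper reverses the display order, since the figures show streams left-to-right while integer addition carries toward higher bit positions:

```python
def matchstar(M, C):
    # MatchStar(M, C) = (((M & C) + C) ^ C) | M, with bitstreams
    # modeled as arbitrary-precision integers.
    return (((M & C) + C) ^ C) | M

def bits(s):
    # Convert the paper's left-to-right stream notation ('.' = 0) into
    # an integer whose bit j corresponds to stream position j.
    return sum(1 << j for j, ch in enumerate(s) if ch == "1")

# Streams for a[0-9]*[z9] on input a453z--b3z--az--a12949z--ca22z7--.
M1 = bits(".1...........1...1.........1.....")
C  = bits(".111....1........11111.....11.1..")
M2 = matchstar(M1, C)
assert M2 == bits(".1111........1...111111....111...")

# Completing the match: advance through a final z or 9.
Z9 = bits("....1....1...1.....1.11......1...")
M3 = (M2 & Z9) << 1
assert M3 == bits(".....1........1.....1.11......1..")
```

The assertions confirm that the computed $M_2$ and $M_3$ agree with the marker streams shown in the figures, including all three matches from the third \verb:a:.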

\input{compilation}

\input{re-Unicode}

\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  In this, we primarily rely on the Pablo
compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the Parabix tool
chain is to incorporate the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.  Our GPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential carry propagation through chained additions
is far from ideal.

We use the following general model using SIMD methods for constant-time
long-stream addition up to 4096 bits.  Related GPU solutions have been
independently developed \cite{Crovella2012};
however, our model is intended to be a more broadly applicable abstraction.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item  \verb#simd<64>::add(X, Y)#:  vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value,
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(X)#: distributing the bits of
an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
\end{itemize}

In this model, the \verb#hsimd<64>::mask(X)# and \verb#simd<64>::spread(X)#
operations capture the minimal
communication requirements between the parallel processing units
(SIMD lanes or SIMT processors).  In essence, we just need
the ability to quickly send and receive 1 bit of information
per parallel unit.  The \verb#hsimd<64>::mask(X)# operation
gathers 1 bit from each of the processors to a central resource.
After calculations on the gathered bits are performed, we then
just need an operation to invert the communication, i.e.,
sending 1 bit each from the central processor to each of
the parallel units.  There are a variety of ways in which
these facilities may be implemented depending on the
underlying architecture; details of our AVX2 and GPU implementations
are presented later.

Given these operations, our method for long stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
$\text{\tt R} = \text{\tt simd<64>::add(X, Y)}$

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
$\text{\tt x} = \text{\tt hsimd<64>::mask(X)}$
$\text{\tt y} = \text{\tt hsimd<64>::mask(Y)}$
$\text{\tt r} = \text{\tt hsimd<64>::mask(R)}$

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
$\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})$

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is called the {\em bubble mask}.
$\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar.
$\text{\tt i} = \text{\tt MatchStar(c*2, b)}$

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.  However, if this digit is all ones (as
signaled by the corresponding bit of the bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.  This is just MatchStar.

\item Compute the final result {\tt Z}.
$\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}$

\end{enumerate}
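As a cross-check of the six steps, the following Python sketch models the SIMD operations with arbitrary-precision integers and reproduces the worked example of Figure \ref{fig:longadd}, using 8-bit fields in place of 64-bit fields; field 0 denotes the least significant (rightmost) field:

```python
W = 8                     # field width: 8 here, 64 on real hardware
F = 8                     # number of fields per vector
FIELD = (1 << W) - 1
FMASK = (1 << F) - 1

def matchstar(M, C):
    return (((M & C) + C) ^ C) | M

def hmask(V):
    # Model of hsimd::mask: gather the high bit of each field.
    return sum(((v >> (W - 1)) & 1) << k for k, v in enumerate(V))

def long_add(X, Y):
    # Step 1: independent field-wise sums (carries discarded).
    R = [(x + y) & FIELD for x, y in zip(X, Y)]
    # Step 2: high-bit masks of X, Y and R.
    x, y, r = hmask(X), hmask(Y), hmask(R)
    # Step 3: carry-out mask for each field.
    c = (x & y) | ((x | y) & ~r & FMASK)
    # Step 4: bubble mask - fields that overflow on an incoming carry.
    b = sum((1 << k) for k, v in enumerate(R) if v == FIELD)
    # Step 5: increment mask; carries bubble through all-ones fields.
    i = matchstar((c << 1) & FMASK, b)
    # Step 6: spread the increment bits back into the fields.
    return [(v + ((i >> k) & 1)) & FIELD for k, v in enumerate(R)]

# Worked example; field 0 is the least significant field.
X = [0xF1, 0x21, 0x45, 0x3D, 0x4C, 0xBA, 0x31, 0x19]
Y = [0x36, 0x17, 0x16, 0xE2, 0xB3, 0x45, 0x12, 0x22]
Z = long_add(X, Y)
assert Z == [0x27, 0x39, 0x5B, 0x1F, 0x00, 0x00, 0x44, 0x3B]
```

The final assertion matches the {\tt Z} row of the figure, with the two {\tt FF} fields absorbing a bubbled carry.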
\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long Stream Addition}
\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process using 8-bit fields
rather than 64-bit fields, with all field values shown in hexadecimal
notation.  Note that two of the individual 8-bit additions produce
carries, while two others produce {\tt FF} values that generate bubble
bits.  The net result is that four of the original 8-bit sums must be
incremented to produce the long-stream result.

A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the adder must
also accept an incoming carry bit {\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
$\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}$
$\text{\tt q} = \text{\tt i >}\text{\tt > f}$

As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to additional levels; using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT style at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction
set extensions to support long-stream addition could be added to
future SIMD and GPU processors.   Given the fundamental nature
of addition as a primitive and its particular application to regular
expression matching as shown herein, it seems reasonable to expect
such instructions to become available.    Alternatively, it may
be worthwhile to simply ensure that the \verb#hmask# and
\verb#spread# operations are efficiently supported.


\input{sse2}

\input{analysis}

\input{avx2}

\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPU version
in OpenCL.
We arranged for 64 work groups, each having 64 threads.
The work group size and the number of work groups were chosen
to provide the best occupancy as calculated by the AMD App Profiler.
Input files are divided in data-parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.   Although the GPU does not directly support the mask and
spread operations used in long-stream addition,
we are able to simulate them using shared memory.  Each thread maintains
its own carry and bubble values in shared memory and performs
synchronized updates with the other threads using a
parallel-prefix style process.  Others have implemented
long-stream addition on the GPU using similar techniques,
as noted previously.

We performed our tests on an AMD Radeon HD A10-6800K APU machine.
On the AMD Fusion systems, the input buffer is allocated in
pinned memory to take advantage of the zero-copy memory regions:
data can be read directly into this region by the CPU
and also accessed by the GPU for further processing. Therefore,
the expensive data transfer time needed by traditional
discrete GPUs is hidden, and we compare only the kernel execution
time with our SSE2 and AVX implementations, as shown in Figure
\ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
improvement over the SSE version and up to 40\% performance
improvement over the AVX version.   However, because of the
implementation complexity of the triply-nested while loop for
the StarHeight expression, it has been omitted from the GPU results.

Although we process
64 work groups of 4096 bytes each at a time, rather than 128 bytes
at a time with SSE or 256 bytes at a time with AVX, the performance
improvement is less than 60\%. The first reason is hardware
limitations: our kernel occupancy is limited by register usage,
so not all the work groups can be scheduled at the same time.
The second reason is that the long-stream addition implemented
on the GPU is more expensive than the implementations on SSE or AVX.
Another important reason is control flow. When a possible
match is found in one thread, the rest of the threads in the
same work group have to execute the same instructions rather
than skip ahead with a simple IF test. Therefore, the performance on
different regular expressions depends on the number of
long-stream addition operations and the total number of matches
in a given input.   Perhaps surprisingly, the overhead of the Parabix
transformation was not a dominant factor, coming in at 0.08 ms/MB.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URI,Hex,StarHeight},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
\addplot file {data/ssetime.dat};
\addplot file {data/avxtime.dat};
\addplot file {data/gputime.dat};

\legend{SSE2,AVX2,GPU}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}
\input{conclusion}

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

\bibliographystyle{IEEEtranS}
\bibliography{reference}

\end{document}