# source:docs/Working/re/pact051-cameron.tex@3870

Last change on this file since 3870 was 3870, checked in by cameron, 5 years ago

Formatting, PACT publication template

File size: 36.5 KB
Line
1\documentclass{sig-alternate-2013}
2
3\newfont{\mycrnotice}{ptmr8t at 7pt}
4\newfont{\myconfname}{ptmri8t at 7pt}
5\let\crnotice\mycrnotice%
6\let\confname\myconfname%
7
8\permission{Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.}
9\conferenceinfo{PACT'14,}{August 24--27, 2014, Edmonton, AB, Canada. \\
12\crdata{978-1-4503-2809-8/14/08\ ...\15.00.\\ 13http://dx.doi.org/10.1145/2628071.2628079} 14 15\clubpenalty=10000 16\widowpenalty = 10000 17\usepackage[normalem]{ulem} 18\usepackage{amssymb} 19\usepackage{amsmath} 20\usepackage{graphicx} 21\usepackage{tikz} 22\usepackage{multirow} 23\usepackage{pgfplots} 24 25\begin{document} 26 27\title{Bitwise Data Parallelism in Regular Expression Matching} 28 29\numberofauthors{7} % in this sample file, there are a *total* 30% of EIGHT authors. SIX appear on the 'first-page' (for formatting 31% reasons) and the remaining two appear in the \additionalauthors section. 32% 33\author{ 34% You can go ahead and credit any number of authors here, 35% e.g. one 'row of three' or two rows (consisting of one row of three 36% and a second row of one, two or three). 37% 38% The command \alignauthor (no curly braces needed) should 39% precede each author name, affiliation/snail-mail address and 40% e-mail address. Additionally, tag each line of 41% affiliation/address with \affaddr, and tag the 42% e-mail address with \email. 43% 44% 1st. author 45\alignauthor 46Rob Cameron\\ 47 \affaddr{School of Computing Science}\\ 48 \affaddr{Simon Fraser University}\\ 49 \affaddr{Surrey, British Columbia}\\ 50 \email{cameron@cs.sfu.ca} 51% 2nd. author 52\alignauthor 53Tom Shermer\\ 54 \affaddr{School of Computing Science}\\ 55 \affaddr{Simon Fraser University}\\ 56 \affaddr{Surrey, British Columbia}\\ 57 \email{shermer@cs.sfu.ca} 58% 3rd. author 59\alignauthor Arrvindh Shriraman\\ 60 \affaddr{School of Computing Science}\\ 61 \affaddr{Simon Fraser University}\\ 62 \affaddr{Surrey, British Columbia}\\ 63 \email{ashriram@cs.sfu.ca} 64\and % use '\and' if you need 'another row' of author names 65% 4th. author 66\alignauthor Ken Herdy\\ 67 \affaddr{School of Computing Science}\\ 68 \affaddr{Simon Fraser University}\\ 69 \affaddr{Surrey, British Columbia}\\ 70 \email{ksherdy@sfu.ca} 71% 5th. author 72\alignauthor Dan Lin\\ 73 \affaddr{School of Computing Science}\\ 74 \affaddr{Simon Fraser University}\\ 75 \affaddr{Surrey, British Columbia}\\ 76 \email{lindanl@sfu.ca} 77% 6th. author 78\alignauthor Ben Hull\\ 79 \affaddr{School of Computing Science}\\ 80 \affaddr{Simon Fraser University}\\ 81 \affaddr{Surrey, British Columbia}\\ 82 \email{bhull@sfu.ca} 83% 7th. author 84\alignauthor Meng Lin\\ 85 \affaddr{School of Computing Science}\\ 86 \affaddr{Simon Fraser University}\\ 87 \affaddr{Surrey, British Columbia}\\ 88 \email{linmengl@sfu.ca} 89} 90% There's nothing stopping you putting the seventh, eighth, etc. 91% author on the opening page (as the 'third row') but we ask, 92% for aesthetic reasons that you place these 'additional authors' 93% in the \additional authors block, viz. 94%\additionalauthors{Additional authors: John Smith (The Th{\o}rv{\"a}ld Group, 95%email: {\texttt{jsmith@affiliation.org}}) and Julius P.~Kumquat 96%(The Kumquat Consortium, email: {\texttt{jpkumquat@consortium.net}}).} 97%\date{30 July 1999} 98% Just remember to make sure that the TOTAL number of authors 99% is the number that will appear on the first page PLUS the 100% number that will appear in the \additionalauthors section. 101 102 103\date{} 104\maketitle 105 106\thispagestyle{empty} 107 108 109\begin{abstract} 110\input{abstract} 111\end{abstract} 112 113\section{Introduction} 114 115The use of regular expressions to search texts for occurrences 116of string patterns has a long history and 117remains a pervasive technique throughout computing applications today. 118% {\em a brief history} 119The origins of regular expression matching date back to automata theory 120developed by Kleene in the 1950s \cite{kleene1951}. 121Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions 122to nondeterministic finite automata (NFA). 123Following Thompson's approach, a regular expression of lengthm$is converted 124to an NFA with$O(m)$states. It is then possible to search a text of length$n$using the 125NFA in worst case$O(mn)$time. Often, a more efficient choice 126is to convert an NFA into a DFA. A DFA has a single active state at any time 127throughout the matching process and 128hence it is possible to search a text of length$n$in$O(n)$time\footnote{It is well 129known that the conversion of an NFA to an equivalent DFA may result 130in state explosion. That is, the number of resultant DFA states may increase exponentially.}. 131 132A significant proportion of the research in fast regular expression matching can be 133regarded as the quest for efficient automata'' \cite{navarro98fastand}. 134In \cite{baeza1992new}, Baeza-Yates and Gonnet 135presented a new approach to string search based on bit-level parallelism. 136This technique takes advantage of the intrinsic parallelism of bitwise operations 137within a computer word. Thus, 138given a$w$-bit word, the number of operations that a string search algorithms 139performs can be reduced by a factor$w$. 140Building on this observation, the Shift-Or algorithm simulates an NFA using 141bitwise operations and achieves$O(\frac{nm}{w})$worst-case time \cite{navarro2000}. 142A disadvantage of the Shift-Or approach 143is an inability to skip input characters. 144Simple string matching algorithms, 145such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters 146to achieve sublinear times in the average case. 147% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text} 148% based on suffix automata are able to skip characters. 149The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 150combines the advantages of the Shift-Or approach 151with the ability to skip characters. %character skipping property of BDM algorithms. 152The nrgrep tool is based on the BNDM algorithm. It is generally 153considered 154the fastest grep tool 155for matching complex patterns, and achieves similar performance 156to the fastest existing string 157matching tools for simple patterns \cite{navarro2000}. 158 159There has been considerable recent 160interest in accelerating regular expression matching 161on parallel hardware 162such as multicore processors (CPUs), 163general purpose graphics processing units (GPGPUs), 164field-programmable gate arrays (FPGAs), 165and specialized architectures such as 166the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory) 167%CPU 168Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that 169text processing algorithms that exhibit irregular memory access patterns 170can be efficiently executed on multicore hardware. 171In related work, Pasetto et al presented a flexible tool that 172performs small-ruleset regular expression matching at a rate of 1732.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}. 174Naghmouchi et al \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC) 175string matching algorithm \cite{aho1975} is well-suited for parallel 176implementation on multicore CPUs, GPGPUs and the Cell BE. 177On each system, both thread-level parallelism (cores) and data-level parallelism 178(SIMD units) were leveraged for performance. 179Salapura et al \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions 180in business analytics applications and leveraged the SIMD hardware available 181on multicore processors to achieve a speedup of greater than 1.8 over a 182range of data sizes of interest. 183%Cell 184In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer 185that delivered 1.00--1.78 Gbps on a single 186Cell BE chip and extended this approach for emulation on the Intel Larrabee 187instruction set \cite{scarpazza2009larrabee}. 188On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching 189implementation that delivered a throughput of 40 190Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4 191Gbps for a larger dictionary containing thousands of patterns. Iorio and van Lunteren \cite{iorio2008} 192presented a string matching implementation for automata that achieved 1934 Gbps on the Cell BE. 194% GPU 195On GPGPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based 196implementation of the AC algorithm for 197accelerating string matching on GPGPUs. Lin et al., proposed 198the Parallel Failureless Aho-Corasick (PFAC) 199algorithm to accelerate pattern matching on GPGPU hardware and 200achieved 143 Gbps raw data throughput, 201although system throughput was limited to 15 Gbps 202\cite{lin2013accelerating}. 203Most recently, Mytkowicz et al have developed a method for combining 204SIMD parallellism and data parallelism 205 206Whereas the existing approaches to parallelization have been 207based on adapting traditional sequential algorithms to emergent 208parallel architectures, we introduce both a new algorithmic 209approach and its implementation on SIMD and GPGPU architectures. 210This approach relies on a bitwise data parallel view of text 211streams as well as a surprising use of addition to match 212runs of characters in a single step. The closest previous 213work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD 214technology together with a parallel scanning primitive also 215based on addition \cite{cameron2011parallel}. 216However, in contrast to the deterministic, longest-match 217scanning associated with the ScanThru primitive of that 218work, we introduce here a new primitive MatchStar 219that can be used in full generality for nondeterministic 220regular expression matching. We also introduce a long-stream 221addition technique involving a further application of MatchStar 222that enables us to scale the technique to$n$-bit addition 223in$\lceil\log_{64}{n}\rceil$steps. We ultimately apply this technique, 224for example, to perform 225synchronized 4096-bit addition on GPGPU wavefronts of 64 threads. 226 227There is also a strong keyword match between the bit-parallel 228data streams used in our approach and the bit-parallelism 229used for NFA state transitions in the classical algorithms of 230Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new} 231and Navarro and Raffinot \cite{navarro1998bit}. 232However those algorithms use bit-parallelism in a fundamentally 233different way: representing all possible current NFA states 234as a bit vector and performing parallel transitions to a new 235set of states using table lookups and bitwise logic. Whereas 236our approach can match multiple characters per step, bit-parallel 237NFA algorithms proceed through the input one byte at a time. 238Nevertheless, the agrep \cite{wu1992agrep} and 239nrgrep \cite{navarro2000} programs implemented using these techniques remain 240among the strongest competitors in regular expression matching 241performance, so we include them in our comparative evaluation. 242 243The remainder of this paper is organized as follows. 244Section \ref{sec:grep} briefly describes regular expression 245notation and the grep problem. 246Section \ref{sec:bitwise} presents our basic algorithm and MatchStar 247primitive using a model of arbitrary-length bit-parallel data streams. 248Section \ref{sec:blockwise} discusses the block-by-block 249implementation of our techniques including the long stream 250addition techniques for 256-bit addition with AVX2 and 2514096-bit additions with GPGPU SIMT. 252Section \ref{sec:SSE2} describes our overall SSE2 implementation 253and carries out a performance study in comparison with 254existing grep implementations. 255Given the dramatic variation in grep performance across 256different implementation techniques, expressions and data sets, 257Section \ref{sec:analysis} considers a comparison between 258the bit-stream and NFA approaches from a theoretical perspective. 259Section \ref{sec:AVX2} then examines and demonstrates 260the scalability of our 261bitwise data-parallel approach in moving from 262128-bit to 256-bit SIMD on Intel Haswell architecture. 263To further investigate scalability, Section \ref{sec:GPU} 264addresses the implementation of our matcher using 265groups of 64 threads working together SIMT-style on a GPGPU system. 266Section \ref{sec:Concl} concludes the paper with a discussion of results and 267areas for future work. 268 269\section{Regular Expression Notation and Grep}\label{sec:grep} 270 271We follow common POSIX notation for regular expressions. 272A regular expression specifies a set of strings through 273a pattern notation. Individual characters normally 274stand for themselves, unless they are one of the 275special characters \verb:*+?[{\(|^$.: that serve as metacharacters
276of the notation system.  Thus the regular expression \verb:cat:
277is a pattern for the set consisting of the single 3-character
278string \verb:cat:''.   The special characters must be escaped
279with a backslash to prevent interpretation as metacharacter, thus
280\verb:\$: represents the dollar-sign and \verb:\\\\: represent 281the string consisting of two backslash characters. 282Character class bracket expressions are pattern elements 283that allow any character in a given class to be used in a particular 284context. For example, \verb:[@#%]: is a regular expression 285that stands for any of the three given symbols. Contiguous 286ranges of characters may be specified using hyphens; 287for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]: 288for any alphanumeric character or underscore. If the 289caret character immediately follows the opening bracket, 290the class is negated, thus \verb:[^0-9]: stands for 291any character except a digit. The period metacharacter 292\verb:.: stands for the class of all characters. 293 294Consecutive pattern elements stand for strings formed by 295concatenation, thus \verb:[cd][ao][tg]: stands for the 296set of 8 three-letter strings \verb:cat:'' through \verb:dog:''. 297The alternation operator \verb:|: allows a pattern to be 298defined to have two alternative forms, thus \verb:cat|dog: 299matches either \verb:cat:'' or \verb:dog:''. Concatenation 300takes precedence over alternation, but parenthesis may be 301used to change this, thus \verb:(ab|cd)[0-9]: stands for any 302digit following one of the two prefixes \verb:ab:'' or \verb:cd:''. 303 304Repetition operators may be appended to a pattern to specify 305a variable number of occurrences of that pattern. 306The Kleene star operator \verb:*: specifies zero-or-more occurrences 307matching the previous pattern, while Kleene plus \verb:+: specifies one-or 308more occurrences. Thus \verb:[a-z][a-z]*: and \verb:[a-z]+: 309both specify the same set: strings of at least one lower-case 310letter. The postfix operator \verb:?: specifies an optional 311component, i.e., zero-or-one occurrence of strings matching 312the element. Specific bounds may be given within braces: 313\verb:(ab){3}: specifies the string \verb:ababab:'', 314\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three 315or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies 316strings of at least 4 consecutive capital letters. 317 318The grep program searches a file for lines containing matches 319to a regular expression using any of the above notations. 320In addition, the pattern elements \verb:^: and \verb:$:
321may be used to match respectively the beginning or the
322end of a line.  In line-based tools such as grep, \verb:.:
323matches any character except newlines; matches cannot extend
324over lines.
325Normally, grep prints all matching
326lines to its output.  However, grep programs typically
327allow a command line flag such as \verb:-c: to specify
328that only a count of matching lines be produced; we use
329this option in our experimental evaluation to focus
330our comparisons on the performance of the underlying matching
331algorithms.
332
333\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
334
335Whereas the traditional approaches to regular expression matching
336using NFAs, DFAs or backtracking all rely on a byte-at-a-time
337processing model, the approach  we introduce in this paper is based
338on quite a different concept:  a data-parallel approach to simultaneous
339processing of data stream elements.  Indeed, our most abstract model
340is that of unbounded data parallelism: processing all elements of
341the input data stream simultaneously.   In essence, data streams are viewed
342as (very large) integers.   The fundamental operations are bitwise
343logic, stream shifting and long-stream addition.
344
345Depending on the available parallel processing resources, an actual
346implementation may divide an input stream into blocks  and process
347the blocks sequentially.   Within each block  all elements of the
348input stream are processed together, relying on the availability of
349bitwise logic and addition scaled to the block size.   On commodity
350Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
351we typically process input streams 128 bytes at a time.
352In this
353case, we rely on the Parabix tool chain \cite{lin2012parabix}
354to handle the details of compilation to block-by-block processing.
355On the
356latest processors supporting the 256-bit AVX2 SIMD operations,
357we also use the Parabix tool chain, but substitute a parallelized
358long-stream addition technique to avoid the sequential chaining
360Our GPGPU implementation uses scripts to modify the output
361of the Parabix tools, effectively dividing the input into blocks
362of 4K bytes.
365SIMT fashion.
366
367\begin{figure}[tbh]
368\begin{center}
369\begin{tabular}{cr}\\
370input data  & \verba453z--b3z--az--a12949z--ca22z7--\\
371$B_7$ & \verb.................................\\
372$B_6$ & \verb1...1..1.1..11..1.....1..11..1...\\
373$B_5$ & \verb111111111111111111111111111111111\\
374$B_4$ & \verb.1111...11...1...111111....1111..\\
375$B_3$ & \verb....111..111.111...1.1111....1.11\\
376$B_2$ & \verb.11..11...11..11....1..11.....111\\
377$B_1$ & \verb...11..111...1....1...1..1.1111..\\
378$B_0$ & \verb1.11.11.1.111.1111.1.1.1111...111\\
379\verb:[a]: & \verb1...........1...1.........1......\\
380\verb:[z9]: & \verb....1....1...1.....1.11......1...\\
381\verb:[0-9]: & \verb.111....1........11111.....11.1..
382\end{tabular}
383
384\end{center}
385\caption{Basis and Character Class Streams}
386\label{fig:streams}
387\end{figure}
388
389A key concept in this streaming approach is the derivation of bit streams
390that are parallel to the input data stream, i.e., in one-to-one
391correspondence with the data element positions of the input
392streams.   Typically, the input stream is a byte stream comprising
393the 8-bit character code units of a particular encoding such
394as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
395easily be used with wider code units such as the 16-bit code units of
396UTF-16.   In the case of a byte stream, the first step is to transpose
397the byte stream into eight parallel bit streams, such that bit stream
398$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
399a set of basis bit streams from which many other parallel bit
400streams can be calculated, such as character class bit
401streams such that each bit $j$ of the stream specifies
402whether character $j$ of the input stream is in the class
403or not.  Figure \ref{fig:streams} shows an example of an
404input byte stream in ASCII, the eight basis bit streams of the
405transposed representation, and the character class bit streams
406\verb:[a]:,
407\verb:[z9]:, and
408\verb:[0-9]:
409that may be computed from the basis bit streams using bitwise logic.
410Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
411Transposition and character class construction are straightforward
412using the Parabix tool chain \cite{lin2012parabix}.
413
414\begin{figure}[tbh]
415\begin{center}
416\begin{tabular}{cr}\\
417input data  & \verba453z--b3z--az--a12949z--ca22z7--\\
418$M_1$ & \verb.1...........1...1.........1.....\\
419$M_2$ & \verb.1111........1...111111....111...\\
420$M_3$ & \verb.....1........1.....1.11......1..
421\end{tabular}
422
423\end{center}
424\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
425\label{fig:streams2}
426\end{figure}
427
428\paragraph*{Marker Streams.}  Now consider how bit-parallel data
429streams can be used in regular expression matching.   Consider
430the problem of searching the input stream of Figure \ref{fig:streams}
431to finding occurrence of strings matching
432the regular expression \verb:a[0-9]*[z9]:.
433Note that this is an ambiguous regular expression, which could match
434texts such as \verb:a12949z: in multiple ways.
435The matching process involves the concept of {\em marker streams}, that
436is streams that mark the positions of current matches during the
437overall process.  In this case there are three marker streams computed
438during the match process, namely,
439$M_1$ representing match positions after an initial \verb:a:
440character has been found, $M_2$ representing positions
441reachable from positions marked by $M_1$ by further matching zero or
442more digits (\verb:[0-9]*:) and finally $M_3$ the stream
443marking positions after a final \verb:z: or \verb:9: has been found.
444Without describing the details of how these streams are computed
445for the time being, Figure \ref{fig:streams2} shows what each
446of these streams should be for our example matching problem.
447Our convention that a marker stream contains a 1 bit
448at the next character position to be matched, that is,
449immediately past the last position that was matched.
450Note that all three matches from the third occurrence of \verb:a:
451are correctly marked in $M_3$.
452
453
454\paragraph*{MatchStar.}
455MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
456
457\begin{figure}[tbh]
458\begin{center}
459\begin{tabular}{cr}\\
460input data  & \verba453z--b3z--az--a12949z--ca22z7--\\
461$M_1$ & \verb.1...........1...1.........1.....\\
462$C = \text{\tt [0-9]}$ & \verb.111....1........11111.....11.1..\\
463$T_0 = M_1 \wedge C$ & \verb.1...............1.........1.....\\
464$T_1 = T_0 + C$ & \verb....1...1.............1......11..\\
465$T_2 = T_1 \oplus C$ & \verb.1111............111111....111...\\
466$M_2 = T_2 \vee M_1$ & \verb.1111........1...111111....111...
467\end{tabular}
468
469\end{center}
470\caption{$M_2 = \text{MatchStar}(M_1, C)$}
471\label{fig:matchstar}
472\end{figure}
473
474
475Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
476it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
477conventional presentation of our character data input.   However, this reverses the normal
478order of presentation when considering bitstreams as numeric values.  The key point here is
479that when we perform bitstream addition, we will show bit movement from left-to-right.
480For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
481
482The first row of the figure is the input data,
483the second and third rows are the input bitstreams: the initial marker position bitstream and the
484character class bitstream for digits derived from input data.
485
486In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
487The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
488output marker stream is obtained by ORing $T_2$ with the initial marker stream.
489
490In general, given a marker stream $M$ and a character class stream $C$,
491the operation of MatchStar is defined by the following equation.
492$\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$
493Given a set of initial marker positions, the result stream marks
494all possible positions that can be reached by 0 or more occurrences
495of characters in class $C$ from each position in $M$
496
497MatchStar differs from ScanThru of the Parabix tool chain in that it
498finds all matches, not just the longest match.  This is necessary
499for general matching involving possibly ambiguous regular
500expressions.
501
502\input{compilation}
503
504\input{re-Unicode}
505
506\section{Block-at-a-Time Processing}\label{sec:blockwise}
507
508The unbounded stream model of the previous section must of course
509be translated an implementation that proceeds block-at-a-time for
510realistic application.  In this, we primarily rely on the Pablo
511compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
512statements expressed as arbitrary-length bitstream equations, Pablo
513produces block-at-a-time C++ code that initializes and maintains all the necessary
514carry bits for each of the additions and shifts involved in the
515bitstream calculations.
516
517In the present work, our principal contribution to the Parabix tool
518chain is to incorporate the technique of long-stream addition described below.
519Otherwise, we were able to use Pablo directly in compiling our
520SSE2 and AVX2 implementations.   Our GPGPU implementation required
521some scripting to modify the output of the Pablo compiler for our
522purpose.
523
524\paragraph*{Long-Stream Addition.}  The maximum word size for
525addition on commodity processors is typically 64 bits.  In order
526to implement long-stream addition for block sizes of 256 or larger,
527a method for propagating carries through the individual stages of
52864-bit addition is required.  However, the normal technique of
530is far from ideal.
531
532We use the following general model using SIMD methods for constant-time
533long-stream addition up to 4096 bits.   Related GPGPU solutions have been
534independently developed\cite{Crovella2012},
535however our model is intended to be a more broadly applicable abstraction.
536We assume the availability of the following SIMD/SIMT operations
537operating on vectors of $f$ 64-bit fields.
538\begin{itemize}
540in two vectors to produce a result vector of $f$ 64-bit fields.
541\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
542of \verb:x: each with the constant value -1 (all bits 1), producing
543an $f$-bit mask value,
544\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
545field into a single compressed $f$-bit mask value, and
546\item normal bitwise logic operations on $f$-bit masks, and
547\item  \verb#simd<64>::spread(X)#: distributing the bits of
548an $f$ bit mask, one bit each to the $f$ 64-bit fields of a vector.
549\end{itemize}
550
551In this model, the \verb#hsimd<64>::mask(X)# and
553communication requirements between the parallel processing units
554(SIMD lanes or SIMT processors).    In essence, we just need
555the ability to quickly send and receive 1 bit of information
556per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
557gathers 1 bit from each of the processors to a central resource.
558After calculations on the gather bits are performed, we then
559just need an operation to invert the communication, i.e.,
560sending 1 bit each from the central processor to each of
561the parallel units.   There are a variety of ways in which
562these facilities may be implemented depending on the
563underlying architecture; details of our AVX2 and GPGPU implementations
564are presented later.
565
566Given these operations, our method for long stream addition of
567two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
568\begin{enumerate}
569\item Form the vector of 64-bit sums of \verb:x: and \verb:y:.
570$\text{\tt R} = \text{\tt simd<64>::add(X, Y)}$
571
572\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
573$\text{\tt x} = \text{\tt hsimd<64>::mask(X)}$
574$\text{\tt y} = \text{\tt hsimd<64>::mask(Y)}$
575$\text{\tt r} = \text{\tt hsimd<64>::mask(R)}$
576
577\item Compute an $f$-bit mask of carries generated for each of the
57864-bit additions of \verb:X: and \verb:Y:.
579$\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})$
580
581\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
582an incoming carry bit.  This is called the {\em bubble mask}.
583$\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$
584
585\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
586incremented to produce the final sum.  Here we find a new application of
587MatchStar.
588$\text{\tt i} = \text{\tt MatchStar(c*2, b)}$
589
590This is the key step.  The mask {\tt c} of outgoing carries must be
591shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
592with the next digit.  At the incoming position, the carry will
593increment the 64-bit digit.   However, if this digit is all ones (as
594signaled by the corresponding bit of bubble mask {\tt b}, then the addition
595will generate another carry.  In fact, if there is a sequence of
596digits that are all ones, then the carry must bubble through
597each of them.   This is just MatchStar.
598
599\item Compute the final result {\tt Z}.
600$\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}$
601
602\end{enumerate}
603\begin{figure}
604\begin{center}
605\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
606{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
607{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
608{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
609{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
610{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
611{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
612{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
613{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
614{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
615{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
616{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
617\end{tabular}
618\end{center}
620\end{figure}
621
622Figure \ref{fig:longadd} illustrates the process.  In the figure,
623we illustrate the process with 8-bit fields rather than 64-bit fields
624and show all field values in hexadecimal notation.  Note that
625two of the individual 8-bit additions produce carries, while two
626others produce {\tt FF} values that generate bubble bits.  The
627net result is that four of the original 8-bit sums must be
628incremented to produce the long stream result.
629
630A slight extension to the process produces a long-stream full adder
631that can be used in chained addition.  In this case, the
633{\tt p} and produce a carry-out bit {\tt q}.
634This may be accomplished by incorporating {\tt p}
635in calculating the increment mask in the low bit position,
636and then extracting the carry-out {\tt q} from the high bit position.
637$\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}$
638$\text{\tt q} = \text{\tt i >> f}$
639
640As described subsequently, we use a two-level long-stream addition technique
641in both our AVX2 and GPGPU implementations.  In principle, one can extend
643$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
644A three-level scheme could coordinate
64564 groups each performing 4096-bit long additions in a two-level structure.
646However, whether there are reasonable architectures that can support fine-grained
647SIMT style at this level is an open question.
648
649Using the methods outlined, it is quite conceivable that instruction
651future SIMD and GPGPU processors.   Given the fundamental nature
652of addition as a primitive and its particular application to regular
653expression matching as shown herein, it seems reasonable to expect
654such instructions to become available.    Alternatively, it may
655be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and
657
658
659\input{sse2}
660
661\input{analysis}
662
663\input{avx2}
664
665
666
667\section{GPGPU Implementation}\label{sec:GPU}
668
669To further assess the scalability of our regular expression matching
670using bit-parallel data streams, we implemented a GPGPU version
671in OpenCL.
672We arranged for 64 work groups each having 64 threads.
673The size of work group and number of work groups is chosen
674to provide the best occupancy as calculated by the AMD App Profiler.
675Input files are divided in data parallel fashion among
676the 64 work groups.  Each work group carries out the regular
677expression matching operations 4096 bytes at a time using SIMT
678processing.   Although the GPGPU
681we are able to simulate them using shared memory.
683its own carry and bubble values in shared memory and performs
685parallel-prefix style process.  Others have implemented
686long-stream addition on the GPGPU using similar techniques,
687as noted previously.
688
689We performed our test on an AMD Radeon HD A10-6800K APU machine.
690On the AMD Fusion systems, the input buffer is allocated in
691pinned memory to take advantage of the zero-copy memory regions
692where data can be read directly into this region by the CPU
693and also accessed by the GPGPU for further processing. Therefore,
694the expensive data transferring time that is needed by traditional
695discrete GPGPUs is hidden and we compare only the kernel execution
696time with our SSE2 and AVX implementations as shown in Figure
697\ref{fig:SSE-AVX-GPU}. The GPGPU version gives 30\% to 60\% performance
698improvement over SSE version and 10\% to 40\% performance
699improvement over AVX version. Although we intended to process
70064 work groups with 4096 bytes each at a time rather than 128 bytes
701at a time on SSE or 256 bytes at a time on AVX, the performance
702improvement is less than 60\%. The first reason is hardware
703limitations. Our kernel occupancy is limited by register usage
704and not all the work groups can be scheduled at the same time.
705The second reason is that the long-stream addition implemented
706on GPGPU is more expensive than the implementations on SSE or AVX.
707Another important reason is the control flow. When a possible
708match is found in one thread, the rest of the threads in the
709same work group have to execute the same instructions for
711simple IF test. Therefore, the performance of different
712regular expressions is dependent on the number of
713long-stream addition operations and the total number of matches
714of a given input.   Perhaps surprisingly, the overhead of the Parabix
715transformation was not a dominant factor, coming in at 0.08 ms/MB.
716
717
718\begin{figure}
719\begin{center}
720\begin{tikzpicture}
721\begin{axis}[
722xtick=data,
723ylabel=Running Time (ms per megabyte),
724xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
725tick label style={font=\tiny},
726enlarge x limits=0.15,
727%enlarge y limits={0.15, upper},
728ymin=0,
729legend style={at={(0.5,-0.15)},
730anchor=north,legend columns=-1},
731ybar,
732bar width=7pt,
733]
735file {data/ssetime.dat};
737file {data/avxtime.dat};
739file {data/gputime.dat};
740
741\legend{SSE2,AVX2,GPGPU,Annot}
742\end{axis}
743\end{tikzpicture}
744\end{center}
745\caption{Running Time}\label{fig:SSE-AVX-GPU}
746\end{figure}
747
748
749
750
751
752
753
754
755\input{conclusion}
756
757
758
759%\appendix
760%\section{Appendix Title}
761
762%This is the text of the appendix, if you need one.
763
764
765This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
766MITACS, Inc.
767
768\bibliographystyle{IEEEtranS}
769\bibliography{reference}
770
771\end{document}
772
773
Note: See TracBrowser for help on using the repository browser.