% source: docs/Working/re/re-main.tex@3658
% Last change on this file since 3658 was 3653, checked in by lindanl, 5 years ago
% GPU chart
\documentclass[pageno]{jpaper}

%replace XXX with the submission number you are given from the PACT submission site.
\newcommand{\pactsubmissionnumber}{XXX}

\usepackage[normalem]{ulem}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{pgfplots}

\begin{document}

\title{Bitwise Data Parallelism in Regular Expression Matching}


\date{}
\maketitle

\thispagestyle{empty}


\begin{abstract}
\input{abstract}
\end{abstract}

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is converted
to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
NFA in worst-case $O(mn)$ time. Often, a more efficient choice
is to convert an NFA into a DFA. A DFA has a single active state at any time
throughout the matching process and
hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well
known that the conversion of an NFA to an equivalent DFA may result
in state explosion. That is, the number of resultant DFA states may increase exponentially.}.

A significant proportion of the research in fast regular expression matching can be
regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet
presented a new approach to string search based on bit-level parallelism.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word.
Given a $w$-bit word, the number of operations that a string search algorithm
performs can be reduced by a factor of $w$.
Using this fact, the Shift-Or algorithm simulates an NFA using
bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
A disadvantage of the Shift-Or approach
is an inability to skip input characters.
Simple string matching algorithms,
such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
to achieve sublinear times in the average case.
% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
% based on suffix automata are able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of the Shift-Or approach
with the ability to skip characters. %character skipping property of BDM algorithms.
The nrgrep pattern matching tool
is based on the BNDM algorithm. Prior to the bitwise
data parallel approach presented herein, nrgrep
was by far the fastest grep tool
for matching complex patterns, and achieved similar performance
to that of the fastest existing string
matching tools for simple patterns \cite{navarro2000}.


There has been considerable recent
interest in accelerating regular expression matching
on parallel hardware
such as multicore processors (CPUs),
general purpose graphics processing units (GPGPUs),
field-programmable gate arrays (FPGAs),
and specialized architectures such as
the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed on multicore hardware.
In related work, Pasetto et al.\ presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al.\ \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited for parallel
implementation on multicore CPUs, GPGPUs and the Cell BE.
On each platform, both thread-level parallelism (cores) and data-level parallelism
(SIMD units) were leveraged for performance.
Salapura et al.\ \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multicore processors to achieve a speedup of greater than 1.8 over a
range of data sizes of interest.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
that delivered 1.00--1.78 Gbps on a single
Cell BE chip and extended this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieved
4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al.\ \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching on GPGPUs. Lin et al.\ proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPGPU hardware and
achieved 143 Gbps raw data throughput,
although system throughput was limited to 15 Gbps \cite{lin2013accelerating}.


Whereas the existing approaches to parallelization have been
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPGPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.   We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
for example, to perform

There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.    Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.


The remainder of this paper is organized as follows.
Section \ref{sec:grep} briefly describes regular expression
notation and the grep problem.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
primitive using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques including the long stream
Section \ref{sec:SSE2} describes our overall SSE2 implementation
and carries out a performance study in comparison with
existing grep implementations.
Given the dramatic variation in grep performance across
different implementation techniques, expressions and data sets,
Section \ref{sec:analysis} considers a comparison between
the bit-stream and NFA approaches from a theoretical perspective.
Section \ref{sec:AVX2} then examines and demonstrates
the scalability of our
bitwise data-parallel approach in moving from
128-bit to 256-bit SIMD on Intel Haswell architecture.
To further investigate scalability, Section \ref{sec:GPU}
addresses the implementation of our matcher using
groups of 64 threads working together SIMT-style on a GPGPU system.
Section \ref{sec:Concl} concludes the paper with a discussion of results and
areas for future work.

\section{Regular Expression Notation and Grep}\label{sec:grep}

We follow common POSIX notation for regular expressions.
A regular expression specifies a set of strings through
a pattern notation.   Individual characters normally
stand for themselves, unless they are one of the
special characters \verb:*+?[{\(|^$.: that serve as metacharacters
of the notation system.  Thus the regular expression \verb:cat:
is a pattern for the set consisting of the single 3-character
string ``\verb:cat:''.   The special characters must be escaped
with a backslash to prevent interpretation as metacharacters, thus
\verb:\$: represents the dollar sign and \verb:\\\\: represents
the string consisting of two backslash characters.
Character class bracket expressions are pattern elements
that allow any character in a given class to be used in a particular
context.  For example, \verb:[@#%]: is a regular expression
that stands for any of the three given symbols.  Contiguous
ranges of characters may be specified using hyphens;
for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
for any alphanumeric character or underscore.  If the
caret character immediately follows the opening bracket,
the class is negated, thus \verb:[^0-9]: stands for
any character except a digit.  The period metacharacter
\verb:.: stands for the class of all characters.

Consecutive pattern elements stand for strings formed by
concatenation, thus \verb:[cd][ao][tg]: stands for the
set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
The alternation operator \verb:|: allows a pattern to be
defined to have two alternative forms, thus \verb:cat|dog:
matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
takes precedence over alternation, but parentheses may be
used to change this, thus \verb:(ab|cd)[0-9]: stands for any
digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.

Repetition operators may be appended to a pattern to specify
a variable number of occurrences of that pattern.
The Kleene \verb:*: specifies zero-or-more occurrences
matching the previous pattern, while \verb:+: specifies
one-or-more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
both specify the same set: strings of at least one lower-case
letter.  The postfix operator \verb:?: specifies an optional
component, i.e., zero-or-one occurrence of strings matching
the element.  Specific bounds may be given within braces:
\verb:(ab){3}: specifies the string ``\verb:ababab:'',
\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
strings of at least 4 consecutive capital letters.

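The notation above can be tried out directly. The following sketch uses Python's standard \verb:re: module, on the assumption (true for the constructs quoted in this section, though not for every POSIX feature) that its syntax agrees with the notation described here. The helper \verb:matches: and the table \verb:CASES: are names introduced only for this illustration; each entry pairs a pattern with strings inside and outside the set it denotes.

```python
import re

def matches(pattern, string):
    """True if the whole of `string` belongs to the set denoted by `pattern`."""
    return re.fullmatch(pattern, string) is not None

# (pattern, strings in the set, strings not in the set)
CASES = [
    (r"[cd][ao][tg]",     ["cat", "dog", "cog"], ["cab", "at"]),
    (r"cat|dog",          ["cat", "dog"],        ["cag"]),
    (r"(ab|cd)[0-9]",     ["ab7", "cd0"],        ["ad7", "ab"]),
    (r"[a-z][a-z]*",      ["a", "grep"],         ["", "Grep"]),
    (r"[a-z]+",           ["a", "grep"],         ["", "Grep"]),
    (r"(ab){3}",          ["ababab"],            ["abab"]),
    (r"[0-9A-Fa-f]{2,4}", ["1f", "BEEF"],        ["f", "12345"]),
    (r"[A-Z]{4,}",        ["GREP", "PARABIX"],   ["SSE"]),
    (r"\$",               ["$"],                 ["\\$"]),
]

for pattern, inside, outside in CASES:
    assert all(matches(pattern, s) for s in inside)
    assert not any(matches(pattern, s) for s in outside)
```

Whole-string matching is used here because the section describes patterns as sets of strings; the grep program instead reports any line that contains a match.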

The grep program searches a file for lines containing matches
to a regular expression using any of the above notations.
In addition, the pattern elements \verb:^: and \verb:$:
may be used to match, respectively, the beginning and the
end of a line.  In line-based tools such as grep, \verb:.:
matches any character except newlines; matches cannot extend
over lines.
Normally, grep prints all matching
lines to its output.  However, grep programs typically
allow a command-line flag such as \verb:-c: to specify
that only a count of matching lines be produced; we use
this option in our experimental evaluation to focus
our comparisons on the performance of the underlying matching
algorithms.

\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.   In essence, data streams are viewed
as (very large) integers.   The fundamental operations are bitwise
logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.   Within each block all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.   On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.
In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
On the
latest processors supporting the 256-bit AVX2 SIMD operations,
we also use the Parabix tool chain, but substitute a parallelized
long-stream addition technique to avoid the sequential chaining
of four 64-bit additions.
Our GPGPU implementation uses scripts to modify the output
of the Parabix tools, effectively dividing the input into blocks
of 4K bytes.
We have also adapted our long-stream addition technique
to perform 4096-bit additions using 64 threads working in lock-step
SIMT fashion.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a453z--b3z--az--a12949z--ca22z7--:\\
$B_7$ & \verb:.................................:\\
$B_6$ & \verb:1...1..1.1..11..1.....1..11..1...:\\
$B_5$ & \verb:111111111111111111111111111111111:\\
$B_4$ & \verb:.1111...11...1...111111....1111..:\\
$B_3$ & \verb:....111..111.111...1.1111....1.11:\\
$B_2$ & \verb:.11..11...11..11....1..11.....111:\\
$B_1$ & \verb:...11..111...1....1...1..1.1111..:\\
$B_0$ & \verb:1.11.11.1.111.1111.1.1.1111...111:\\
\verb:[a]: & \verb:1...........1...1.........1......:\\
\verb:[z9]: & \verb:....1....1...1.....1.11......1...:\\
\verb:[0-9]: & \verb:.111....1........11111.....11.1..:
\end{tabular}

\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.   Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.
In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams in which each bit $j$ specifies
whether character $j$ of the input stream is in the class
or not.  Figure \ref{fig:streams} shows an example of an
input byte stream in ASCII, the eight basis bit streams of the
transposed representation, and the character class bit streams
\verb:[a]:,
\verb:[z9]:, and
\verb:[0-9]:
that may be computed from the basis bit streams using bitwise logic.
Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
Transposition and character class construction are straightforward
using the Parabix tool chain \cite{lin2012parabix}.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a453z--b3z--az--a12949z--ca22z7--:\\
$M_1$ & \verb:.1...........1...1.........1.....:\\
$M_2$ & \verb:.1111........1...111111....111...:\\
$M_3$ & \verb:.....1........1.....1.11......1..:
\end{tabular}

\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}}
\label{fig:streams2}
\end{figure}

\paragraph*{Marker Streams.}  Now consider how bit-parallel data
streams can be used in regular expression matching.   Consider
the problem of searching the input stream of Figure \ref{fig:streams}
to find occurrences of strings matching
the regular expression \verb:a[0-9]*[z9]:.
Note that this is an ambiguous regular expression, which could match
texts such as \verb:a12949z: in multiple ways.
The matching process involves the concept of {\em marker streams}, that
is, streams that mark the positions of current matches during the
overall process.
In this case, there are three marker streams computed
during the match process, namely,
$M_1$ representing match positions after an initial \verb:a:
character has been found, $M_2$ representing positions
reachable from positions marked by $M_1$ by further matching zero or
more digits (\verb:[0-9]*:) and finally $M_3$, the stream
marking positions after a final \verb:z: or \verb:9: has been found.
Without describing the details of how these streams are computed
for the time being, Figure \ref{fig:streams2} shows what each
of these streams should be for our example matching problem.
Our convention is that a marker stream contains a 1 bit
at the next character position to be matched, that is,
immediately past the last position that was matched.
Note that all three matches from the third occurrence of \verb:a:
are correctly marked in $M_3$.


\paragraph*{MatchStar.}
MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.

\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a453z--b3z--az--a12949z--ca22z7--:\\
$M_1$ & \verb:.1...........1...1.........1.....:\\
$D = \text{\tt [0-9]}$ & \verb:.111....1........11111.....11.1..:\\
$T_0 = M_1 \wedge D$ & \verb:.1...............1.........1.....:\\
$T_1 = T_0 + D$ & \verb:....1...1.............1......11..:\\
$T_2 = T_1 \oplus D$ & \verb:.1111............111111....111...:\\
$M_2 = T_2 \, | \, M_1$ & \verb:.1111........1...111111....111...:
\end{tabular}

\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, D)$}
\label{fig:matchstar}
\end{figure}


Figure \ref{fig:matchstar} illustrates the MatchStar method.
In this figure,
it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
conventional presentation of our character data input.   However, this reverses the normal
order of presentation when considering bitstreams as numeric values.  The key point here is
that when we perform bitstream addition, we will show bit movement from left to right.
For example, $\verb:111.: + \verb:1...: = \verb:...1:$.

The first row of the figure is the input data;
the second and third rows are the input bitstreams: the initial marker position bitstream and the
character class bitstream for digits derived from input data.

In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream.
The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
output marker stream is obtained by ORing $T_2$ with the initial marker stream.

In general, given a marker stream $M$ and a character class stream $C$,
the operation of MatchStar is defined by the following equation.
\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
Given a set of initial marker positions, the result stream marks
all possible positions that can be reached by 0 or more occurrences
of characters in class $C$ from each position in $M$.

MatchStar differs from ScanThru of the Parabix tool chain in that it
finds all matches, not just the longest match.  This is necessary
for general matching involving possibly ambiguous regular
expressions.

\input{compilation}

\input{re-Unicode}

\section{Block-at-a-Time Processing}\label{sec:blockwise}

The unbounded stream model of the previous section must of course
be translated into an implementation that proceeds block-at-a-time for
realistic application.  In this, we primarily rely on the Pablo
compiler of the Parabix tool chain \cite{lin2012parabix}.  Given input
statements expressed as arbitrary-length bitstream equations, Pablo
produces block-at-a-time C++ code that initializes and maintains all the necessary
carry bits for each of the additions and shifts involved in the
bitstream calculations.

In the present work, our principal contribution to the Parabix tool
chain is to incorporate the technique of long-stream addition described below.
Otherwise, we were able to use Pablo directly in compiling our
SSE2 and AVX2 implementations.   Our GPGPU implementation required
some scripting to modify the output of the Pablo compiler for our
purpose.

\paragraph*{Long-Stream Addition.}  The maximum word size for
addition on commodity processors is typically 64 bits.  In order
to implement long-stream addition for block sizes of 256 bits or larger,
a method for propagating carries through the individual stages of
64-bit addition is required.  However, the normal technique of
sequential addition using add-with-carry instructions, for example,
is far from ideal.

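To see why carries may have to travel across many machine words, it helps to run MatchStar in the unbounded model first. The following Python sketch is only an illustration and not the Parabix implementation: it assumes that a bit stream can be modelled as one arbitrary-precision integer with text position 0 as the least significant bit, and the helper names (\verb:class_stream:, \verb:advance:, \verb:match_star:, \verb:positions:, \verb:show:) are introduced here for the example. It recomputes the marker streams $M_1$, $M_2$ and $M_3$ of Figure \ref{fig:streams2} for the expression \verb:a[0-9]*[z9]:.

```python
TEXT = "a453z--b3z--az--a12949z--ca22z7--"   # input data of the figures

def class_stream(text, chars):
    """Character class stream: bit j is 1 iff text[j] is in chars.

    Position 0 is the least significant bit, so integer addition moves
    carries toward later text positions (left to right in the figures).
    """
    return sum(1 << j for j, ch in enumerate(text) if ch in chars)

def advance(markers):
    """Move every marker one character position forward."""
    return markers << 1

def match_star(m, c):
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M."""
    return (((m & c) + c) ^ c) | m

def positions(stream):
    """Sorted list of the bit positions that are set in a stream."""
    return [j for j in range(stream.bit_length()) if (stream >> j) & 1]

def show(stream, width=len(TEXT)):
    """Render a stream in figure order, with '.' for zero bits."""
    return "".join("1" if (stream >> j) & 1 else "." for j in range(width))

digits = class_stream(TEXT, "0123456789")
m1 = advance(class_stream(TEXT, "a"))          # after an initial a
m2 = match_star(m1, digits)                    # after zero or more digits
m3 = advance(m2 & class_stream(TEXT, "z9"))    # after a final z or 9

for name, stream in [("M1", m1), ("M2", m2), ("M3", m3)]:
    print(name, show(stream))
```

In $M_2$ the single addition inside \verb:match_star: carries across the whole digit run \verb:12949:; a block-at-a-time implementation has to reproduce exactly this carry when such a run straddles a word or block boundary.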

We adopt the following general SIMD model for constant-time
long-stream addition up to 4096 bits.   Related GPGPU solutions have been
independently developed \cite{Crovella2012};
however, our model is intended to be a more broadly applicable abstraction.
We assume the availability of the following SIMD/SIMT operations
operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields,
\item  \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value -1 (all bits 1), producing
an $f$-bit mask value,
\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(x)#: distributing the bits of
an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
\end{itemize}

In this model, the \verb#hsimd<64>::mask(X)# and
\verb#simd<64>::spread(x)# operations model the minimum
communication requirements between the parallel processing units
(SIMD lanes or SIMT processors).    In essence, we just need
the ability to quickly send and receive 1 bit of information
per parallel unit.    The \verb#hsimd<64>::mask(X)# operation
gathers 1 bit from each of the processors to a central resource.
After calculations on the gathered bits are performed, we then
just need an operation to invert the communication, i.e.,
sending 1 bit each from the central processor to each of
the parallel units.   There are a variety of ways in which
these facilities may be implemented depending on the
underlying architecture; details of our AVX2 and GPGPU implementations
are presented later.


Given these operations, our method for long-stream addition of
two $(f \times 64)$-bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is the {\em bubble mask}.
\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of
MatchStar.
\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.   However, if this digit is all ones (as
signalled by the corresponding bit of bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.   This is just MatchStar.

\item Compute the final result {\tt Z}.
\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]

\end{enumerate}
\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 00} & {\tt 00} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long Stream Addition}\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process, using
8-bit fields rather than 64-bit fields
and showing all field values in hexadecimal notation.  Note that
two of the individual 8-bit additions produce carries, while two
others produce {\tt FF} values that generate bubble bits.  The
net result is that four of the original 8-bit sums must be
incremented to produce the long stream result.

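The six steps can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the SIMD code produced by the tool chain: fields are 8 bits wide as in Figure \ref{fig:longadd}, a vector is a Python list with the least significant field first, and \verb:high_bit_mask:, \verb:all_ones_mask: and \verb:long_add: are hypothetical stand-ins for \verb#hsimd<64>::mask#, \verb#simd<64>::eq# and the method as a whole. The optional carry-in \verb:p: and carry-out \verb:q: anticipate the full-adder extension described next.

```python
FIELD_BITS = 8                       # the figure uses 8-bit fields; the text uses 64
ALL_ONES = (1 << FIELD_BITS) - 1

def match_star(m, c):
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M, here applied to f-bit masks."""
    return (((m & c) + c) ^ c) | m

def high_bit_mask(fields):
    """Stand-in for hsimd<64>::mask: bit k is the high bit of field k."""
    return sum((v >> (FIELD_BITS - 1)) << k for k, v in enumerate(fields))

def all_ones_mask(fields):
    """Stand-in for simd<64>::eq(R, -1): bit k is 1 iff field k is all ones."""
    return sum(1 << k for k, v in enumerate(fields) if v == ALL_ONES)

def long_add(X, Y, p=0):
    """Add two numbers given as lists of fields, least significant first.

    Returns (Z, q): the list of sum fields and the carry-out bit.
    """
    f = len(X)
    R = [(a + b) & ALL_ONES for a, b in zip(X, Y)]     # 1. field-wise sums
    x, y, r = (high_bit_mask(v) for v in (X, Y, R))    # 2. high-bit masks
    c = (x & y) | ((x | y) & ~r)                       # 3. carries generated
    b = all_ones_mask(R)                               # 4. bubble mask
    i = match_star(c * 2 + p, b)                       # 5. increment mask
    Z = [(v + ((i >> k) & 1)) & ALL_ONES               # 6. spread and add
         for k, v in enumerate(R)]
    return Z, (i >> f) & 1

# Rows X and Y of the figure, written most significant field first as printed.
X_FIG = [0x19, 0x31, 0xBA, 0x4C, 0x3D, 0x45, 0x21, 0xF1]
Y_FIG = [0x22, 0x12, 0x45, 0xB3, 0xE2, 0x16, 0x17, 0x36]

Z, q = long_add(X_FIG[::-1], Y_FIG[::-1])
print(" ".join(f"{v:02X}" for v in reversed(Z)), "carry-out:", q)
```

Run on the operands of the figure, the sketch reproduces row {\tt Z}. Only steps 1 and 6 touch full-width fields; everything in between operates on $f$-bit masks, which is the point of the technique.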

A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the
adder must take an additional carry-in bit
{\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
\[\text{\tt q} = \text{\tt i >> f}\]

As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPGPU implementations.  In principle, one can extend
the technique to additional levels.  Using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT style coordination at this level is an open question.

562Using the methods outlined, it is quite conceivable that instruction
564future SIMD and GPGPU processors.   Given the fundamental nature
565of addition as a primitive and its particular application to regular
566expression matching as shown herein, it seems reasonable to expect
567such instructions to become available.    Alternatively, it may
568be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and
570
571
572\input{sse2}
573
574\input{analysis}
575
576\input{avx2}
577
578
579
\section{GPGPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPGPU version
in OpenCL.
We arranged for 64 work groups each having 64 threads.
The work group size and the number of work groups were chosen
to provide the best occupancy as calculated by the AMD App Profiler.
Input files are divided in data parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.   Although the GPGPU
we are able to simulate them using shared memory.
its own carry and bubble values in shared memory and performs
parallel-prefix style process.  Others have implemented
long-stream addition on the GPGPU using similar techniques,
as noted previously.


We performed our tests on an AMD Radeon HD A10-6800K APU machine.
On the AMD Fusion systems, the input buffer is allocated in
pinned memory to take advantage of the zero-copy memory regions,
into which data can be read directly by the CPU
and which can also be accessed by the GPGPU for further processing. Therefore,
the expensive data transfer time needed by traditional
discrete GPGPUs is hidden and we compare only the kernel execution
time with our SSE2 and AVX implementations as shown in Figure
\ref{fig:SSE-AVX-GPU}. The GPGPU version gives a 30\% to 60\% performance
improvement over the SSE version and a 10\% to 40\% performance
improvement over the AVX version. Although we intended to process
64 work groups with 4096 bytes each at a time rather than 128 bytes
at a time on SSE or 256 bytes at a time on AVX, the performance
improvement is less than 55\%. The first reason is hardware
limitations. Our kernel occupancy is limited by register usage
and not all the work groups can be scheduled at the same time.
The second reason is that the long-stream addition implemented
on the GPGPU is more expensive than the implementations on SSE or AVX.
Another important reason is the control flow. When a possible
match is found in one thread, the rest of the threads in the
same work group have to execute the same instructions for
simple IF test. Therefore, the performance of different
regular expressions is dependent on the number of
long-stream addition operations and the total number of matches
of a given input.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URIorEmail,HexBytes},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
file {data/ssetime.dat};
file {data/avxtime.dat};
file {data/gputime.dat};

\legend{SSE2,AVX2,GPGPU,Annot}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}

\input{conclusion}

%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

\bibliographystyle{IEEEtranS}
\bibliography{reference}

\end{document}