# source:docs/Working/re/re-main.tex@3667

Last change on this file since 3667 was 3667, checked in by ksherdy, 5 years ago

Minor edit.

File size: 32.7 KB
Line
1\documentclass[pageno]{jpaper}
2
3%replace XXX with the submission number you are given from the PACT submission site.
4\newcommand{\pactsubmissionnumber}{XXX}
5
6\usepackage[normalem]{ulem}
7\usepackage{amssymb}
8\usepackage{amsmath}
9\usepackage{graphicx}
10\usepackage{tikz}
11\usepackage{pgfplots}
12
13\begin{document}
14
15\title{Bitwise Data Parallelism in Regular Expression Matching}
16
17
18\date{}
19\maketitle
20
21\thispagestyle{empty}
22
23
24\begin{abstract}
25\input{abstract}
26\end{abstract}
27
28\section{Introduction}
29
30The use of regular expressions to search texts for occurrences
31of string patterns has a long history and
32remains a pervasive technique throughout computing applications today.
33% {\em a brief history}
34The origins of regular expression matching date back to automata theory
35developed by Kleene in the 1950s \cite{kleene1951}.
36Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
37to nondeterministic finite automata (NFA).
38Following Thompson's approach, a regular expression of length $m$ is converted
39to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
40NFA in worst case $O(mn)$ time. Often, a more efficient choice
41is to convert an NFA into a DFA. A DFA has a single active state at any time
42throughout the matching process and
43hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well
44known that the conversion of an NFA to an equivalent DFA may result
45in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
46
47A significant proportion of the research in fast regular expression matching can be
48regarded as the quest for efficient automata'' \cite{navarro98fastand}.
49In \cite{baeza1992new}, Baeza-Yates and Gonnet
50presented a new approach to string search based on bit-level parallelism.
51This technique takes advantage of the intrinsic parallelism of bitwise operations
52within a computer word. Thus,
53given a $w$-bit word, the number of operations that a string search algorithms
54performs can be reduced by a factor $w$.
55Building on this observation, the Shift-Or algorithm simulates an NFA using
56bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
57A disadvantage of the Shift-Or approach
58is an inability to skip input characters.
59Simple string matching algorithms,
60such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
61to achieve sublinear times in the average case.
62% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
63% based on suffix automata are able to skip characters.
64The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
65combines the advantages of the Shift-Or approach
66with the ability to skip characters. %character skipping property of BDM algorithms.
67The nrgrep tool is based on the BNDM algorithm. Prior to the bitwise
68data parallel approach presented herein, nrgrep
69was the fastest grep tool
70for matching complex patterns, and achieved similar performance
71to the fastest existing string
72matching tools for simple patterns \cite{navarro2000}.
73
74There has been considerable recent
75interest in accelerating regular expression matching
76on parallel hardware
77such as multicore processors (CPUs),
78general purpose graphics processing units (GPGPUs),
79field-programmable gate arrays (FPGAs),
80and specialized architectures such as
81the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
82%CPU
83Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
84text processing algorithms that exhibit irregular memory access patterns
85can be efficiently executed on multicore hardware.
86In related work, Pasetto et al presented a flexible tool that
87performs small-ruleset regular expression matching at a rate of
882.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
89Naghmouchi et al \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
90string matching algorithm \cite{aho1975} is well-suited for parallel
91implementation on multicore CPUs, GPGPUs and the Cell BE.
92On each hardware, both thread-level parallelism (cores) and data-level parallelism
93(SIMD units) were leveraged for performance.
94Salapura et al \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
95in business analytics applications and leveraged the SIMD hardware available
96on multicore processors to achieve a speedup of greater than 1.8 over a
97range of data sizes of interest.
98%Cell
99In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
100that delivered 1.00--1.78 Gbps on a single
101Cell BE chip and extended this approach for emulation on the Intel Larrabee
102instruction set \cite{scarpazza2009larrabee}.
103On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
104implementation that delivered a throughput of 40
105Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
106Gbps for a larger dictionary containing thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
107presented a string matching implementation for automata that achieved
1084 Gbps on the Cell BE.
109% GPU
110In more recent work, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
111implementation of the AC algorithm for
112accelerating string matching on GPGPUs. Lin et al., proposed
113the Parallel Failureless Aho-Corasick (PFAC)
114algorithm to accelerate pattern matching on GPGPU hardware and
115achieved 143 Gbps raw data throughput,
116although system throughput was limited to 15 Gbps \cite{lin2013accelerating}.
117
118Whereas the existing approaches to parallelization have been
119based on adapting traditional sequential algorithms to emergent
120parallel architectures, we introduce both a new algorithmic
121approach and its implementation on SIMD and GPGPU architectures.
122This approach relies on a bitwise data parallel view of text
123streams as well as a surprising use of addition to match
124runs of characters in a single step.  The closest previous
125work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
126technology together with a parallel scanning primitive also
127based on addition \cite{cameron2011parallel}.
128However, in contrast to the deterministic, longest-match
129scanning associated with the ScanThru primitive of that
130work, we introduce here a new primitive MatchStar
131that can be used in full generality for nondeterministic
132regular expression matching.   We also introduce a long-stream
133addition technique involving a further application of MatchStar
134that enables us to scale the technique to $n$-bit addition
135in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
136for example, to perform
137synchronized 4096-bit addition on GPGPU wavefronts of 64 threads.
138
139There is also a strong keyword match between the bit-parallel
140data streams used in our approach and the bit-parallelism
141used for NFA state transitions in the classical algorithms of
142Wu and Manber \cite{wu1992agrep}, Baez-Yates and Gonnet \cite{baeza1992new}
143and Navarro and Raffinot \cite{navarro1998bit}.
144However those algorithms use bit-parallelism in a fundamentally
145different way: representing all possible current NFA states
146as a bit vector and performing parallel transitions to a new
147set of states using table lookups and bitwise logic.    Whereas
148our approach can match multiple characters per step, bit-parallel
149NFA algorithms proceed through the input one byte at a time.
150Nevertheless, the agrep \cite{wu1992agrep} and
151nrgrep \cite{navarro2000} programs implemented using these techniques remain
152among the strongest competitors in regular expression matching
153performance, so we include them in our comparative evaluation.
154
155The remainder of this paper is organized as follows.
156Section \ref{sec:grep} briefly describes regular expression
157notation and the grep problem.
158Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
159primitive using a model of arbitrary-length bit-parallel data streams.
160Section \ref{sec:blockwise} discusses the block-by-block
161implementation of our techniques including the long stream
162addition techniques for 256-bit addition with AVX2 and
1634096-bit additions with GPGPU SIMT.
164Section \ref{sec:SSE2} describes our overall SSE2 implementation
165and carries out a performance study in comparison with
166existing grep implementations.
167Given the dramatic variation in grep performance across
168different implementation techniques, expressions and data sets,
169Section \ref{sec:analysis} considers a comparison between
170the bit-stream and NFA approaches from a theoretical perspective.
171Section \ref{sec:AVX2} then examines and demonstrates
172the scalability of our
173bitwise data-parallel approach in moving from
174128-bit to 256-bit SIMD on Intel Haswell architecture.
175To further investigate scalability, Section \ref{sec:GPU}
176addresses the implementation of our matcher using
177groups of 64 threads working together SIMT-style on a GPGPU system.
178Section \ref{sec:Concl} concludes the paper with a discussion of results and
179areas for future work.
180
181\section{Regular Expression Notation and Grep}\label{sec:grep}
182
183We follow common POSIX notation for regular expressions.
184A regular expression specifies a set of strings through
185a pattern notation.   Individual characters normally
186stand for themselves, unless they are one of the
187special characters \verb:*+?[{\(|^$.: that serve as metacharacters 188of the notation system. Thus the regular expression \verb:cat: 189is a pattern for the set consisting of the single 3-character 190string \verb:cat:''. The special characters must be escaped 191with a backslash to prevent interpretation as metacharacter, thus 192\verb:\$: represents the dollar-sign and \verb:\\\\: represent
193the string consisting of two backslash characters.
194Character class bracket expressions are pattern elements
195that allow any character in a given class to be used in a particular
196context.  For example, \verb:[@#%]: is a regular expression
197that stands for any of the three given symbols.  Contiguous
198ranges of characters may be specified using hyphens;
199for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
200for any alphanumeric character or underscore.  If the
201caret character immediately follows the opening bracket,
202the class is negated, thus \verb:[^0-9]: stands for
203any character except a digit.  The period metacharacter
204\verb:.: stands for the class of all characters.
205
206Consecutive pattern elements stand for strings formed by
207concatenation, thus \verb:[cd][ao][tg]: stands for the
208set of 8 three-letter strings \verb:cat:'' through \verb:dog:''.
209The alternation operator \verb:|: allows a pattern to be
210defined to have two alternative forms, thus \verb:cat|dog:
211matches either \verb:cat:'' or \verb:dog:''.  Concatenation
212takes precedence over alternation, but parenthesis may be
213used to change this, thus \verb:(ab|cd)[0-9]: stands for any
214digit following one of the two prefixes  \verb:ab:'' or \verb:cd:''.
215
216Repetition operators may be appended to a pattern to specify
217a variable number of occurrences of that pattern.
218The Kleene star operator \verb:*: specifies zero-or-more occurrences
219matching the previous pattern, while Kleene plus \verb:+: specifies one-or
220more occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
221both specify the same set: strings of at least one lower-case
222letter.  The postfix operator \verb:?: specifies an optional
223component, i.e., zero-or-one occurrence of strings matching
224the element.  Specific bounds may be given within braces:
225\verb:(ab){3}: specifies the string \verb:ababab:'',
226\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
227or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
228strings of at least 4 consecutive capital letters.
229
230The grep program searches a file for lines containing matches
231to a regular expression using any of the above notations.
232In addition, the pattern elements \verb:^: and \verb:$: 233may be used to match respectively the beginning or the 234end of a line. In line-based tools such as grep, \verb:.: 235matches any character except newlines; matches cannot extend 236over lines. 237Normally, grep prints all matching 238lines to its output. However, grep programs typically 239allow a command line flag such as \verb:-c: to specify 240that only a count of matching lines be produced; we use 241this option in our experimental evaluation to focus 242our comparisons on the performance of the underlying matching 243algorithms. 244 245\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise} 246 247Whereas the traditional approaches to regular expression matching 248using NFAs, DFAs or backtracking all rely on a byte-at-a-time 249processing model, the approach we introduce in this paper is based 250on quite a different concept: a data-parallel approach to simultaneous 251processing of data stream elements. Indeed, our most abstract model 252is that of unbounded data parallelism: processing all elements of 253the input data stream simultaneously. In essence, data streams are viewed 254as (very large) integers. The fundamental operations are bitwise 255logic, stream shifting and long-stream addition. 256 257Depending on the available parallel processing resources, an actual 258implementation may divide an input stream into blocks and process 259the blocks sequentially. Within each block all elements of the 260input stream are processed together, relying on the availability of 261bitwise logic and addition scaled to the block size. On commodity 262Intel and AMD processors with 128-bit SIMD capabilities (SSE2), 263we typically process input streams 128 bytes at a time. 264In this 265case, we rely on the Parabix tool chain \cite{lin2012parabix} 266to handle the details of compilation to block-by-block processing. 267On the 268latest processors supporting the 256-bit AVX2 SIMD operations, 269we also use the Parabix tool chain, but substitute a parallelized 270long-stream addition technique to avoid the sequential chaining 271of 4 64-bit additions. 272Our GPGPU implementation uses scripts to modify the output 273of the Parabix tools, effectively dividing the input into blocks 274of 4K bytes. 275We also have adapted our long-stream addition technique 276to perform 4096-bit additions using 64 threads working in lock-step 277SIMT fashion. 278 279\begin{figure}[tbh] 280\begin{center} 281\begin{tabular}{cr}\\ 282input data & \verba453z--b3z--az--a12949z--ca22z7--\\ 283$B_7$& \verb.................................\\ 284$B_6$& \verb1...1..1.1..11..1.....1..11..1...\\ 285$B_5$& \verb111111111111111111111111111111111\\ 286$B_4$& \verb.1111...11...1...111111....1111..\\ 287$B_3$& \verb....111..111.111...1.1111....1.11\\ 288$B_2$& \verb.11..11...11..11....1..11.....111\\ 289$B_1$& \verb...11..111...1....1...1..1.1111..\\ 290$B_0$& \verb1.11.11.1.111.1111.1.1.1111...111\\ 291\verb:[a]: & \verb1...........1...1.........1......\\ 292\verb:[z9]: & \verb....1....1...1.....1.11......1...\\ 293\verb:[0-9]: & \verb.111....1........11111.....11.1.. 294\end{tabular} 295 296\end{center} 297\caption{Basis and Character Class Streams} 298\label{fig:streams} 299\end{figure} 300 301A key concept in this streaming approach is the derivation of bit streams 302that are parallel to the input data stream, i.e., in one-to-one 303correspondence with the data element positions of the input 304streams. Typically, the input stream is a byte stream comprising 305the 8-bit character code units of a particular encoding such 306as extended ASCII, ISO-8859-1 or UTF-8. However, the method may also 307easily be used with wider code units such as the 16-bit code units of 308UTF-16. In the case of a byte stream, the first step is to transpose 309the byte stream into eight parallel bit streams, such that bit stream 310$i$comprises the$i^\text{th}$bit of each byte. These streams form 311a set of basis bit streams from which many other parallel bit 312streams can be calculated, such as character class bit 313streams such that each bit$j$of the stream specifies 314whether character$j$of the input stream is in the class 315or not. Figure \ref{fig:streams} shows an example of an 316input byte stream in ASCII, the eight basis bit streams of the 317transposed representation, and the character class bit streams 318\verb:[a]:, 319\verb:[z9]:, and 320\verb:[0-9]: 321that may be computed from the basis bit streams using bitwise logic. 322Zero bits are marked with periods ({\tt .}) so that the one bits stand out. 323Transposition and character class construction are straightforward 324using the Parabix tool chain \cite{lin2012parabix}. 325 326\begin{figure}[tbh] 327\begin{center} 328\begin{tabular}{cr}\\ 329input data & \verba453z--b3z--az--a12949z--ca22z7--\\ 330$M_1$& \verb.1...........1...1.........1.....\\ 331$M_2$& \verb.1111........1...111111....111...\\ 332$M_3$& \verb.....1........1.....1.11......1.. 333\end{tabular} 334 335\end{center} 336\caption{Marker Streams in Matching {\tt a[0-9]*[z9]}} 337\label{fig:streams2} 338\end{figure} 339 340\paragraph*{Marker Streams.} Now consider how bit-parallel data 341streams can be used in regular expression matching. Consider 342the problem of searching the input stream of Figure \ref{fig:streams} 343to finding occurrence of strings matching 344the regular expression \verb:a[0-9]*[z9]:. 345Note that this is an ambiguous regular expression, which could match 346texts such as \verb:a12949z: in multiple ways. 347The matching process involves the concept of {\em marker streams}, that 348is streams that mark the positions of current matches during the 349overall process. In this case there are three marker streams computed 350during the match process, namely, 351$M_1$representing match positions after an initial \verb:a: 352character has been found,$M_2$representing positions 353reachable from positions marked by$M_1$by further matching zero or 354more digits (\verb:[0-9]*:) and finally$M_3$the stream 355marking positions after a final \verb:z: or \verb:9: has been found. 356Without describing the details of how these streams are computed 357for the time being, Figure \ref{fig:streams2} shows what each 358of these streams should be for our example matching problem. 359Our convention that a marker stream contains a 1 bit 360at the next character position to be matched, that is, 361immediately past the last position that was matched. 362Note that all three matches from the third occurrence of \verb:a: 363are correctly marked in$M_3$. 364 365 366\paragraph*{MatchStar.} 367MatchStar takes a marker bitstream and a character class bitstream as input. It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream. 368 369\begin{figure}[tbh] 370\begin{center} 371\begin{tabular}{cr}\\ 372input data & \verba453z--b3z--az--a12949z--ca22z7--\\ 373$M_1$& \verb.1...........1...1.........1.....\\ 374$D = \text{\tt [0-9]}$& \verb.111....1........11111.....11.1..\\ 375$T_0 = M_1 \wedge D$& \verb.1...............1.........1.....\\ 376$T_1 = T_0 + D$& \verb....1...1.............1......11..\\ 377$T_2 = T_1 \oplus D$& \verb.1111............111111....111...\\ 378$M_2 = T_2 \, | \, M_1$& \verb.1111........1...111111....111... 379\end{tabular} 380 381\end{center} 382\caption{$M_2 = \text{MatchStar}(M_1, D)$} 383\label{fig:matchstar} 384\end{figure} 385 386 387Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure, 388it is important to note that our bitstreams are shown in natural left-to-right order reflecting the 389conventional presentation of our character data input. However, this reverses the normal 390order of presentation when considering bitstreams as numeric values. The key point here is 391that when we perform bitstream addition, we will show bit movement from left-to-right. 392For example,$\verb:111.: + \verb:1...: = \verb:...1:$. 393 394The first row of the figure is the input data, 395the second and third rows are the input bitstreams: the initial marker position bitstream and the 396character class bitstream for digits derived from input data. 397 398In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic. Next, the temporary marker bitstream is added to the character class bitstream. 399The addition produces 1s in three types of positions. There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that weren't affected by the addition (and are not part of the desired output), and at any marker position that wasn't the first in its block of character class positions. Any character class positions that have a 0 in$T_1$were affected by the addition and are part of the desired output. These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream.$T_2$is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in$T_1$. The 400output marker stream is obtained by ORing$T_2$with the initial marker stream. 401 402In general, given a marker stream$M$and a character class stream$C$, 403the operation of MatchStar is defined by the following equation. 404$\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) | M$ 405Given a set of initial marker positions, the result stream marks 406all possible positions that can be reached by 0 or more occurrences 407of characters in class$C$from each position in$M$408 409MatchStar differs from ScanThru of the Parabix tool chain in that it 410finds all matches, not just the longest match. This is necessary 411for general matching involving possibly ambiguous regular 412expressions. 413 414\input{compilation} 415 416\input{re-Unicode} 417 418\section{Block-at-a-Time Processing}\label{sec:blockwise} 419 420The unbounded stream model of the previous section must of course 421be translated an implementation that proceeds block-at-a-time for 422realistic application. In this, we primarily rely on the Pablo 423compiler of the Parabix toolchain \cite{lin2012parabix}. Given input 424statements expressed as arbitrary-length bitstream equations, Pablo 425produces block-at-a-time C++ code that initializes and maintains all the necessary 426carry bits for each of the additions and shifts involved in the 427bitstream calculations. 428 429In the present work, our principal contribution to the Parabix tool 430chain is to incorporate the technique of long-stream addition described below. 431Otherwise, we were able to use Pablo directly in compiling our 432SSE2 and AVX2 implementations. Our GPGPU implementation required 433some scripting to modify the output of the Pablo compiler for our 434purpose. 435 436\paragraph*{Long-Stream Addition.} The maximum word size for 437addition on commodity processors is typically 64 bits. In order 438to implement long-stream addition for block sizes of 256 or larger, 439a method for propagating carries through the individual stages of 44064-bit addition is required. However, the normal technique of 441sequential addition using add-with-carry instructions, for example, 442is far from ideal. 443 444We use the following general model using SIMD methods for constant-time 445long-stream addition up to 4096 bits. Related GPGPU solutions have been 446independently developed\cite{Crovella2012}, 447however our model is intended to be a more broadly applicable abstraction. 448We assume the availability of the following SIMD/SIMT operations 449operating on vectors of$f$64-bit fields. 450\begin{itemize} 451\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields 452in two vectors to produce a result vector of$f$64-bit fields. 453\item \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields 454of \verb:x: each with the constant value -1 (all bits 1), producing 455an$f$-bit mask value, 456\item \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit 457field into a single compressed$f$-bit mask value, and 458\item normal bitwise logic operations on$f$-bit masks, and 459\item \verb#simd<64>::spread(X)#: distributing the bits of 460an$f$bit mask, one bit each to the$f$64-bit fields of a vector. 461\end{itemize} 462 463In this model, the \verb#hsimd<64>::mask(X)# and 464\verb#simd<64>::spread(X)# model the minimum 465communication requirements between the parallel processing units 466(SIMD lanes or SIMT processors). In essence, we just need 467the ability to quickly send and receive 1 bit of information 468per parallel unit. The \verb#hsimd<64>::mask(X)# operation 469gathers 1 bit from each of the processors to a central resource. 470After calculations on the gather bits are performed, we then 471just need an operation to invert the communication, i.e., 472sending 1 bit each from the central processor to each of 473the parallel units. There are a variety of ways in which 474these facilities may be implemented depending on the 475underlying architecture; details of our AVX2 and GPGPU implementations 476are presented later. 477 478Given these operations, our method for long stream addition of 479two$f \times 64$bit values \verb:X: and \verb:Y: is the following. 480\begin{enumerate} 481\item Form the vector of 64-bit sums of \verb:x: and \verb:y:. 482$\text{\tt R} = \text{\tt simd<64>::add(X, Y)}$ 483 484\item Extract the$f$-bit masks of \verb:X:, \verb:Y: and \verb:R:. 485$\text{\tt x} = \text{\tt hsimd<64>::mask(X)}$ 486$\text{\tt y} = \text{\tt hsimd<64>::mask(Y)}$ 487$\text{\tt r} = \text{\tt hsimd<64>::mask(R)}$ 488 489\item Compute an$f$-bit mask of carries generated for each of the 49064-bit additions of \verb:X: and \verb:Y:. 491$\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})$ 492 493\item Compute an$f$-bit mask of all fields of {\tt R} that will overflow with 494an incoming carry bit. This is called the {\em bubble mask}. 495$\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$ 496 497\item Determine an$f$-bit mask identifying the fields of {\tt R} that need to be 498incremented to produce the final sum. Here we find a new application of 499MatchStar. 500$\text{\tt i} = \text{\tt MatchStar(c*2, b)}$ 501 502This is the key step. The mask {\tt c} of outgoing carries must be 503shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated 504with the next digit. At the incoming position, the carry will 505increment the 64-bit digit. However, if this digit is all ones (as 506signaled by the corresponding bit of bubble mask {\tt b}, then the addition 507will generate another carry. In fact, if there is a sequence of 508digits that are all ones, then the carry must bubble through 509each of them. This is just MatchStar. 510 511\item Compute the final result {\tt Z}. 512$\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}$ 513 514\end{enumerate} 515\begin{figure} 516\begin{center} 517\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9} 518{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9} 519{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9} 520{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9} 521{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9} 522{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9} 523{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9} 524{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9} 525{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9} 526{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9} 527{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9} 528{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9} 529\end{tabular} 530\end{center} 531\caption{Long Stream Addition}\label{fig:longadd} 532\end{figure} 533 534Figure \ref{fig:longadd} illustrates the process. In the figure, 535we illustrate the process with 8-bit fields rather than 64-bit fields 536and show all field values in hexadecimal notation. Note that 537two of the individual 8-bit additions produce carries, while two 538others produce {\tt FF} values that generate bubble bits. The 539net result is that four of the original 8-bit sums must be 540incremented to produce the long stream result. 541 542A slight extension to the process produces a long-stream full adder 543that can be used in chained addition. In this case, the 544adder must take an additional carry-in bit 545{\tt p} and produce a carry-out bit {\tt q}. 546This may be accomplished by incorporating {\tt p} 547in calculating the increment mask in the low bit position, 548and then extracting the carry-out {\tt q} from the high bit position. 549$\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}$ 550$\text{\tt q} = \text{\tt i >> f}$ 551 552As described subsequently, we use a two-level long-stream addition technique 553in both our AVX2 and GPGPU implementations. In principle, one can extend 554the technique to additional levels. Using 64-bit adders throughout, 555$\lceil\log_{64}{n}\rceil$steps are needed for$n\$-bit addition.
556A three-level scheme could coordinate
55764 groups each performing 4096-bit long additions in a two-level structure.
558However, whether there are reasonable architectures that can support fine-grained
559SIMT style at this level is an open question.
560
561Using the methods outlined, it is quite conceivable that instruction
562set extensions to support long-stream addition could be added for
563future SIMD and GPGPU processors.   Given the fundamental nature
564of addition as a primitive and its particular application to regular
565expression matching as shown herein, it seems reasonable to expect
566such instructions to become available.    Alternatively, it may
567be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and
568\verb#simd<64>::spread(X)# operations are efficiently supported.
569
570
571\input{sse2}
572
573\input{analysis}
574
575\input{avx2}
576
577
578
579\section{GPGPU Implementation}\label{sec:GPU}
580
581To further assess the scalability of our regular expression matching
582using bit-parallel data streams, we implemented a GPGPU version
583in OpenCL.
584We arranged for 64 work groups each having 64 threads.
585The size of work group and number of work groups is chosen
586to provide the best occupancy as calculated by the AMD App Profiler.
587Input files are divided in data parallel fashion among
588the 64 work groups.  Each work group carries out the regular
589expression matching operations 4096 bytes at a time using SIMT
590processing.   Although the GPGPU
591does not directly support the mask and spread operations required
592by our long-stream addition model,
593we are able to simulate them using shared memory.
594Each thread maintains
595its own carry and bubble values in shared memory and performs
596synchronized updates with the other threads using a six-step
597parallel-prefix style process.  Others have implemented
598long-stream addition on the GPGPU using similar techniques,
599as noted previously.
600
601We performed our test on an AMD Radeon HD A10-6800K APU machine.
602On the AMD Fusion systems, the input buffer is allocated in
603pinned memory to take advantage of the zero-copy memory regions
604where data can be read directly into this region by the CPU
605and also accessed by the GPGPU for further processing. Therefore,
606the expensive data transferring time that is needed by traditional
607discrete GPGPUs is hidden and we compare only the kernel execution
608time with our SSE2 and AVX implementations as shown in Figure
609\ref{fig:SSE-AVX-GPU}. The GPGPU version gives 30\% to 60\% performance
610improvement over SSE version and 10\% to 40\% performance
611improvement over AVX version. Although we intended to process
61264 work groups with 4096 bytes each at a time rather than 128 bytes
613at a time on SSE or 256 bytes at a time on AVX, the performance
614improvement is less than 60\%. The first reason is hardware
615limitations. Our kernel occupancy is limited by register usage
616and not all the work groups can be scheduled at the same time.
617The second reason is that the long-stream addition implemented
618on GPGPU is more expensive than the implementations on SSE or AVX.
619Another important reason is the control flow. When a possible
620match is found in one thread, the rest of the threads in the
621same work group have to execute the same instructions for
622further processing rather than jump to the next block with a
623simple IF test. Therefore, the performance of different
624regular expressions is dependent on the number of
625long-stream addition operations and the total number of matches
626of a given input.
627
628\begin{figure}
629\begin{center}
630\begin{tikzpicture}
631\begin{axis}[
632xtick=data,
633ylabel=Running Time (ms per megabyte),
634xticklabels={@,Date,Email,URIorEmail,HexBytes},
635tick label style={font=\tiny},
636enlarge x limits=0.15,
637%enlarge y limits={0.15, upper},
638ymin=0,
639legend style={at={(0.5,-0.15)},
640anchor=north,legend columns=-1},
641ybar,
642bar width=7pt,
643]
644\addplot
645file {data/ssetime.dat};
646\addplot
647file {data/avxtime.dat};
648\addplot
649file {data/gputime.dat};
650
651\legend{SSE2,AVX2,GPGPU,Annot}
652\end{axis}
653\end{tikzpicture}
654\end{center}
655\caption{Running Time}\label{fig:SSE-AVX-GPU}
656\end{figure}
657
658
659
660
661
662
663
664
665\input{conclusion}
666
667
668
669%\appendix
670%\section{Appendix Title}
671
672%This is the text of the appendix, if you need one.
673
674
675This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
676MITACS, Inc.
677
678\bibliographystyle{IEEEtranS}
679\bibliography{reference}
680
681\end{document}
682
683
Note: See TracBrowser for help on using the repository browser.