%-----------------------------------------------------------------------------
%
%               Template for sigplanconf LaTeX Class
%
% Name:         sigplanconf-template.tex
%
% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
%               file for SIGPLAN conference proceedings.
%
% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
%               sigplanconf-guide.pdf
%
% Author:       Paul C. Anagnostopoulos
%               Windfall Software
%               978 371-2316
%               paul@windfall.com
%
% Created:      15 February 2005
%
%-----------------------------------------------------------------------------


\documentclass{sigplanconf}

% The following \documentclass options may be useful:

% preprint      Remove this option only once the paper is in final form.
% 10pt          To set in 10-point type instead of 9-point.
% 11pt          To set in 11-point type instead of 9-point.
% authoryear    To obtain author/year citation style instead of numeric.

\usepackage{amsmath}


\begin{document}

\special{papersize=8.5in,11in}
\setlength{\pdfpageheight}{\paperheight}
\setlength{\pdfpagewidth}{\paperwidth}
\conferenceinfo{PPoPP 2014}{February 15--19, 2014, Orlando, Florida, United States}
\copyrightyear{2014}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

% Uncomment one of the following two, if you are not going for the
% traditional copyright transfer agreement.

%\exclusivelicense                % ACM gets exclusive license to publish,
                                  % you retain copyright

%\permissiontopublish             % ACM gets nonexclusive license to publish
                                  % (paid open-access papers,
                                  % short abstracts)

\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Bitwise Data Parallelism in Regular Expression Matching}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
           {Simon Fraser University}
           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

\maketitle

\begin{abstract}
\input{abstract}
\end{abstract}
\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}

% general terms are not compulsory anymore,
% you may leave them out
%\terms
%term1, term2

\keywords
regular expression matching, grep, parallel bit stream technology

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the
NFA in worst-case $O(mn)$ time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state at any time
in the matching process, and
hence it is possible to search a text of length $n$ in worst-case $O(n)$ time, which is optimal.
However, it is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion, that is, the number of resulting DFA states may increase exponentially.
In \cite{Baeza-yates_anew} a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word. Given a $w$-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} uses the
bit-parallel approach to
simulate an NFA in $O(nm/w)$ worst-case time.
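
To make the bit-parallel simulation concrete, the following is a
minimal sketch of the classical Shift-Or search loop for a
single-word pattern (length $m \le 64$); it is our own illustration
rather than code from any of the cited implementations.
\begin{verbatim}
#include <stdint.h>
#include <string.h>

/* Shift-Or search: return the position at which the first match
   of pat (length m <= 64) ends in text (length n), or -1.
   B[c] has a 0 bit at position i iff pat[i] == c; state D has a
   0 bit at position i iff the prefix pat[0..i] matches here. */
long shift_or(const char *pat, size_t m,
              const char *text, size_t n) {
    uint64_t B[256];
    memset(B, 0xFF, sizeof B);
    for (size_t i = 0; i < m; i++)
        B[(unsigned char)pat[i]] &= ~(1ULL << i);
    uint64_t D = ~0ULL;                /* all states inactive */
    for (size_t j = 0; j < n; j++) {
        D = (D << 1) | B[(unsigned char)text[j]];
        if ((D & (1ULL << (m - 1))) == 0)
            return (long)j;            /* match ends at j */
    }
    return -1;
}
\end{verbatim}
Each input character costs one shift, one table lookup and one OR
when the pattern fits in a single word; longer patterns use
$\lceil m/w \rceil$ words, giving the $O(nm/w)$ bound.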

A disadvantage of the bit-parallel Shift-Or pattern matching approach
in comparison to simple string matching algorithms is an inability to skip input characters.
For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skips input characters
to achieve sublinear times in the average case. Backward Dawg Matching
(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of Shift-Or with the character-skipping advantages of the BDM algorithm.
The nrgrep pattern matching tool is built over the BNDM algorithm,
and hence the name nrgrep \cite{navarro2000}.

% {\em a brief review}
There has been considerable interest in using parallelization techniques
to improve the performance of regular expression matching on parallel hardware
such as multi-core processors (CPUs), graphics processing units (GPUs),
field-programmable gate arrays (FPGAs), and even more exotic architectures such as
the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed on multicore hardware.
In related work, Pasetto et al. presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited for parallel
implementation on multi-core CPUs, GPUs and the Cell BE \cite{scarpazza2011top, naghmouchi2010}.
On each platform, both thread-level parallelism (additional cores) and data-level parallelism
(wide SIMD units) are leveraged for performance.
Salapura et al. advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multi-core processors to achieve a speedup of better than 1.8 over a
range of data sizes of interest \cite{salapura2012accelerating}.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
that delivered 1.00--1.78 Gbps on a single
Cell BE chip and extended this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns, and a throughput of 3.3--3.4
Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieves
4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching on GPUs. Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps throughput, 14.74 times faster
than the AC algorithm running on a four-core
processor using OpenMP \cite{lin2013accelerating}.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.  We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\log_{64} n$ steps.  We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.
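
As a preview of the central idea, the following is a minimal sketch
of a MatchStar step on a single 64-bit word, using an addition-based
formulation: given marker bits and character-class bits, one
addition carries each marker through an entire run of class
characters.
\begin{verbatim}
#include <stdint.h>

/* MatchStar on one 64-bit word: from marker bits M and
   character-class bits C, mark all positions reachable by
   matching zero or more class characters.  (M & C) + C ripples
   a carry through each run of 1 bits in C that holds a marker;
   the XOR and OR then recover every position within and just
   past each such run. */
uint64_t MatchStar(uint64_t M, uint64_t C) {
    return (((M & C) + C) ^ C) | M;
}
\end{verbatim}
For example, with $C = \texttt{00111100}_2$ and a single marker at
the lowest bit of the run, one MatchStar step produces markers at
every position of the run together with the position immediately
past it.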

There is also an apparent terminological similarity between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.  Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.


The remainder of this paper is organized as follows.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques, including the long-stream
addition technique for 256-bit addition with AVX2 and
4096-bit addition with GPGPU SIMT.
Section \ref{sec:analysis} provides an analytical comparison of our approach with DFA- and NFA-based implementations.
Section \ref{sec:SSE2} describes our commodity SIMD implementation and its experimental evaluation.
Section \ref{sec:AVX2} examines the scalability of the approach with 256-bit AVX2 technology.
Section \ref{sec:GPU} presents our GPU implementation.
Section \ref{sec:Concl} concludes the paper with a discussion of areas for future work.


\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.  In essence, we view
data streams as (very large) integers.  The fundamental operations
we apply are based on bitwise logic and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.  Within each block, all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.  On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.  In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
As we show later, however, we have also adapted Parabix technology to process
blocks of 4K bytes at a time in our GPGPU implementation, relying
on 4096-bit addition implemented using
SIMT processing of 64 threads.
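
Within each block, bitwise logic is directly available at the SIMD
register width, but full-width addition is not: SSE2, for example,
offers 64-bit lane additions with no carry across lanes.  As a
minimal portable sketch (our own illustration, not Parabix code),
addition at a 128-bit width can be synthesized from 64-bit words,
with the carry out of each block flowing to the next:
\begin{verbatim}
#include <stdint.h>

/* Addition scaled to a 128-bit block, built from 64-bit words.
   The carry out of the low word feeds the high word, and the
   carry out of the block flows to the next block of the stream. */
typedef struct { uint64_t lo, hi; } bits128;

bits128 add128(bits128 a, bits128 b,
               unsigned cin, unsigned *cout) {
    bits128 r;
    r.lo = a.lo + b.lo + cin;
    unsigned c = (r.lo < a.lo) || (cin && r.lo == a.lo);
    r.hi = a.hi + b.hi + c;
    *cout = (r.hi < a.hi) || (c && r.hi == a.hi);
    return r;
}
\end{verbatim}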

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.  Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.  However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.  In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.  These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams in which bit $j$ specifies
whether character $j$ of the input stream is a member of the class
or not.  % perhaps a figure here.
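
Once the basis bit streams are in hand, any character class over
8-bit code units reduces to bitwise logic over the eight basis
streams.  The following sketch (our own illustration, numbering
bit 0 as the most significant bit of each byte) computes the class
stream for the letter `a' (0x61) one 64-bit block at a time:
\begin{verbatim}
#include <stdint.h>

/* Character-class stream for 'a' (0x61 = 01100001) from the
   eight basis bit streams, one 64-bit block at a time.
   basis[k] is assumed to hold bit k of each byte, most
   significant bit first, so class membership is a fixed
   AND/NOT expression over the basis streams. */
uint64_t cc_a(const uint64_t basis[8]) {
    return ~basis[0] &  basis[1] &  basis[2] & ~basis[3]
         & ~basis[4] & ~basis[5] & ~basis[6] &  basis[7];
}
\end{verbatim}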


\section{Block-at-a-Time Processing}\label{sec:blockwise}

\section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis}

\begin{enumerate}
\item Operations
\item Memory behaviour per input byte: note tables of DFA/NFA.

Bille and Thorup, \emph{Faster regular expression matching} \cite{bille2009faster}

\end{enumerate}


\section{Commodity SIMD Implementation and Experimental Evaluation}\label{sec:SSE2}
\subsection{Implementation Notes}
\subsection{Evaluation Methodology}
\subsection{Comparison}


\section{SIMD Scalability}\label{sec:AVX2}
\subsection{AVX Stream Addition}
We use MatchStar for carry propagation.
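
As a portable preview (our own sketch, with four 64-bit groups
standing in for the lanes of one 256-bit AVX2 register), carries out
of each group and all-ones groups are gathered into small masks, and
a single MatchStar step then identifies every group to increment:
\begin{verbatim}
#include <stdint.h>

static uint64_t matchstar(uint64_t m, uint64_t c) {
    return (((m & c) + c) ^ c) | m;
}

/* 256-bit addition via MatchStar carry propagation: c marks
   groups that generate a carry, b marks all-ones groups that
   would propagate one, and matchstar(c << 1, b) yields, in one
   step, every group whose sum must be incremented. */
void add256(const uint64_t x[4], const uint64_t y[4],
            uint64_t z[4]) {
    uint64_t r[4], c = 0, b = 0;
    for (int i = 0; i < 4; i++) {
        r[i] = x[i] + y[i];
        c |= (uint64_t)(r[i] < x[i]) << i;
        b |= (uint64_t)(r[i] == ~0ULL) << i;
    }
    uint64_t inc = matchstar(c << 1, b);
    for (int i = 0; i < 4; i++)
        z[i] = r[i] + ((inc >> i) & 1);
}
\end{verbatim}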

\section{GPU Implementation}\label{sec:GPU}

\section{Miscellaneous}
\subsection{Skipping}
\subsection{Unicode}

\section{Conclusion}\label{sec:Concl}
\subsection{Contributions}
\begin{enumerate}
\item New Algorithm Class for Regular Expression Matching
\item MatchStar for Character Class Repetition
\item Long Stream Addition
\item Implementations showing performance and scalability
\end{enumerate}
\subsection{Future Work}
\begin{enumerate}
\item Substring capture
\item Unicode character classes
\item Nonregular regexp features: zero-width assertions, backreferences
\item Multicore for ruleset parallelism
\end{enumerate}


%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

\acks

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{abbrvnat}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}


\end{document}