%-----------------------------------------------------------------------------
%
%               Template for sigplanconf LaTeX Class
%
% Name:         sigplanconf-template.tex
%
% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
%               file for SIGPLAN conference proceedings.
%
% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
%               sigplanconf-guide.pdf
%
% Author:       Paul C. Anagnostopoulos
%               Windfall Software
%               978 371-2316
%               paul@windfall.com
%
% Created:      15 February 2005
%
%-----------------------------------------------------------------------------


\documentclass{sigplanconf}

% The following \documentclass options may be useful:

% preprint      Remove this option only once the paper is in final form.
% 10pt          To set in 10-point type instead of 9-point.
% 11pt          To set in 11-point type instead of 9-point.
% authoryear    To obtain author/year citation style instead of numeric.

\usepackage{amsmath}


\begin{document}

\special{papersize=8.5in,11in}
\setlength{\pdfpageheight}{\paperheight}
\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{PPoPP 2014}{February 15-19, 2014, Orlando, Florida, United States} 
\copyrightyear{2014} 
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm} 
\doi{nnnnnnn.nnnnnnn}

% Uncomment one of the following two, if you are not going for the
% traditional copyright transfer agreement.

%\exclusivelicense                % ACM gets exclusive license to publish,
                                  % you retain copyright

%\permissiontopublish             % ACM gets nonexclusive license to publish
                                  % (paid open-access papers,
                                  % short abstracts)

\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Bitwise Data Parallelism in Regular Expression Matching}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
           {Simon Fraser University}
           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

\maketitle

\begin{abstract}
\input{abstract}
\end{abstract}
\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}

% general terms are not compulsory anymore,
% you may leave them out
%\terms
%term1, term2

\keywords
regular expression matching, grep, parallel bit stream technology

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the
NFA in worst-case $O(mn)$ time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state at any time
in the matching process and
hence it is possible to search a text of length $n$ in worst-case $O(n)$ time, which is optimal.
However, it is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion, i.e., the number of resulting DFA states may increase exponentially.
In \cite{Baeza-yates_anew}, a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word. Given a $w$-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} uses the
bit-parallel approach to
simulate an NFA in $O(nm/w)$ worst-case time.
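
As a concrete illustration, the following C sketch (our own
illustrative code, limited to patterns of at most $w = 64$
characters) shows the core of the Shift-Or algorithm: one table
lookup, one shift and one bitwise OR per input byte.

\begin{verbatim}
#include <stdint.h>
#include <string.h>

/* Return the position just past the first match of pat
   in text[0..n-1], or -1 if there is no match. */
long shift_or_search(const char *text, long n,
                     const char *pat) {
  long m = strlen(pat);          /* requires 1 <= m <= 64 */
  uint64_t mask[256];
  /* mask[c] has a 0 bit at position i iff pat[i] == c */
  memset(mask, 0xFF, sizeof mask);
  for (long i = 0; i < m; i++)
    mask[(unsigned char) pat[i]] &= ~(1ULL << i);
  uint64_t state = ~0ULL;        /* no active prefixes */
  for (long j = 0; j < n; j++) {
    state = (state << 1) | mask[(unsigned char) text[j]];
    if ((state & (1ULL << (m - 1))) == 0) return j + 1;
  }
  return -1;
}
\end{verbatim}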

A disadvantage of the bit-parallel Shift-Or pattern matching approach
in comparison to simple string matching algorithms is an inability to skip input characters.
For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skips input characters
to achieve sublinear times in the average case. Backward Dawg Matching
(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are also able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} 
combines the bit-parallel advantages of Shift-Or with the character-skipping advantages of the BDM algorithm.
The nrgrep pattern matching tool is built over the BNDM algorithm,
and hence the name nrgrep \cite{navarro2000}.

There has been considerable interest in using parallelization techniques
to improve the performance of regular expression matching.
% {\em a brief review}

Scarpazza and Braudaway demonstrate that irregular,
control-flow dominated text processing algorithms can be efficiently
executed on multicore hardware \cite{scarpazza2008fast}.
In related work, Pasetto et al. present a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. demonstrate that the Aho-Corasick
string matching algorithm is well suited for implementation on
commodity multicore processors, the Cell Broadband Engine, and GPGPU
hardware \cite{scarpazza2011top, naghmouchi2010}. In each case, both
thread-level (additional cores) and data-level (SIMD units) hardware
features are leveraged for performance.
On the Cell Broadband Engine, Scarpazza's implementation delivers a throughput of 40
Gbps for a small dictionary size of approximately 100 patterns, and a throughput of 3.3--3.4
Gbps for a larger dictionary of tens of thousands of patterns.
Scarpazza and Russell present a SIMD tokenizer that delivers 1.00--1.78 Gbps on a single
Cell Broadband Engine chip \cite{scarpazza2008} and extend this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}. Iorio and van Lunteren describe a string matching implementation for
automata that achieves 4 Gbps on the Cell Broadband Engine \cite{iorio2008}.
In more recent work, SIMD operations have been applied to accelerate
business analytics applications \cite{salapura2012accelerating} and GPUs
have been applied to accelerate regular expression matching
\cite{lin2013accelerating}.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.   
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.   We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\log_{64} n$ steps.   We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.
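
To preview the technique, the following is a minimal C sketch of one
way to realize MatchStar on a single 64-bit word; here $M$ is a
stream of marker bits, $C$ is a character-class bit stream, and bit
positions are taken in little-endian order.  Each marker that falls
within a run of class bits is extended through the entire run, with
the addition carrying it to the position just past the run.

\begin{verbatim}
#include <stdint.h>

/* All positions reachable from a marker in M by matching
   zero or more characters of class C. */
uint64_t MatchStar(uint64_t M, uint64_t C) {
  return (((M & C) + C) ^ C) | M;
}
\end{verbatim}

For example, with \verb|C = 0b01110| and a single marker
\verb|M = 0b00010|, the result is \verb|0b11110|: the marker is
extended through the run of class characters and also reaches the
first position past the run.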

There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new},
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.    Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques, including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit addition with GPGPU SIMT.
Section \ref{sec:analysis} provides an analytical comparison with DFA- and NFA-based implementations.
Section \ref{sec:SSE2} presents our commodity SIMD implementation and its experimental evaluation.
Section \ref{sec:AVX2} examines the scalability of the approach with 256-bit AVX2 technology.
Section \ref{sec:GPU} describes our GPU implementation.
Section \ref{sec:Concl} concludes the paper with a discussion of areas for future work.


\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements.  Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously.   In essence, we view
data streams as (very large) integers.   The fundamental operations
we apply are based on bitwise logic and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially.   Within each block, all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size.   On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time.   In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
As we show later, however, we have also adapted Parabix technology to process
blocks of 4K bytes at a time in our GPGPU implementation, relying
on our implementation of 4096-bit addition using
SIMT processing of 64 threads.

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.   Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.   In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated.  For example, a character class bit
stream has bit $j$ set if and only if character $j$ of the
input stream is a member of the class.
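
By way of illustration, the following naive C sketch (for exposition
only; the Parabix tool chain generates far more efficient SIMD
transposition code) transposes a 64-byte block into eight basis bit
streams and then computes the character-class stream for the letter
{\tt a} (code 0x61) by combining basis streams with bitwise logic.

\begin{verbatim}
#include <stdint.h>

/* basis[b] holds bit b of each of the 64 input bytes,
   with byte j mapped to bit position j. */
void make_basis(const uint8_t block[64], uint64_t basis[8]) {
  for (int b = 0; b < 8; b++) basis[b] = 0;
  for (int j = 0; j < 64; j++)
    for (int b = 0; b < 8; b++)
      basis[b] |= ((uint64_t)((block[j] >> b) & 1)) << j;
}

/* Character-class stream: bit j set iff block[j] == 'a'.
   'a' = 0x61 = 01100001: bits 0, 5 and 6 set. */
uint64_t cc_a(const uint64_t basis[8]) {
  return basis[0] & ~basis[1] & ~basis[2] & ~basis[3]
       & ~basis[4] & basis[5] & basis[6] & ~basis[7];
}
\end{verbatim}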


\section{Block-at-a-Time Processing}\label{sec:blockwise}




\section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis}

\begin{enumerate}
\item Operations
\item Memory behaviour per input byte: note tables of DFA/NFA.

Bille and Thorup, {\em Faster regular expression matching} \cite{bille2009faster}

\end{enumerate}



\section{Commodity SIMD Implementation and Experimental Evaluation}\label{sec:SSE2}
\subsection{Implementation Notes}
\subsection{Evaluation Methodology}
\subsection{Comparison}



\section{SIMD Scalability}\label{sec:AVX2}
\subsection{AVX Stream Addition}
We use MatchStar for carry propagation.
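
The principle, in a scalar C sketch (illustrative only; the actual
AVX2 implementation operates on 256-bit vector registers): add the
four 64-bit groups independently, record for each group a carry bit
(the group sum overflowed) and a bubble bit (the group sum is all
ones, so an incoming carry would ripple through it), then resolve
all carry propagation at once with a single MatchStar on those bits.

\begin{verbatim}
#include <stdint.h>

static uint64_t MatchStar(uint64_t M, uint64_t C) {
  return (((M & C) + C) ^ C) | M;
}

/* z = x + y (mod 2^256), with x, y, z each given as
   four 64-bit words, least significant word first. */
void add256(const uint64_t x[4], const uint64_t y[4],
            uint64_t z[4]) {
  uint64_t c = 0, b = 0;  /* per-group carry/bubble bits */
  for (int i = 0; i < 4; i++) {
    z[i] = x[i] + y[i];
    c |= (uint64_t)(z[i] < x[i]) << i;    /* carry-out  */
    b |= (uint64_t)(z[i] == ~0ULL) << i;  /* all ones   */
  }
  /* Each carry enters the next group and ripples through
     any run of all-ones groups; one MatchStar finds every
     group that must be incremented. */
  uint64_t incr = MatchStar(c << 1, b);
  for (int i = 0; i < 4; i++)
    z[i] += (incr >> i) & 1;
}
\end{verbatim}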

\section{GPU Implementation}\label{sec:GPU}

\section{Miscellaneous}
\subsection{Skipping}
\subsection{Unicode}

\section{Conclusion}\label{sec:Concl}
\subsection{Contributions}
\begin{enumerate}
\item New Algorithm Class for Regular Expression Matching
\item MatchStar for Character Class Repetition
\item Long Stream Addition
\item Implementations showing performance and scalability
\end{enumerate}
\subsection{Future Work}
\begin{enumerate}
\item Substring capture
\item Unicode character classes
\item Nonregular regexp features: zero-width assertions, backreferences
\item Multicore for ruleset parallelism
\end{enumerate}



%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

\acks

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{abbrvnat}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}


\end{document}
