%-----------------------------------------------------------------------------
%
%               Template for sigplanconf LaTeX Class
%
% Name:         sigplanconf-template.tex
%
% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
%               file for SIGPLAN conference proceedings.
%
% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
%               sigplanconf-guide.pdf
%
% Author:       Paul C. Anagnostopoulos
%               Windfall Software
%               978 371-2316
%               paul@windfall.com
%
% Created:      15 February 2005
%
%-----------------------------------------------------------------------------


\documentclass{sigplanconf}

% The following \documentclass options may be useful:

% preprint      Remove this option only once the paper is in final form.
% 10pt          To set in 10-point type instead of 9-point.
% 11pt          To set in 11-point type instead of 9-point.
% authoryear    To obtain author/year citation style instead of numeric.

\usepackage{amsmath}


\begin{document}

\special{papersize=8.5in,11in}
\setlength{\pdfpageheight}{\paperheight}
\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{PPoPP 2014}{February 15--19, 2014, Orlando, Florida, United States}
\copyrightyear{2014}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

% Uncomment one of the following two, if you are not going for the
% traditional copyright transfer agreement.

%\exclusivelicense                % ACM gets exclusive license to publish,
                                  % you retain copyright

%\permissiontopublish             % ACM gets nonexclusive license to publish
                                  % (paid open-access papers,
                                  % short abstracts)

\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Bitwise Data Parallelism in Regular Expression Matching}
\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
           {Simon Fraser University}
           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

\maketitle

\begin{abstract}
\input{abstract}
\end{abstract}
\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}

% general terms are not compulsory anymore,
% you may leave them out
%\terms
%term1, term2

\keywords
regular expression matching, grep, parallel bit stream technology

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the
NFA in worst-case $O(mn)$ time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state at any time
in the matching process and
hence it is possible to search a text of length $n$ in worst-case $O(n)$ time, which is optimal.
However, it is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion, that is, the number of resultant DFA states may increase exponentially.
In \cite{Baeza-yates_anew}, a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word. Given a $w$-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} uses the
bit-parallel approach to
simulate an NFA in $O(nm/w)$ worst-case time.
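The Shift-Or recurrence itself can be sketched in a few lines of C. The sketch below is illustrative rather than a production implementation (names and the 64-bit word size are our choices): a 0 bit at position $j$ of the state word records an active match of the pattern prefix of length $j+1$, so a 0 at bit $m-1$ signals a complete match.

```c
#include <stdint.h>
#include <string.h>

/* Shift-Or search: record the ending positions of occurrences of
   pattern in text.  A 0 bit at position j of the state word R means
   the pattern prefix of length j+1 currently matches; a 0 at bit m-1
   signals a full match.  Pattern length must not exceed 64. */
int shift_or(const char *pattern, const char *text, int *ends, int max_ends) {
    size_t m = strlen(pattern);
    uint64_t B[256];
    for (int c = 0; c < 256; c++) B[c] = ~0ULL;        /* 1 = mismatch   */
    for (size_t j = 0; j < m; j++)
        B[(unsigned char)pattern[j]] &= ~(1ULL << j);  /* 0 where c occurs */
    uint64_t R = ~0ULL;                                /* no active prefixes */
    int n_ends = 0;
    for (size_t i = 0; text[i]; i++) {
        R = (R << 1) | B[(unsigned char)text[i]];      /* extend all prefixes */
        if (!(R & (1ULL << (m - 1))) && n_ends < max_ends)
            ends[n_ends++] = (int)i;                   /* match ends at i */
    }
    return n_ends;
}
```

Each input character updates all $m$ prefix states with one shift and one OR, which is the source of the $O(nm/w)$ bound when $m > w$.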

A disadvantage of the bit-parallel Shift-Or pattern matching approach
in comparison to simple string matching algorithms is an inability to skip input characters.
For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skips input characters
to achieve sublinear times in the average case. Backward Dawg Matching
(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are also able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of Shift-Or with the character skipping advantages of the BDM algorithm.
The nrgrep pattern matching tool is built over the BNDM algorithm,
and hence the name nrgrep \cite{navarro2000}.

There has been considerable interest in using parallelization techniques
to improve the performance of regular expression matching.
% {\em a brief review}

Scarpazza and Braudaway demonstrate that irregular,
control-flow dominated text processing algorithms can be efficiently
executed on multicore hardware \cite{scarpazza2008fast}.
In related work, Pasetto et al. present a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. demonstrate that the Aho-Corasick
string matching algorithm is well suited for implementation on commodity multicore processors,
the Cell Broadband Engine and GPGPU hardware \cite{scarpazza2011top, naghmouchi2010}. In each case, both
thread-level (additional cores) and data-level (SIMD units) hardware features are leveraged for performance.
On the Cell Broadband Engine, Scarpazza's implementation delivers a throughput of 40
Gbps for a small dictionary size of approximately 100 patterns, and a throughput of 3.3--3.4
Gbps for a larger dictionary of tens of thousands of patterns.
Scarpazza and Russell present a SIMD tokenizer that delivers 1.00--1.78 Gbps on a single
Cell Broadband Engine chip \cite{scarpazza2008} and extend this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}. Iorio and van Lunteren describe a string matching implementation for
automata that achieves 4 Gbps on the Cell Broadband Engine \cite{iorio2008}.
In more recent work, further acceleration of pattern matching has been
pursued using SIMD operations \cite{salapura2012accelerating}
and GPUs.
% GPUs\cite{lin2013accelerating}

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step.  The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching.   We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\log_{64} n$ steps.   We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.
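To give a concrete flavor of the primitive, the following minimal C sketch shows one way MatchStar can be realized with only bitwise logic and a single addition; the formula and word width here are an illustrative sketch, not the full SIMD implementation described later. Bit $j$ of a stream refers to position $j$ of the input.

```c
#include <stdint.h>

/* MatchStar(M, C): extend each marker bit in M through the run of
   contiguous 1 bits in the class stream C at that position, marking
   every position reachable by matching zero or more class characters.
   The single addition carries a marker through each run of 1s in C;
   the XOR recovers the run positions and the bit just past the run. */
uint64_t match_star(uint64_t markers, uint64_t klass) {
    return (((markers & klass) + klass) ^ klass) | markers;
}
```

For example, with class bits $C = 0b0111100$ (a run at positions 2--5) and a single marker at position 2, the result marks positions 2--6: the whole run plus the position immediately following it, which is exactly the set of marker positions needed to continue a match after $C^*$.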

There is also a strong terminological connection between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic.    Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.


The remainder of this paper is organized as follows.
Section \ref{sec:unbounded} presents our basic algorithm using
a model of unbounded bitwise data parallelism.
Section \ref{sec:regexp} introduces the MatchStar primitive and its use in bitwise regular expression matching.
Section \ref{sec:analysis} provides an analytical comparison with DFA and NFA implementations.
Section \ref{sec:SSE2} describes our commodity SIMD implementation and presents an experimental evaluation.
Section \ref{sec:AVX2} examines SIMD scalability, including AVX stream addition.
Section \ref{sec:GPU} presents our GPU implementation.
Section \ref{sec:Concl} concludes the paper with a discussion of areas for future work.


\section{An Unbounded Model for Bitwise Regular Expression Matching}\label{sec:unbounded}

Whereas all traditional regular expression matching engines are
fundamentally based on an element-at-a-time processing model,
the approach we introduce in this paper is based on quite a
different conceptual model: processing all elements of an input data
stream simultaneously.    Depending on the available parallel
processing resources, an actual implementation may divide an
input stream into segments and process the segments
sequentially.   Within each segment, all elements of the input
stream are processed in parallel.   The approach is inherently
scalable and can be adapted to a variety of parallel architectures.
The fundamental primitives required are bitwise logic and
long-stream addition.

A key concept in this approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams.   Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16.   In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated, such as character class bit
streams, in which each bit $j$ specifies
whether character $j$ of the input stream is in the class
or not.   % perhaps a figure here.
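As a minimal illustration of these two steps, the following C sketch transposes a small block of bytes into basis bit streams and then computes a character class stream for the ASCII digits purely with bitwise logic. The LSB-first bit numbering, the 8-byte block, and the loop-based transposition are simplifications for exposition; an actual implementation performs the transposition with SIMD pack operations over much larger blocks.

```c
#include <stdint.h>

/* Transpose a block of 8 input bytes into eight basis bit streams:
   bit j of basis[i] is bit i of input byte j (LSB-first numbering). */
void transpose_block(const uint8_t input[8], uint8_t basis[8]) {
    for (int i = 0; i < 8; i++) {
        uint8_t s = 0;
        for (int j = 0; j < 8; j++)
            s |= (uint8_t)(((input[j] >> i) & 1) << j);
        basis[i] = s;
    }
}

/* Character class stream for ASCII digits [0-9] (0x30..0x39), computed
   with bitwise logic on the basis streams: high nibble must be 0011,
   and the low nibble must not be 1010 or above (i.e., not b3&(b2|b1)). */
uint8_t digit_class(const uint8_t b[8]) {
    return (uint8_t)(~b[7] & ~b[6] & b[5] & b[4] & ~(b[3] & (b[2] | b[1])));
}
```

For the input block `"a1b22c9!"`, the digit class stream has 1 bits exactly at positions 1, 3, 4 and 6, and every position is classified simultaneously by a handful of bitwise operations.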

Using the SIMD capabilities of commodity processors,
the basis bit streams and character class bit streams may be
efficiently computed using the Parabix framework that
has been developed and used extensively for high-performance
XML parsing \cite{}.  For example, working with the 128-bit
SIMD registers on commodity Intel and AMD processors,
the Parabix framework can be employed to process input
streams in blocks of 128 bytes at a time.   However, the
general model is not dependent on any particular data block size
and can be scaled for arbitrary sized blocks depending on
available resources.

\section{Bitwise Regular Expression Matching with MatchStar}\label{sec:regexp}


\subsection{Bitwise Data Parallelism vs. Bit-Parallel State Transition}
The bitap algorithms use bitwise operations in an orthogonal way.

\section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis}

\begin{enumerate}
\item Operations
\item Memory behaviour per input byte: note tables of DFA/NFA.

Bille and Thorup, \emph{Faster regular expression matching} \cite{bille2009faster}

\end{enumerate}



\section{Commodity SIMD Implementation and Experimental Evaluation}\label{sec:SSE2}
\subsection{Implementation Notes}
\subsection{Evaluation Methodology}
\subsection{Comparison}



\section{SIMD Scalability}\label{sec:AVX2}
\subsection{AVX Stream Addition}
  We use MatchStar for carry propagation.
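To illustrate the idea, the following C sketch performs long-stream addition over four 8-bit fields: each field is summed independently, per-field carry and ``bubble'' (all-ones) bits are collected, and a single MatchStar call propagates carries across runs of saturated fields. The field width, field count, and helper names are illustrative choices for exposition, not the AVX implementation itself.

```c
#include <stdint.h>

/* MatchStar over a small bit mask: extend markers through runs of 1s. */
static uint32_t match_star(uint32_t m, uint32_t c) {
    return (((m & c) + c) ^ c) | m;
}

#define NFIELDS 4

/* Long-stream addition of two NFIELDS*8-bit numbers stored as 8-bit
   fields, least-significant field first.  Per-field sums yield carry
   bits (field overflowed) and bubble bits (field is all ones, so an
   incoming carry would ripple straight through); one MatchStar call
   then identifies every field that must be incremented. */
static void long_add(const uint8_t x[NFIELDS], const uint8_t y[NFIELDS],
                     uint8_t z[NFIELDS]) {
    uint32_t carry = 0, bubble = 0;
    for (int i = 0; i < NFIELDS; i++) {
        unsigned r = (unsigned)x[i] + y[i];
        z[i] = (uint8_t)r;
        if (r > 0xFF)      carry  |= 1u << i;  /* carry out of field i  */
        if (z[i] == 0xFF)  bubble |= 1u << i;  /* saturated: propagates */
    }
    uint32_t inc = match_star(carry << 1, bubble);  /* fields to bump */
    for (int i = 0; i < NFIELDS; i++)
        if (inc & (1u << i)) z[i] += 1;             /* 0xFF wraps to 0 */
}
```

Adding 1 to 0x00FFFFFF exercises the interesting case: the carry out of field 0 ripples through two saturated fields and lands in field 3, yet only one addition and a few bitwise operations are needed regardless of carry distance, which is what lets the technique scale to very wide additions.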

\section{GPU Implementation}\label{sec:GPU}

\section{Miscellaneous}
\subsection{Skipping}
\subsection{Unicode}

\section{Conclusion}\label{sec:Concl}
\subsection{Contributions}
\begin{enumerate}
\item New Algorithm Class for Regular Expression Matching
\item MatchStar for Character Class Repetition
\item Long Stream Addition
\item Implementations showing performance and scalability
\end{enumerate}
\subsection{Future Work}
\begin{enumerate}
\item Substring capture
\item Unicode character classes
\item Nonregular regexp features: zero-width assertions, backreferences
\item Multicore for ruleset parallelism
\end{enumerate}


%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

\acks

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{abbrvnat}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}


\end{document}