%-----------------------------------------------------------------------------
%
%          Template for sigplanconf LaTeX Class
%
% Name:     sigplanconf-template.tex
%
% Purpose:  A template for sigplanconf.cls, which is a LaTeX 2e class
%           file for SIGPLAN conference proceedings.
%
% Guide:    Refer to "Author's Guide to the ACM SIGPLAN Class,"
%           sigplanconf-guide.pdf
%
% Author:   Paul C. Anagnostopoulos
%           Windfall Software
%           978 371-2316
%           paul@windfall.com
%
% Created:  15 February 2005
%
%-----------------------------------------------------------------------------

\documentclass{sigplanconf}

% The following \documentclass options may be useful:

% preprint       Remove this option only once the paper is in final form.
% 10pt           To set in 10-point type instead of 9-point.
% 11pt           To set in 11-point type instead of 9-point.
% authoryear     To obtain author/year citation style instead of numeric.

\usepackage{amsmath}

\begin{document}

\special{papersize=8.5in,11in}
\setlength{\pdfpageheight}{\paperheight}
\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{PPoPP 2014}{February 15--19, 2014, Orlando, Florida, United States}
\copyrightyear{2014}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

% Uncomment one of the following two, if you are not going for the
% traditional copyright transfer agreement.

%\exclusivelicense               % ACM gets exclusive license to publish,
                                 % you retain copyright

%\permissiontopublish            % ACM gets nonexclusive license to publish
                                 % (paid open-access papers,
                                 % short abstracts)

\titlebanner{banner above paper title}         % These are ignored unless
\preprintfooter{short description of paper}    % 'preprint' option specified.

\title{Bitwise Data Parallelism in Regular Expression Matching}
\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
{Simon Fraser University}
{\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

\maketitle

\begin{abstract}
\input{abstract}
\end{abstract}
\category{Theory of computation}{Formal languages and automata theory}{Regular languages}
\category{Computer systems organization}{Parallel architectures}{Single instruction, multiple data}

% general terms are not compulsory anymore,
% you may leave them out
%\terms
%term1, term2

\keywords
regular expression matching, grep, parallel bit stream technology

\section{Introduction}

The use of regular expressions to search texts for occurrences
of string patterns has a long history and
remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory
developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA).
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the
NFA in worst-case $O(mn)$ time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state at any time
in the matching process and
hence it is possible to search a text of length $n$ in worst-case $O(n)$ time,
which is optimal.
However, it is well known that the conversion of an NFA to an equivalent DFA may result
in state explosion; that is, the number of resultant DFA states may increase exponentially.
A new approach to text searching, based on bit-parallelism \cite{baeza1992new},
was proposed in \cite{Baeza-yates_anew}.
This technique takes advantage of the intrinsic parallelism of bitwise operations
within a computer word. Given a $w$-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} uses the
bit-parallel approach to
simulate an NFA in $O(nm/w)$ worst-case time.

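To make the bit-parallel recurrence concrete, the following is an illustrative sketch of the classical Shift-Or algorithm (not part of this paper's artifacts; the function name and structure are expository only). Python's unbounded integers stand in for the $w$-bit machine word; state bit $i$ is 0 when the NFA state for pattern position $i$ is active.

```python
def shift_or_search(pattern, text):
    """Report end positions of exact matches of `pattern` in `text`
    using the bit-parallel Shift-Or algorithm: one word holds the set
    of active NFA states, complemented so that a 0 bit means active."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c] has a 0 bit at position i iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D = all_ones                      # no active states initially
    matches = []
    for pos, c in enumerate(text):
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if D & (1 << (m - 1)) == 0:   # final state active: match ends here
            matches.append(pos)
    return matches
```

Each input character costs only a shift, a table lookup and an OR, which is the $O(nm/w)$ behaviour cited above when $m \le w$.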
A disadvantage of the bit-parallel Shift-Or pattern matching approach
in comparison to simple string matching algorithms is an inability to skip input characters.
For example, the Boyer-Moore family of algorithms \cite{boyer1977fast} skips input characters
to achieve sublinear times in the average case. Backward Dawg Matching
(BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are also able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
combines the bit-parallel advantages of Shift-Or with the character-skipping advantages of the BDM algorithm.
The nrgrep pattern matching tool is built over the BNDM algorithm,
and hence the name nrgrep \cite{navarro2000}.

% {\em a brief review}
There has been considerable interest in using parallelization techniques
to improve the performance of regular expression matching on parallel hardware
such as multicore processors (CPUs), graphics processing units (GPUs),
field-programmable gate arrays (FPGAs), and even more exotic architectures such as
the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
text processing algorithms that exhibit irregular memory access patterns
can be efficiently executed on multicore hardware.
In related work, Pasetto et al. presented a flexible tool that
performs small-ruleset regular expression matching at a rate of
2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
Naghmouchi et al. demonstrated that the Aho-Corasick (AC)
string matching algorithm \cite{aho1975} is well suited for parallel
implementation on multicore CPUs, GPUs and the Cell BE \cite{scarpazza2011top, naghmouchi2010}.
On each platform, both thread-level parallelism (additional cores) and data-level parallelism
(wide SIMD units) are leveraged for performance.
Salapura et al. advocated the use of vector-style processing for regular expressions
in business analytics applications and leveraged the SIMD hardware available
on multicore processors to achieve a speedup of better than 1.8 over a
range of data sizes of interest \cite{salapura2012accelerating}.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
that delivered 1.00--1.78 Gbps on a single
Cell BE chip and extended this approach for emulation on the Intel Larrabee
instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
implementation that delivered a throughput of 40
Gbps for a small dictionary of approximately 100 patterns, and a throughput of 3.3--3.4
Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
presented a string matching implementation for automata that achieves
4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
implementation of the AC algorithm for
accelerating string matching on GPUs. Lin et al. proposed
the Parallel Failureless Aho-Corasick (PFAC)
algorithm to accelerate pattern matching on GPU hardware and
achieved 143 Gbps throughput, 14.74 times faster
than the AC algorithm performed on a four-core
multicore processor using OpenMP \cite{lin2013accelerating}.

Whereas the existing approaches to parallelization have been
based on adapting traditional sequential algorithms to emergent
parallel architectures, we introduce both a new algorithmic
approach and its implementation on SIMD and GPU architectures.
This approach relies on a bitwise data parallel view of text
streams as well as a surprising use of addition to match
runs of characters in a single step. The closest previous
work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
technology together with a parallel scanning primitive also
based on addition \cite{cameron2011parallel}.
However, in contrast to the deterministic, longest-match
scanning associated with the ScanThru primitive of that
work, we introduce here a new primitive MatchStar
that can be used in full generality for nondeterministic
regular expression matching. We also introduce a long-stream
addition technique involving a further application of MatchStar
that enables us to scale the technique to $n$-bit addition
in $\log_{64} n$ steps. We ultimately apply this technique,
for example, to perform
synchronized 4096-bit addition on GPU warps of 64 threads.

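As an illustrative preview (a Python sketch under the assumption that MatchStar takes the carry-based form $(((M \wedge C) + C) \oplus C) \vee M$, with unbounded integers modeling arbitrary-length bit streams), a single addition suffices to advance a marker through an entire run of class characters:

```python
def match_star(markers, char_class):
    """MatchStar sketch: bit i of each stream corresponds to text
    position i.  The addition carries each marker through a contiguous
    run of class characters, marking every position reachable by
    matching zero or more class characters from that marker."""
    return (((markers & char_class) + char_class) ^ char_class) | markers
```

For example, with a marker at position 1 and class characters at positions 1, 2, 4 and 5, the result marks positions 1, 2 and 3: the marker itself, plus each position reached by consuming one more class character, which is exactly the nondeterministic closure needed for a repetition such as `[0-9]*`.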
There is also a strong keyword match between the bit-parallel
data streams used in our approach and the bit-parallelism
used for NFA state transitions in the classical algorithms of
Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
and Navarro and Raffinot \cite{navarro1998bit}.
However, those algorithms use bit-parallelism in a fundamentally
different way: representing all possible current NFA states
as a bit vector and performing parallel transitions to a new
set of states using table lookups and bitwise logic. Whereas
our approach can match multiple characters per step, bit-parallel
NFA algorithms proceed through the input one byte at a time.
Nevertheless, the agrep \cite{wu1992agrep} and
nrgrep \cite{navarro2000} programs implemented using these techniques remain
among the strongest competitors in regular expression matching
performance, so we include them in our comparative evaluation.

The remainder of this paper is organized as follows.
Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
using a model of arbitrary-length bit-parallel data streams.
Section \ref{sec:blockwise} discusses the block-by-block
implementation of our techniques including the long-stream
addition techniques for 256-bit addition with AVX2 and
4096-bit addition with GPGPU SIMT.
Section \ref{sec:analysis} provides an analytical comparison with DFA and NFA implementations.
Section \ref{sec:SSE2} describes our commodity SIMD implementation and its experimental evaluation.
Section \ref{sec:AVX2} examines the scalability of our approach with wider SIMD units.
Section \ref{sec:GPU} presents our GPU implementation.
Section \ref{sec:Concl} concludes the paper with a discussion of areas for future work.

\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}

Whereas the traditional approaches to regular expression matching
using NFAs, DFAs or backtracking all rely on a byte-at-a-time
processing model, the approach we introduce in this paper is based
on quite a different concept: a data-parallel approach to simultaneous
processing of data stream elements. Indeed, our most abstract model
is that of unbounded data parallelism: processing all elements of
the input data stream simultaneously. In essence, we view
data streams as (very large) integers. The fundamental operations
we apply are based on bitwise logic and long-stream addition.

Depending on the available parallel processing resources, an actual
implementation may divide an input stream into blocks and process
the blocks sequentially. Within each block all elements of the
input stream are processed together, relying on the availability of
bitwise logic and addition scaled to the block size. On commodity
Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
we typically process input streams 128 bytes at a time. In this
case, we rely on the Parabix tool chain \cite{lin2012parabix}
to handle the details of compilation to block-by-block processing.
As we show later, however, we have also adapted Parabix technology to processing
blocks of 4K bytes at a time in our GPGPU implementation, relying
on 4096-bit addition implemented using
SIMT processing of 64 threads.

A key concept in this streaming approach is the derivation of bit streams
that are parallel to the input data stream, i.e., in one-to-one
correspondence with the data element positions of the input
streams. Typically, the input stream is a byte stream comprising
the 8-bit character code units of a particular encoding such
as extended ASCII, ISO-8859-1 or UTF-8. However, the method may also
easily be used with wider code units such as the 16-bit code units of
UTF-16. In the case of a byte stream, the first step is to transpose
the byte stream into eight parallel bit streams, such that bit stream
$i$ comprises the $i^\text{th}$ bit of each byte. These streams form
a set of basis bit streams from which many other parallel bit
streams can be calculated. For example, a character class bit
stream has bit $j$ set if and only if
character $j$ of the input stream is in the class.
% {\bf perhaps a figure here.}

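As an illustrative model of this derivation (a Python sketch; the helper names are ours and not part of any published Parabix API), the basis bit streams and a character class stream can be computed directly, again with bit $j$ of each stream corresponding to byte position $j$:

```python
def transpose_basis(data: bytes):
    """Transpose a byte stream into eight parallel basis bit streams:
    stream i holds bit i of every byte in the input."""
    basis = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if byte & (1 << i):
                basis[i] |= 1 << j
    return basis

def char_class_stream(data: bytes, chars):
    """Character class stream: bit j is set iff data[j] is in `chars`.
    (A real implementation derives this from the basis streams using
    only bitwise logic; this direct form just shows the result.)"""
    s = 0
    for j, byte in enumerate(data):
        if byte in chars:
            s |= 1 << j
    return s
```

In the transposed form, a class like `[a]` over the whole input becomes a single integer on which the bitwise operations of the previous section act on all positions at once.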

\section{Block-at-a-Time Processing}\label{sec:blockwise}


\section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis}

\begin{enumerate}
\item Operations
\item Memory behaviour per input byte: note tables of DFA/NFA.

Bille and Thorup, \emph{Faster regular expression matching} \cite{bille2009faster}

\end{enumerate}

\section{Commodity SIMD Implementation and Experimental Evaluation}\label{sec:SSE2}
\subsection{Implementation Notes}
\subsection{Evaluation Methodology}
\subsection{Comparison}


\section{SIMD Scalability}\label{sec:AVX2}
\subsection{AVX Stream Addition}
We use MatchStar for carry propagation.
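As a model of this carry-propagation scheme (a Python sketch; unbounded integers stand in for SIMD registers, and the 64-bit chunking mirrors the fields of an AVX register or the lanes of a SIMT warp), each chunk is summed independently, then per-chunk carry bits and all-ones "bubble" bits are propagated in a single MatchStar step:

```python
MASK64 = (1 << 64) - 1

def match_star(markers, runs):
    # MatchStar as used for regex matching, reused here on the
    # per-chunk carry/bubble bit vectors.
    return (((markers & runs) + runs) ^ runs) | markers

def long_add(x, y, nchunks):
    """Long-stream addition sketch: add two nchunks*64-bit values by
    summing 64-bit chunks independently, then resolving inter-chunk
    carries with one MatchStar over carry and bubble bits."""
    sums, carries, bubbles = [], 0, 0
    for i in range(nchunks):
        s = ((x >> (64 * i)) & MASK64) + ((y >> (64 * i)) & MASK64)
        if s > MASK64:
            carries |= 1 << i     # this chunk produced a carry out
        s &= MASK64
        if s == MASK64:
            bubbles |= 1 << i     # an incoming carry would ripple through
        sums.append(s)
    # Each chunk marked below receives an increment of 1: a carry enters
    # the next chunk, and runs of all-ones chunks pass it onward.
    incr = match_star(carries << 1, bubbles)
    z = 0
    for i in range(nchunks):
        z |= ((sums[i] + ((incr >> i) & 1)) & MASK64) << (64 * i)
    return z
```

Because the carry/bubble vectors hold one bit per chunk, the same construction can be iterated hierarchically, which is the source of the $\log_{64} n$ scaling claimed in the introduction.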

\section{GPU Implementation}\label{sec:GPU}

\section{Miscellaneous}
\subsection{Skipping}
\subsection{Unicode}

\section{Conclusion}\label{sec:Concl}
\subsection{Contributions}
\begin{enumerate}
\item New Algorithm Class for Regular Expression Matching
\item MatchStar for Character Class Repetition
\item Long Stream Addition
\item Implementations showing performance and scalability
\end{enumerate}
\subsection{Future Work}
\begin{enumerate}
\item Substring capture
\item Unicode character classes
\item Nonregular regexp features: zero-width assertions, backreferences
\item Multicore for ruleset parallelism
\end{enumerate}


%\appendix
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

\acks

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.

% We recommend abbrvnat bibliography style.

\bibliographystyle{abbrvnat}

% The bibliography should be embedded for final submission.

\bibliography{reference}

%\begin{thebibliography}{}
%\softraggedright

%\bibitem[Smith et~al.(2009)Smith, Jones]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}


\end{document}