source: docs/Working/re/re-main.tex.backup @ 3491

Last change on this file since 3491 was 3154, checked in by ksherdy, 6 years ago

Added references. Cleaned up background and abstract.

File size: 12.2 KB
[3149]5\title{Fast Regular Expression Matching with Bit-parallel Data Streams}
7{Robert D. Cameron} \\
9{Kenneth S. Herdy} \\
11{Ben Hull} \\
13{Thomas C. Shermer} \\
14\\School of Computing Science
15\\Simon Fraser University
[3149]24An parallel regular expression matching pattern method is
25introduced and compared with the state of the art in software tools designed for
26efficient on-line search. %such as the {\em grep} family pattern matching tools.
[3138]27The method is based on the concept of bit-parallel data streams,
28in which parallel streams of bits are formed such
[3123]29that each stream comprises bits in one-to-one correspondence with the
[3138]30character code units of a source data stream.
[3138]32An implementation of the method in the form of a regular expression
[3149]33compiler is discussed. The compiler accepts a regular expression and
34forms unbounded bit-parallel data stream operations. Bit-parallel operations
35are then transformed into a low-level C-based implementation for compilation into native
36pattern matching applications. These low-level C-based implementations take advantage of
[3138]37the SIMD (single-instruction multiple-data) capabilities of commodity
[3149]38processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
39On processors supporting W-bit addition operations, the method processes W source characters
[3123]40in parallel and performs up to W finite state transitions per clock cycle.
[3149]42We introduce a new bit-parallel scanning primitive, {\em Match Star}.
43which performs parallel Kleene closure over character classes
[3154]44and without backtracking. % expand and rephrase description of Match Star
[3149]46We evaluate the performance of our method in comparison with several widely known {\em grep}
47family implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep},
48and regular expression engines such as {\em Google's re2}.
49Performance results are analyzed using the performance monitoring counters
50of commodity hardware. Overall, our results demonstrate a dramatic speed-up
51over publically available alternatives.
[3126]59Regular expresssion matching is an extensively studied problem with application to
[3149]60text processing and bioinformatics and with numerous algorithms
61and software tools developed to the address the particular
62processing demands. % reword
[3126]64The pattern matching problem can be
65stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
66find all the text positions of T that start an occurrence of P. Alternatively,
67one may want all the final positions of occurrences. Some applications require
68slightly different output such as the line that matches the pattern.
[3149]70A pattern P can be a simple string, but it can also be a regular expression.
71A regular expression, is an expression that specifies a set of strings
72and is composed of (i) simple strings and (ii) the
[3126]73union, concatenation and Kleene closure of other regular expressions.
74To avoid parentheses it is assumed that the Kleene star has the highest priority,
75next concatenation and then alternation, however, most formalisms provides grouping
76operators to allow the definition of scope and operator precedence.
77Readers unfamiliar with the concept of regular expression matching are referred
78classical texts such as \cite{aho2007}.
80Regular expression matching is commonly performed using a wide variety of
81publically available software tools for on-line pattern matching. For instance,
82UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
83expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
84agrep, and nrgrep are widely known and considered as
85the fastest regular expression matching tools in practice \cite{navarro2000}.
86and are of particular interest to this study.
88% simple patterns, extended patterns, regular expressions
[3124]90% motivation / previous work
[3126]91Although tradi finite state machine methods used in the scanning and parsing of
[3124]92text streams is considered to be the hardest of the “13 dwarves” to parallelize
93[1], parallel bitstream technology shows considerable promise for these types of
94applications [3, 4]. In this approach, character streams are processed W positions
95at a time using the W-bit SIMD registers commonly found on commodity processors
96(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
97first slicing the byte streams into eight separate basis bitstreams, one for each bit
98position within the byte. These basis bitstreams are then combined with bitwise
99logic and shifting operations to compute further parallel bit streams of interest.
[3154]101We further increase the potential parallelism in our approach by introducing a new parallel
102scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star
[3126]104The remainder of this paper is organized as follows.
[3149]105Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
[3154]106Section~\ref{Background} presents background material on classical automata-based approaches 
107to regular expression matching as well as describes the efficient {\em grep} family utilities
108and regular expression pattern matching software libraries.
109Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
[3138]110Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
[3126]111Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
[3149]112presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
[3154]113several software tools. Section~\ref{Conclusion} concludes the paper.
[3149]115The fundamental contribution of this paper is fully general approach to regular expression matching
[3154]116using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
117the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
118Specific contributions include:
120\item compilation of regular expressions into unbounded bit-parallel data stream equations;
121\item documentation of character classes compilation into bit-parallel character class data streams;
[3154]122\item the Match Star parallel scanning primitive;
123\item efficient support for unicode characters.
126\section{Basic Concepts}
127\label{Basic Concepts}
[3154]128In this section, we define the notation and basic concepts used throughout this paper.
131We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$
132be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
[3154]133be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$.
134The goal of simple pattern expression matching is to find all the text positions of T that follow
135an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
[3154]137We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
138represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent
139logical left shift, and logical right shift, respectively and are further modulated by
140the number of bit positions a given value shall be shifted by, for example ``shift left by n''.
141Vacant bit-positions are filled in with zeroes.
[3149]143\subsection{Regular Expressions}
[3154]145% Define regular expressions (a recursive def), character classes, bounded repetition
[3149]151\subsection{Classical Methods}
153\subsection{Regular Expression and Finite Automata}
[3138]155The origins of regular expression matching date back to automata theory
156and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
157Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
158to nondeterministic finite automata (NFA) for regular expression matching.
159Following Thompson's approach, a regular expression of length m is first converted
160to an NFA with O(m) nodes. It is possible to search a text of length n using the
[3149]161NFA directly in O(mn) worst case time. Often, a more efficient choice
[3138]162is to convert the NFA into a DFA. A DFA has only a single active state and
163allows to search the text at O(n) worst-case optimal. It is well known that the
164conversion of the NFA to the DFA may result in the state explosion problem.
165That is the resultant DFA may have O($2^m$) states.
[3138]167Thompson's original work marked the beginning of a long line of
[3154]168regular expression pattern matching methods that
169process an input string, character-at-a-time,
170and that transition from state to state
171according to the current state and the next input character.
[3154]173Whereas traditional automata techniques acheive
174O(n) worst-case optimal efficiency, simple string matching algorithms,
175such as the Boyer-Moore family of algorithms, skip input characters
176to achieve sublinear times in the average case \cite{boyer1977fast}.
[3154]178Boyer-Moore methods, begin comparison from the end of the pattern instead of the
179beginning and precompute skip information
180to determine how far ahead a pattern search can skip in the input whenever
181a non-matching character is encountered. Generally, Boyer-Moore family
182algorithms improve faster as the pattern being searched for becomes longer.
183In many cases, the techniques used to skip characters in simple string matching
184approaches can be extended to regular expression matching.
185Widely known techniques used to facilitate character skipping in regular
186expression matching include necessary strings and backward suffix matching
187inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
190\subsection{Bit-parallel Simulation of Automata}
[3154]192Define bit-parallelism \cite{Baeza-yates_anew}
[3154]194Shift-Or / Shift-And \cite{wu1992fast}
196Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
198\subsection{Software Tools}
[3149]200%Thompson created the first grep (UNIX grep) as a standalone adaptation
201%of the regular expression parser he had written for the UNIX ed utility.
202%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
203%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
204%time it took to build a complete finite automaton for the regular expression before it could even start searching.
205%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
206%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
207%It took zero set-up time and just one additional test in the inner loop.
210%Given a regular expression R and a test T the regular expression matching
211%problem finds all ending position of substrings in Q that matches a string in
212%the language denoted by R.
213%The behaviour of Gnu grep, agrep, and nr-grep are differ ...
214%Gnu grep
220\section{Bit-parallel Data Streams}
221\label{Bit-parallel Data Streams}
223The bit-parallel data streams use the wide
224SIMD registers commonly found on commodity processors
225to process  byte positions at a time using
226bitwise logic, shifting and other operations.
228A signficant advantage of the bit-parallel
229data stream method over other
230pattern matching methods that rely on
231bit-parallel automata simulation
232is the potential to skip full register width
233number of characters in low occurence frequency text. % reword
236Skip characters register width.
[3149]238\subsection{Match Star}
241Backtracking is a general algorithm for finding all solutions to some computational problem,
242that incrementally builds candidates to the solutions.
[3126]244\section{Compiler Technology}
245\label{Compiler Technology}
[3149]251We compare the performance of our bit-parallel data stream techniques against
252Gnu grep, agrep, nr-grep, and re2.
[3123]256\section{Experimental Results}
[3135]257\label{Experimental Results}
265  \bibliographystyle{acm}
266  \bibliography{reference}
Note: See TracBrowser for help on using the repository browser.