\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
%opening
\title{Fast Regular Expression Matching with Bit-parallel Data Streams}
\author{Robert D. Cameron, Kenneth S. Herdy, Ben Hull and Thomas C. Shermer \\
School of Computing Science \\
Simon Fraser University
}
\begin{document}
\date{}
\maketitle
\begin{abstract}
A parallel regular expression matching method is
introduced and compared with the state of the art in software tools designed for
efficient on-line search. %such as the {\em grep} family pattern matching tools.
The method is based on the concept of bit-parallel data streams,
in which parallel streams of bits are formed such
that each stream comprises bits in one-to-one correspondence with the
character code units of a source data stream.
An implementation of the method in the form of a regular expression
compiler is discussed. The compiler accepts a regular expression and
forms unbounded bit-parallel data stream operations. Bit-parallel operations
are then transformed into a low-level C-based implementation for compilation into native
pattern matching applications. These low-level C-based implementations take advantage of
the SIMD (single-instruction multiple-data) capabilities of commodity
processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
On processors supporting W-bit addition operations, the method processes W source characters
in parallel and performs up to W finite state transitions per clock cycle.
We introduce a new bit-parallel scanning primitive, {\em Match Star},
which performs parallel Kleene closure over character classes
without backtracking.
We evaluate the performance of our method in comparison with several widely known {\em grep}
family implementations, {\em Gnu grep}, {\em agrep}, {\em nr-grep},
and regular expression engines such as Google's {\em re2}.
Performance results are analyzed using the performance monitoring counters
of commodity hardware. Overall, our results demonstrate a dramatic speed-up
over publicly available alternatives.
\end{abstract}
\section{Introduction}
\label{Introduction}
%\input{introduction.tex}
Regular expression matching is an extensively studied problem with applications in
text processing and bioinformatics, and numerous algorithms
and software tools have been developed to address its particular
processing demands.
The pattern matching problem can be
stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
find all the text positions of T that start an occurrence of P. Alternatively,
one may want all the final positions of occurrences. Some applications require
slightly different output such as the line that matches the pattern.
A pattern P can be a simple string, but it can also be a regular expression.
A regular expression is an expression that specifies a set of strings
and is composed of (i) simple strings and (ii) the
union, concatenation and Kleene closure of other regular expressions.
To avoid parentheses, it is assumed that the Kleene star has the highest priority,
followed by concatenation and then alternation; however, most formalisms provide grouping
operators to allow the definition of scope and operator precedence.
Readers unfamiliar with the concept of regular expression matching are referred to
classical texts such as \cite{aho2007}.
Regular expression matching is commonly performed using a wide variety of
publicly available software tools for on-line pattern matching, for instance,
UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
expressions \cite{abou-assaleh2004}. Amongst these tools, Gnu grep (egrep),
agrep, and nrgrep are widely known, considered
the fastest regular expression matching tools in practice \cite{navarro2000},
and are of particular interest to this study.
% simple patterns, extended patterns, regular expressions
% motivation / previous work
Although traditional finite state machine methods used in the scanning and parsing of
text streams are considered to be the hardest of the “13 dwarves” to parallelize
[1], parallel bitstream technology shows considerable promise for these types of
applications [3, 4]. In this approach, character streams are processed W positions
at a time using the W-bit SIMD registers commonly found on commodity processors
(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
first slicing the byte streams into eight separate basis bitstreams, one for each bit
position within the byte. These basis bitstreams are then combined with bitwise
logic and shifting operations to compute further parallel bit streams of interest.
We further increase the potential parallelism in our approach by introducing a new parallel
scanning primitive which we have termed {\em Match Star}.
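As a concrete illustration of the slicing step, the following scalar C sketch computes the eight basis bitstreams for a block of up to 64 input bytes. The function name and block size are illustrative assumptions; an actual implementation performs this transposition blockwise with SIMD pack and merge operations rather than bit by bit.

```c
#include <stdint.h>
#include <stddef.h>

/* Serial reference for the slicing step: bit k of input byte data[i]
   becomes bit i of basis bitstream k.  This scalar version handles a
   single 64-byte block; SIMD implementations compute the same result
   blockwise. */
void transpose_to_basis(const uint8_t *data, size_t n, uint64_t basis[8]) {
    for (int k = 0; k < 8; k++)
        basis[k] = 0;
    for (size_t i = 0; i < n && i < 64; i++)
        for (int k = 0; k < 8; k++)
            if (data[i] & (1u << (7 - k)))       /* stream 0 holds the high bit */
                basis[k] |= (uint64_t)1 << i;    /* position i of stream k */
}
```

All subsequent parallel bit streams of interest are then computed from these eight basis streams with bitwise logic and shifts.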
The remainder of this paper is organized as follows.
Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
Section~\ref{Background} presents background material on classical automata-based approaches
to regular expression matching as well as describes the efficient {\em grep} family utilities
and regular expression pattern matching software libraries.
Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
several software tools. Section~\ref{Conclusion} concludes the paper.
The fundamental contribution of this paper is a fully general approach to regular expression matching
using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
Specific contributions include:
\begin{itemize}
\item compilation of regular expressions into unbounded bit-parallel data stream equations;
\item a method for compiling character classes into bit-parallel character class data streams;
\item the Match Star parallel scanning primitive;
\item efficient support for Unicode characters.
\end{itemize}
\section{Basic Concepts}
\label{Basic Concepts}
In this section, we define the notation and basic concepts used throughout this paper.
\subsection{Notation}
We use the following notation. Let $P=p_{1}p_{2}\ldots{}p_{m}$
be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
be a text of length n, both defined over a finite alphabet $\Sigma$ of size $\sigma$.
The goal of simple pattern matching is to find all the text positions of T at which
an occurrence of P ends. P may be a simple pattern, extended pattern, or a regular expression.
We use C-like notation for bitwise operations: $\lnot{}$, $\lor{}$, $\land{}$, and $\oplus{}$
represent bitwise NOT, OR, AND, and XOR, respectively. The operators $\ll{}k$ and $\gg{}k$ represent
logical left shift and logical right shift by $k$ bit positions, respectively.
Vacant bit positions are filled in with zeroes.
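As a small illustration of the shift notation in stream terms, a logical left shift advances every marker bit by the given number of positions, with zero fill from the vacated end (the function name here is ours, for illustration only):

```c
#include <stdint.h>

/* Logical left shift on a 64-bit stream: each set bit moves n positions
   forward; vacated positions are filled with zeroes.  Bits shifted past
   the word boundary are discarded. */
uint64_t advance(uint64_t stream, int n) {
    return stream << n;
}
```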
\subsection{Regular Expressions}
% Define regular expressions (a recursive def), character classes, bounded repetition
\section{Background}
\label{Background}
%\input{background.tex}
\subsection{Classical Methods}
\subsection{Regular Expression and Finite Automata}
The origins of regular expression matching date back to automata theory
and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA) for regular expression matching.
Following Thompson's approach, a regular expression of length m is first converted
to an NFA with O(m) nodes. It is possible to search a text of length n using the
NFA directly in O(mn) worst case time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state and
allows the text to be searched in O(n) worst-case optimal time. It is well known that the
conversion of the NFA to the DFA may result in the state explosion problem;
that is, the resultant DFA may have O($2^m$) states.
Thompson's original work marked the beginning of a long line of
regular expression pattern matching methods that
process an input string, character-at-a-time,
and that transition from state to state
according to the current state and the next input character.
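The character-at-a-time model can be sketched as a table-driven DFA scan. The toy two-state automaton below (our own example, accepting strings over $\{a, b\}$ that end in b) performs exactly one state transition per input character:

```c
#include <stddef.h>

/* Toy table-driven DFA over the alphabet {a, b}, accepting strings that
   end in 'b'.  delta[state][symbol] gives the next state; the scan
   performs exactly one transition per input character. */
static const int delta[2][2] = {
    { 0, 1 },   /* from state 0: 'a' -> 0, 'b' -> 1 */
    { 0, 1 }    /* from state 1: 'a' -> 0, 'b' -> 1 */
};

int dfa_accepts(const char *text, size_t n) {
    int state = 0;
    for (size_t i = 0; i < n; i++)
        state = delta[state][text[i] == 'b'];  /* map character to symbol index */
    return state == 1;                         /* state 1 is accepting */
}
```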
Whereas traditional automata techniques achieve
O(n) worst-case optimal efficiency, simple string matching algorithms,
such as the Boyer-Moore family of algorithms, skip input characters
to achieve sublinear times in the average case \cite{boyer1977fast}.
Boyer-Moore methods begin comparison from the end of the pattern instead of the
beginning and precompute skip information
to determine how far ahead a pattern search can skip in the input whenever
a non-matching character is encountered. Generally, Boyer-Moore family
algorithms perform better as the pattern being searched for becomes longer,
since longer patterns permit longer skips.
In many cases, the techniques used to skip characters in simple string matching
approaches can be extended to regular expression matching.
Widely known techniques used to facilitate character skipping in regular
expression matching include necessary strings and backward suffix matching
inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
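A minimal sketch of the precomputed-skip idea, in the style of the Boyer-Moore-Horspool variant (the function name and simplifications are ours): a table indexed by the last character of the current text window tells how far the window may safely advance.

```c
#include <stddef.h>
#include <string.h>

/* Horspool-style search sketch: skip[c] is the safe shift when the last
   character of the window is c.  Returns the index of the first
   occurrence of pat in text, or -1 if there is none. */
long bmh_search(const char *text, size_t n, const char *pat, size_t m) {
    if (m == 0 || m > n)
        return -1;
    size_t skip[256];
    for (size_t c = 0; c < 256; c++)
        skip[c] = m;                              /* default: skip the whole pattern */
    for (size_t i = 0; i + 1 < m; i++)
        skip[(unsigned char)pat[i]] = m - 1 - i;  /* distance from rightmost use */
    for (size_t i = 0; i + m <= n; i += skip[(unsigned char)text[i + m - 1]])
        if (memcmp(text + i, pat, m) == 0)
            return (long)i;
    return -1;
}
```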
\subsection{Bit-parallel Simulation of Automata}
Bit-parallelism \cite{Baeza-yates_anew} exploits the intrinsic parallelism of bitwise
operations within a machine word to simulate a nondeterministic automaton, packing one
automaton state per bit position. The Shift-Or and Shift-And algorithms \cite{wu1992fast}
update all automaton states with a constant number of word operations per input character.
Bit-parallel suffix automata, as used in the Backward Non-Deterministic Dawg Matching
(BNDM) algorithm \cite{navarro1998bit}, combine bit-parallel simulation with the
character-skipping ability of backward matching.
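A minimal sketch of the Shift-And simulation for a simple string pattern (names are illustrative; the pattern length is limited to the machine word width):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Shift-And simulation of the pattern NFA: bit i of D is set when
   pat[0..i] matches a suffix of the text read so far, so one shift and
   one AND advance every active state at once.  Returns the number of
   occurrences of pat in text (pattern length at most 64). */
int shift_and_count(const char *text, size_t n, const char *pat) {
    size_t m = strlen(pat);
    if (m == 0 || m > 64)
        return 0;
    uint64_t B[256] = { 0 };
    for (size_t i = 0; i < m; i++)
        B[(unsigned char)pat[i]] |= (uint64_t)1 << i;  /* bit i set iff pat[i] == c */
    uint64_t D = 0, accept = (uint64_t)1 << (m - 1);
    int count = 0;
    for (size_t j = 0; j < n; j++) {
        D = ((D << 1) | 1) & B[(unsigned char)text[j]];
        if (D & accept)
            count++;                                   /* an occurrence ends at j */
    }
    return count;
}
```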
\subsection{Software Tools}
%Thompson created the first grep (UNIX grep) as a standalone adaptation
%of the regular expression parser he had written for the UNIX ed utility.
%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
%time it took to build a complete finite automaton for the regular expression before it could even start searching.
%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
%It took zero set-up time and just one additional test in the inner loop.
%http://pages.cs.wisc.edu/~mdant/cs520_4.html
%Given a regular expression R and a test T the regular expression matching
%problem finds all ending position of substrings in Q that matches a string in
%the language denoted by R.
%The behaviour of Gnu grep, agrep, and nr-grep are differ ...
%Gnu grep
%agrep
%nr-grep
%re2
\section{Bit-parallel Data Streams}
\label{Bit-parallel Data Streams}
The bit-parallel data stream method uses the wide
SIMD registers commonly found on commodity processors
to process W byte positions at a time using
bitwise logic, shifting and other operations.
A significant advantage of the bit-parallel
data stream method over other
pattern matching methods that rely on
bit-parallel automata simulation
is the potential to skip up to a full register width
of characters at a time when the characters of interest
occur infrequently in the input text.
\subsection{Match Star}
Backtracking approaches to regular expression matching incrementally build candidate
matches and, upon failure, undo previous choices and retry alternatives; in the worst
case this leads to exponential running time. The Match Star primitive instead performs
Kleene closure over a character class in parallel across all marker positions,
without any backtracking.
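One way such a primitive can be realized with only bitwise logic and addition is sketched below: carry propagation through an addition extends every marker through its entire run of class characters in a single operation. This sketch assumes 64-bit streams with bit i corresponding to text position i, and is an illustration rather than the full SIMD implementation.

```c
#include <stdint.h>

/* Sketch of a MatchStar-style operation: M is a marker stream and C a
   character class stream (bit i set iff the character at position i
   belongs to the class).  Adding C to the markers that lie inside class
   runs lets carries propagate through each run, marking every position
   reachable by Kleene closure over the class, with no backtracking. */
uint64_t match_star(uint64_t M, uint64_t C) {
    return (((M & C) + C) ^ C) | M;
}
```

For example, a marker at position 1 with class positions 1 through 3 yields markers at positions 1 through 4: the marker is carried through the run and deposited just past its end.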
\section{Compiler Technology}
\label{Compiler Technology}
\section{Methodology}
\label{Methodology}
%\input{methodology.tex}
We compare the performance of our bit-parallel data stream techniques against
Gnu grep, agrep, nr-grep, and re2.
\section{Experimental Results}
\label{Experimental Results}
%\input{results.tex}
\section{Conclusion}
\label{Conclusion}
%\input{conclusion.tex}
{
\bibliographystyle{acm}
\bibliography{reference}
}
\end{document}