\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\def \Bitstream{Bit Stream}
\def \bitstream{bit stream}
%opening
\title{Fast Regular Expression Matching using Parallel \Bitstream{}s}
\author{
{Robert D. Cameron} \\
\and
{Kenneth S. Herdy} \\
\and
{Ben Hull} \\
\and
{Thomas C. Shermer} \\
\\School of Computing Science
\\Simon Fraser University
}
\begin{document}
\date{}
\maketitle
\begin{abstract}
A parallel regular expression matching method is introduced and studied in
application to the problem of online pattern matching.
The method is based on the concept of parallel
\bitstream{} technology, in which parallel streams of bits are formed such
that each stream comprises bits in one-to-one correspondence with the
character code units of a source data stream.
On processors supporting W-bit addition operations, the method processes W source characters
in parallel and performs up to W finite state transitions per clock cycle.
Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
\end{abstract}
\section{Introduction}
\label{Introduction}
%\input{introduction.tex}
Regular expresssion matching is an extensively studied problem with application to
numerous application domains. A multitude
of algorithms and software tools have been developed to the address the particular
demands of the various application domains.
The pattern matching problem can be
stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
find all the text positions of T that start an occurrence of P. Alternatively,
one may want all the final positions of occurrences. Some applications require
slightly different output such as the line that matches the pattern.
A pattern P can be a simple string, but it can also be, a regular expression.
A regular expression, is an expression that specifies a set of strings.
A regular expression is composed of (i) simple strings and (ii) the
union, concatenation and Kleene closure of other regular expressions.
To avoid parentheses it is assumed that the Kleene star has the highest priority,
next concatenation and then alternation, however, most formalisms provides grouping
operators to allow the definition of scope and operator precedence.
Readers unfamiliar with the concept of regular expression matching are referred
classical texts such as \cite{aho2007}.
Regular expression matching is commonly performed using a wide variety of
publically available software tools for on-line pattern matching. For instance,
UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
agrep, and nrgrep are widely known and considered as
the fastest regular expression matching tools in practice \cite{navarro2000}.
and are of particular interest to this study.
% simple patterns, extended patterns, regular expressions
% motivation / previous work
Although tradi finite state machine methods used in the scanning and parsing of
text streams is considered to be the hardest of the “13 dwarves” to parallelize
[1], parallel bitstream technology shows considerable promise for these types of
applications [3, 4]. In this approach, character streams are processed W positions
at a time using the W-bit SIMD registers commonly found on commodity processors
(e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
first slicing the byte streams into eight separate basis bitstreams, one for each bit
position within the byte. These basis bitstreams are then combined with bitwise
logic and shifting operations to compute further parallel bit streams of interest.
We further increase the parallelism in our methods by introducing a new parallel
scanning primitive which we have coined Match Star. Match Star returns all matches
in a single operation and eliminates backtracking
when a partially successful search path fails.
The remainder of this paper is organized as follows.
Section~\ref{Background} presents background material on classic
regular expression pattern matching techniques and provides insight into the
efficiency of traditional regular expression software tools.
Section~\ref{Bitwise Parallel Data Streams} describes out data parallel
regular expression matching techniques.
Section~\ref{Compiler Technology}
Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
presents a detailed performance analysis of our data parallel \bitstream{} techniques against
Gnu grep, agrep, and nr-grep.
Section~\ref{conclusion} concludes the paper.
\section{Background}
\label{Background}
%\input{background.tex}
% Background
% History
Historically, the origins of regular expression matching date back to automata theory
and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
In 1959, Dana and Scott demonstrated that
NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA
state corresponds to a set of NFA states.
Thompson, in 1968, is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA) for regular expression matching.
Thompson’s publication \cite{thompson1968} marked the beginning of a long line of
regular expression implementations that construct automata for pattern matching.
The traditional technique [16] to search a regular expression of length m in
a text of length n is to first convert the expression into a non-deterministic
automaton (NFA) with O(m) nodes. It is possible to search the text using the
NFA directly in O(mn) worst case time. The cost comes from the fact that
more than one state of the NFA may be active at each step, and therefore all
may need to be updated. A more effcient choice is to convert the NFA into a
DFA. A DFA has only a single active state and allows to search the text at
O(n) worst-case optimal. The problem with this approach is that the DFA
may have O(2^m) states.
In general, the general process is first to build a
NFA from the regular expression and simulate the NFA on text input,
or alternatively to convert the NFA into a
DFA, optionally minimize the number of states in the DFA,
and finally simulate the DFA on text input.
\section{Parallel Bitwise Data Streams}
\label{Parallel Bitwise Data Streams}
\section{Compiler Technology}
\label{Compiler Technology}
\section{Methodology}
\label{Methodology}
%\input{methodology.tex}
We compare the performance of our parallel \bitstream{} techniques against
Gnu grep, agrep, and nr-grep.
Given a regular expression R and a test T the regular expression matching
problem finds all ending position of substrings in Q that matches a string in
the language denoted by R.
The behaviour of Gnu grep, agrep, and nr-grep are differ in that
Gnu grep
agrep
nr-grep
\section{Experimental Results}
\label{results}
%\input{results.tex}
\section{Conclusion and Future Work}
\label{conclusion}
%\input{conclusion.tex}
{
\bibliographystyle{acm}
\bibliography{reference}
}
\end{document}