Ignore:
Timestamp:
May 18, 2013, 12:45:44 AM (6 years ago)
Author:
ksherdy
Message:

Added references. Cleaned up background and abstract.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/re-main.tex.backup

    r3149 r3154  
    4242We introduce a new bit-parallel scanning primitive, {\em Match Star}.
    4343which performs parallel Kleene closure over character classes
    44 and without eliminates backtracking. % expand and rephrase description of Match Star
     44and without backtracking. % expand and rephrase description of Match Star
    4545
    4646We evaluate the performance of our method in comparison with several widely known {\em grep}
     
    9999logic and shifting operations to compute further parallel bit streams of interest.
    100100
    101 We further increase the parallelism in our methods by introducing a new parallel
    102 scanning primitive which we have coined Match Star. Match Star returns all matches
    103 in a single operation and eliminates backtracking
    104 when a partially successful search path fails.
     101We further increase the potential parallelism in our approach by introducing a new parallel
     102scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star
    105103
    106104The remainder of this paper is organized as follows.
    107105Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
    108 Section~\ref{Background} presents background material on classical and state-of-the-art approaches
    109 to high performance regular expression matching. In addition, this section provides insight into the
    110 efficiency of traditional on-line pattern matching tools. Next,
    111 Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
     106Section~\ref{Background} presents background material on classical automata-based approaches
     107to regular expression matching as well as describes the efficient {\em grep} family utilities
     108and regular expression pattern matching software libraries.
     109Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
    112110Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
    113111Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
    114112presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
    115 Gnu grep, agrep, and nr-grep, and re2. Section~\ref{Conclusion} concludes the paper.
     113several software tools. Section~\ref{Conclusion} concludes the paper.
    116114
    117115The fundamental contribution of this paper is fully general approach to regular expression matching
    118 using bit-parallel data streams. The algorithmic aspects of this paper build upon
    119 the fundamental concepts of our previous work
    120 in some cases \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
    121  Individual contributions include:
     116using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
     117the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
     118Specific contributions include:
    122119\begin{itemize}
    123120\item compilation of regular expressions into unbounded bit-parallel data stream equations;
    124121\item documentation of character classes compilation into bit-parallel character class data streams;
    125 % \item methods to select optimal subpatterns;
    126 \item bit-parallel and backtrack-free scanning primitive termed Match Star;
    127 \item bit-parallel data stream support for unicode.
     122\item the Match Star parallel scanning primitive;
     123\item efficient support for unicode characters.
    128124\end{itemize}
    129125
    130126\section{Basic Concepts}
    131127\label{Basic Concepts}
    132 We define the notations and basic concepts used throughout this paper.
     128In this section, we define the notation and basic concepts used throughout this paper.
    133129
    134130\subsection{Notation}
    135131We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$
    136132be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
    137 be a text of length n defined over a finite alphabet $sigma$ of size $alpha$.
    138 The task of regular expression matching
    139 is to find all the text positions of T that follow an occurrence of P. We use
    140 C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$, $\ll{}$, and $\lgg{}$
    141 represent bitwise NOT, OR, AND, XOR, k-bit left shift and k-bit right shift, respectively.
     133be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$.
     134The goal of simple pattern expression matching is to find all the text positions of T that follow
     135an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
     136
     137We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
     138represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent
     139logical left shift, and logical right shift, respectively and are further modulated by
     140the number of bit positions a given value shall be shifted by, for example ``shift left by n''.
     141Vacant bit-positions are filled in with zeroes.
    142142
    143143\subsection{Regular Expressions}
    144144
    145 % TODO
     145% Define regular expressions (a recursive def), character classes, bounded repetition
    146146
    147147\section{Background}
     
    166166
    167167Thompson's original work marked the beginning of a long line of
    168 regular expression implementations that
    169 process an input string, character-at-a-time, and that transition patterm matching state
    170 based on the current state and the next character read.
    171 
    172 The Boyer-Moore family of algorithms \cite{boyer1977fast} , Horspool, ... skip characters
    173 
    174 Suffix automata (BDM)
    175 
    176 The ideas presented up to now aim at a good implementation of the automa-
    177 ton, but they must inspect all the text characters. In many cases, however, the
    178 regular expression involves sets of relatively long substrings that must appear
    179 for the regular expression to match.
     168regular expression pattern matching methods that
     169process an input string, character-at-a-time,
     170and that transition from state to state
     171according to the current state and the next input character.
     172
     173Whereas traditional automata techniques acheive
     174O(n) worst-case optimal efficiency, simple string matching algorithms,
     175such as the Boyer-Moore family of algorithms, skip input characters
     176to achieve sublinear times in the average case \cite{boyer1977fast}.
     177
     178Boyer-Moore methods, begin comparison from the end of the pattern instead of the
     179beginning and precompute skip information
     180to determine how far ahead a pattern search can skip in the input whenever
     181a non-matching character is encountered. Generally, Boyer-Moore family
     182algorithms improve faster as the pattern being searched for becomes longer.
     183In many cases, the techniques used to skip characters in simple string matching
     184approaches can be extended to regular expression matching.
     185Widely known techniques used to facilitate character skipping in regular
     186expression matching include necessary strings and backward suffix matching
     187inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
     188
    180189
    181190\subsection{Bit-parallel Simulation of Automata}
    182191
    183 Define bit-parallelism \cite{navarro2002flexible}
    184 
    185 Shift-Or \cite{}
    186 
    187 Backward Dawg Matching (BDM) \cite{navarro1998bit}
     192Define bit-parallelism \cite{Baeza-yates_anew}
     193
     194Shift-Or / Shift-And \cite{wu1992fast}
    188195
    189196Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
    190 
    191 Skip characters pattern length (occurence frequency and length).
    192 
    193 
    194 \section{Bit-parallel Data Streams}
    195 \label{Bit-parallel Data Streams}
    196 
    197 The bit-parallel data streams use the wide
    198 SIMD registers commonly found on commodity processors
    199 to process  byte positions at a time using
    200 bitwise logic, shifting and other operations.
    201 
    202 A signficant advantage of the bit-parallel
    203 data stream method over other
    204 pattern matching methods that rely on
    205 bit-parallel automata simulation
    206 is the potential to skip full register width
    207 number of characters in low occurence frequency text. % reword
    208 
    209 
    210 Skip characters register width.
    211197
    212198\subsection{Software Tools}
     
    231217%re2
    232218
     219
     220\section{Bit-parallel Data Streams}
     221\label{Bit-parallel Data Streams}
     222
     223The bit-parallel data streams use the wide
     224SIMD registers commonly found on commodity processors
     225to process  byte positions at a time using
     226bitwise logic, shifting and other operations.
     227
     228A signficant advantage of the bit-parallel
     229data stream method over other
     230pattern matching methods that rely on
     231bit-parallel automata simulation
     232is the potential to skip full register width
     233number of characters in low occurence frequency text. % reword
     234
     235
     236Skip characters register width.
     237
    233238\subsection{Match Star}
    234239
Note: See TracChangeset for help on using the changeset viewer.