# Changeset 3154 for docs/Working/re/re-main.tex.backup

Ignore:
Timestamp:
May 18, 2013, 12:45:44 AM (6 years ago)
Message:

Added references. Cleaned up background and abstract.

File:
1 edited

### Legend:

Unmodified
 r3149 We introduce a new bit-parallel scanning primitive, {\em Match Star}. which performs parallel Kleene closure over character classes and without eliminates backtracking. % expand and rephrase description of Match Star and without backtracking. % expand and rephrase description of Match Star We evaluate the performance of our method in comparison with several widely known {\em grep} logic and shifting operations to compute further parallel bit streams of interest. We further increase the parallelism in our methods by introducing a new parallel scanning primitive which we have coined Match Star. Match Star returns all matches in a single operation and eliminates backtracking when a partially successful search path fails. We further increase the potential parallelism in our approach by introducing a new parallel scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star The remainder of this paper is organized as follows. Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper. Section~\ref{Background} presents background material on classical and state-of-the-art approaches to high performance regular expression matching. In addition, this section provides insight into the efficiency of traditional on-line pattern matching tools. Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques. Section~\ref{Background} presents background material on classical automata-based approaches to regular expression matching as well as describes the efficient {\em grep} family utilities and regular expression pattern matching software libraries. Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques. Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} presents a detailed performance analysis of our data parallel bitstream techniques in comparison to Gnu grep, agrep, and nr-grep, and re2. Section~\ref{Conclusion} concludes the paper. several software tools. Section~\ref{Conclusion} concludes the paper. The fundamental contribution of this paper is fully general approach to regular expression matching using bit-parallel data streams. The algorithmic aspects of this paper build upon the fundamental concepts of our previous work in some cases \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. Individual contributions include: using bit-parallel data stream operations. The algorithmic aspects of this paper build upon the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. Specific contributions include: \begin{itemize} \item compilation of regular expressions into unbounded bit-parallel data stream equations; \item documentation of character classes compilation into bit-parallel character class data streams; % \item methods to select optimal subpatterns; \item bit-parallel and backtrack-free scanning primitive termed Match Star; \item bit-parallel data stream support for unicode. \item the Match Star parallel scanning primitive; \item efficient support for unicode characters. \end{itemize} \section{Basic Concepts} \label{Basic Concepts} We define the notations and basic concepts used throughout this paper. In this section, we define the notation and basic concepts used throughout this paper. \subsection{Notation} We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$ be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$ be a text of length n defined over a finite alphabet $sigma$ of size $alpha$. The task of regular expression matching is to find all the text positions of T that follow an occurrence of P. We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$, $\ll{}$, and $\lgg{}$ represent bitwise NOT, OR, AND, XOR, k-bit left shift and k-bit right shift, respectively. be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$. The goal of simple pattern expression matching is to find all the text positions of T that follow an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression. We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$ represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent logical left shift, and logical right shift, respectively and are further modulated by the number of bit positions a given value shall be shifted by, for example shift left by n''. Vacant bit-positions are filled in with zeroes. \subsection{Regular Expressions} % TODO % Define regular expressions (a recursive def), character classes, bounded repetition \section{Background} Thompson's original work marked the beginning of a long line of regular expression implementations that process an input string, character-at-a-time, and that transition patterm matching state based on the current state and the next character read. The Boyer-Moore family of algorithms \cite{boyer1977fast} , Horspool, ... skip characters Suffix automata (BDM) The ideas presented up to now aim at a good implementation of the automa- ton, but they must inspect all the text characters. In many cases, however, the regular expression involves sets of relatively long substrings that must appear for the regular expression to match. regular expression pattern matching methods that process an input string, character-at-a-time, and that transition from state to state according to the current state and the next input character. Whereas traditional automata techniques acheive O(n) worst-case optimal efficiency, simple string matching algorithms, such as the Boyer-Moore family of algorithms, skip input characters to achieve sublinear times in the average case \cite{boyer1977fast}. Boyer-Moore methods, begin comparison from the end of the pattern instead of the beginning and precompute skip information to determine how far ahead a pattern search can skip in the input whenever a non-matching character is encountered. Generally, Boyer-Moore family algorithms improve faster as the pattern being searched for becomes longer. In many cases, the techniques used to skip characters in simple string matching approaches can be extended to regular expression matching. Widely known techniques used to facilitate character skipping in regular expression matching include necessary strings and backward suffix matching inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}. \subsection{Bit-parallel Simulation of Automata} Define bit-parallelism \cite{navarro2002flexible} Shift-Or \cite{} Backward Dawg Matching (BDM) \cite{navarro1998bit} Define bit-parallelism \cite{Baeza-yates_anew} Shift-Or / Shift-And \cite{wu1992fast} Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm) Skip characters pattern length (occurence frequency and length). \section{Bit-parallel Data Streams} \label{Bit-parallel Data Streams} The bit-parallel data streams use the wide SIMD registers commonly found on commodity processors to process  byte positions at a time using bitwise logic, shifting and other operations. A signficant advantage of the bit-parallel data stream method over other pattern matching methods that rely on bit-parallel automata simulation is the potential to skip full register width number of characters in low occurence frequency text. % reword Skip characters register width. \subsection{Software Tools} %re2 \section{Bit-parallel Data Streams} \label{Bit-parallel Data Streams} The bit-parallel data streams use the wide SIMD registers commonly found on commodity processors to process  byte positions at a time using bitwise logic, shifting and other operations. A signficant advantage of the bit-parallel data stream method over other pattern matching methods that rely on bit-parallel automata simulation is the potential to skip full register width number of characters in low occurence frequency text. % reword Skip characters register width. \subsection{Match Star}