Changeset 3154 for docs/Working/re/remain.tex.backup
 Timestamp:
 May 18, 2013, 12:45:44 AM (6 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/Working/re/remain.tex.backup
r3149 r3154 42 42 We introduce a new bitparallel scanning primitive, {\em Match Star}. 43 43 which performs parallel Kleene closure over character classes 44 and without eliminatesbacktracking. % expand and rephrase description of Match Star44 and without backtracking. % expand and rephrase description of Match Star 45 45 46 46 We evaluate the performance of our method in comparison with several widely known {\em grep} … … 99 99 logic and shifting operations to compute further parallel bit streams of interest. 100 100 101 We further increase the parallelism in our methods by introducing a new parallel 102 scanning primitive which we have coined Match Star. Match Star returns all matches 103 in a single operation and eliminates backtracking 104 when a partially successful search path fails. 101 We further increase the potential parallelism in our approach by introducing a new parallel 102 scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star 105 103 106 104 The remainder of this paper is organized as follows. 107 105 Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper. 108 Section~\ref{Background} presents background material on classical a nd stateoftheart approaches109 to high performance regular expression matching. In addition, this section provides insight into the110 efficiency of traditional online pattern matching tools. Next, 111 Section~\ref{Bitparallel Data Streams} describes our parallel regular expression matching techniques.106 Section~\ref{Background} presents background material on classical automatabased approaches 107 to regular expression matching as well as describes the efficient {\em grep} family utilities 108 and regular expression pattern matching software libraries. 109 Next, Section~\ref{Bitparallel Data Streams} describes our parallel regular expression matching techniques. 112 110 Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. 113 111 Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} 114 112 presents a detailed performance analysis of our data parallel bitstream techniques in comparison to 115 Gnu grep, agrep, and nrgrep, and re2. Section~\ref{Conclusion} concludes the paper.113 several software tools. Section~\ref{Conclusion} concludes the paper. 116 114 117 115 The fundamental contribution of this paper is fully general approach to regular expression matching 118 using bitparallel data streams. The algorithmic aspects of this paper build upon 119 the fundamental concepts of our previous work 120 in some cases \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. 121 Individual contributions include: 116 using bitparallel data stream operations. The algorithmic aspects of this paper build upon 117 the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. 118 Specific contributions include: 122 119 \begin{itemize} 123 120 \item compilation of regular expressions into unbounded bitparallel data stream equations; 124 121 \item documentation of character classes compilation into bitparallel character class data streams; 125 % \item methods to select optimal subpatterns; 126 \item bitparallel and backtrackfree scanning primitive termed Match Star; 127 \item bitparallel data stream support for unicode. 122 \item the Match Star parallel scanning primitive; 123 \item efficient support for unicode characters. 128 124 \end{itemize} 129 125 130 126 \section{Basic Concepts} 131 127 \label{Basic Concepts} 132 We define the notationsand basic concepts used throughout this paper.128 In this section, we define the notation and basic concepts used throughout this paper. 133 129 134 130 \subsection{Notation} 135 131 We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$ 136 132 be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$ 137 be a text of length n defined over a finite alphabet $sigma$ of size $alpha$. 138 The task of regular expression matching 139 is to find all the text positions of T that follow an occurrence of P. We use 140 C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$, $\ll{}$, and $\lgg{}$ 141 represent bitwise NOT, OR, AND, XOR, kbit left shift and kbit right shift, respectively. 133 be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$. 134 The goal of simple pattern expression matching is to find all the text positions of T that follow 135 an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression. 136 137 We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$ 138 represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent 139 logical left shift, and logical right shift, respectively and are further modulated by 140 the number of bit positions a given value shall be shifted by, for example ``shift left by n''. 141 Vacant bitpositions are filled in with zeroes. 142 142 143 143 \subsection{Regular Expressions} 144 144 145 % TODO145 % Define regular expressions (a recursive def), character classes, bounded repetition 146 146 147 147 \section{Background} … … 166 166 167 167 Thompson's original work marked the beginning of a long line of 168 regular expression implementations that 169 process an input string, characteratatime, and that transition patterm matching state 170 based on the current state and the next character read. 171 172 The BoyerMoore family of algorithms \cite{boyer1977fast} , Horspool, ... skip characters 173 174 Suffix automata (BDM) 175 176 The ideas presented up to now aim at a good implementation of the automa 177 ton, but they must inspect all the text characters. In many cases, however, the 178 regular expression involves sets of relatively long substrings that must appear 179 for the regular expression to match. 168 regular expression pattern matching methods that 169 process an input string, characteratatime, 170 and that transition from state to state 171 according to the current state and the next input character. 172 173 Whereas traditional automata techniques acheive 174 O(n) worstcase optimal efficiency, simple string matching algorithms, 175 such as the BoyerMoore family of algorithms, skip input characters 176 to achieve sublinear times in the average case \cite{boyer1977fast}. 177 178 BoyerMoore methods, begin comparison from the end of the pattern instead of the 179 beginning and precompute skip information 180 to determine how far ahead a pattern search can skip in the input whenever 181 a nonmatching character is encountered. Generally, BoyerMoore family 182 algorithms improve faster as the pattern being searched for becomes longer. 183 In many cases, the techniques used to skip characters in simple string matching 184 approaches can be extended to regular expression matching. 185 Widely known techniques used to facilitate character skipping in regular 186 expression matching include necessary strings and backward suffix matching 187 inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}. 188 180 189 181 190 \subsection{Bitparallel Simulation of Automata} 182 191 183 Define bitparallelism \cite{navarro2002flexible} 184 185 ShiftOr \cite{} 186 187 Backward Dawg Matching (BDM) \cite{navarro1998bit} 192 Define bitparallelism \cite{Baezayates_anew} 193 194 ShiftOr / ShiftAnd \cite{wu1992fast} 188 195 189 196 Bitparallel suffix automata (Backward NonDeterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm) 190 191 Skip characters pattern length (occurence frequency and length).192 193 194 \section{Bitparallel Data Streams}195 \label{Bitparallel Data Streams}196 197 The bitparallel data streams use the wide198 SIMD registers commonly found on commodity processors199 to process byte positions at a time using200 bitwise logic, shifting and other operations.201 202 A signficant advantage of the bitparallel203 data stream method over other204 pattern matching methods that rely on205 bitparallel automata simulation206 is the potential to skip full register width207 number of characters in low occurence frequency text. % reword208 209 210 Skip characters register width.211 197 212 198 \subsection{Software Tools} … … 231 217 %re2 232 218 219 220 \section{Bitparallel Data Streams} 221 \label{Bitparallel Data Streams} 222 223 The bitparallel data streams use the wide 224 SIMD registers commonly found on commodity processors 225 to process byte positions at a time using 226 bitwise logic, shifting and other operations. 227 228 A signficant advantage of the bitparallel 229 data stream method over other 230 pattern matching methods that rely on 231 bitparallel automata simulation 232 is the potential to skip full register width 233 number of characters in low occurence frequency text. % reword 234 235 236 Skip characters register width. 237 233 238 \subsection{Match Star} 234 239
Note: See TracChangeset
for help on using the changeset viewer.