Changeset 3138
 Timestamp:
 May 14, 2013, 2:38:17 PM (6 years ago)
 Location:
 docs/Working/re
 Files:

 2 edited
Legend:
 Unmodified
 Added
 Removed

docs/Working/re/remain.tex
r3135 r3138 23 23 24 24 A parallel regular expression matching method is introduced and studied in 25 application to the problem of online pattern matching. 26 27 The method is based on the concept of parallel 28 bitstream technology, in which parallel streams of bits are formed such 25 comparison with software tools designed for efficient online text searching such 26 as the {\em grep} family. 27 28 The method is based on the concept of bitparallel data streams, 29 in which parallel streams of bits are formed such 29 30 that each stream comprises bits in onetoone correspondence with the 30 character code units of a source data stream. 31 32 On processors supporting Wbit addition operations, the method processes W source characters 31 character code units of a source data stream. 32 33 An implementation of the method in the form of a regular expression 34 compiler is discussed. The compiler accepts a regular expression and produces 35 sequences of arbitrarylength bitparallel equations. 36 Bitparallel equations are then transformed 37 into a lowlevel Cbased implementation. 38 39 The Cbased program accepts the text to be searched and 40 signals a match each time the text 41 matches the compiled regular expression. 42 43 44 These lowlevel implementations take advantage of 45 the SIMD (singleinstruction multipledata) capabilities of commodity 46 processors to yield a dramatic speedup over 47 traditional byteatatime approaches. 48 49 On processors supporting Wbit 50 addition operations, the method processes W source characters 33 51 in parallel and performs up to W finite state transitions per clock cycle. 34 52 35 Performance results show a dramatic speedup over traditional and stateoftheart alternatives. 53 Additional parallelism is achieved through the introduction a new parallel 54 scanning primitive termed {\em Match Star}. % which eliminates backtracking in Kleene Closure. 55 56 57 We demonstrate the features and efficiency of our bitparallel pattern 58 matching appoach by searching text data sets for lines matching a regular expression. 59 60 We evaluate the performance of our method against the widely known grep implemenations, 61 {\em Gnu grep}, {\em agrep}, {\em nrgrep}, as well as {\em Google's re2} regular expression engine. 62 63 Performance results show a dramatic speedup over the traditional and stateoftheart alternatives. 64 36 65 37 66 \end{abstract} … … 92 121 regular expression pattern matching techniques and provides insight into the 93 122 efficiency of traditional regular expression software tools. Next, 94 Section~\ref{Data Parallel Bitstreams} describes our data parallel 95 regular expression matching techniques. 96 Section~\ref{Compiler Technology} 123 Section~\ref{Bitparallel Data Streams} describes our parallel regular expression matching techniques. 124 Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. 97 125 Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} 98 126 presents a detailed performance analysis of our data parallel bitstream techniques against … … 104 132 %\input{background.tex} 105 133 106 % Background 107 108 % History 109 110 Historically, the origins of regular expression matching date back to automata theory 134 The origins of regular expression matching date back to automata theory 111 135 and formal language theory developed by Kleene in the 1950s \cite{kleene1951}. 112 113 In 1959, Dana and Scott demonstrated that 114 NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA 115 state corresponds to a set of NFA states. 116 117 Thompson, in 1968, is credited with the first construction to convert regular expressions 136 Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions 118 137 to nondeterministic finite automata (NFA) for regular expression matching. 119 120 Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of121 regular expression implementations that construct automata for pattern matching.122 123 138 Following Thompson's approach, a regular expression of length m is first converted 124 139 to an NFA with O(m) nodes. It is possible to search a text of length n using the 125 NFA directly in O(mn) worst case time. Alternatively, a more effcient choice140 NFA directly in O(mn) worst case time. Often, a more efficient choice 126 141 is to convert the NFA into a DFA. A DFA has only a single active state and 127 142 allows to search the text at O(n) worstcase optimal. It is well known that the … … 129 144 That is the resultant DFA may have O($2^m$) states. 130 145 131 \section{Parallel Bitwise Data Streams} 132 \label{Parallel Bitwise Data Streams} 146 Thompson's original work marked the beginning of a long line of 147 regular expression implementations that 148 process an input string, characteratatime, and that transition patterm matching state 149 based on the current state and the next character read. 150 151 152 Bitparallel simulation of automata. 153 154 The ideas presented up to now aim at a good implementation of the automa 155 ton, but they must inspect all the text characters. In many cases, however, the 156 regular expression involves sets of relatively long substrings that must appear 157 for the regular expression to match. 158 159 160 Suffix automata (BDM/BNDM) 161 162 Skip characters pattern length (occurence frequency and length). 163 164 165 \section{Bitparallel Data Streams} 166 \label{Bitparallel data streams} 167 168 The advantage of the parallel bit stream representation is that we can 169 use the 128bit SIMD registers commonly found on commodity processors 170 (e.g. SSE on Intel) to process 128 byte positions at a time using 171 bitwise logic, shifting and other operations. 172 173 Skip characters register width. 174 175 \subsection{Software Tools} 176 177 %Thompson created the first grep (UNIX grep) as a standalone adaptation 178 %of the regular expression parser he had written for the UNIX ed utility. 179 %In 1976, Aho improved upon Thompson's implementation that with a DFAbased implementation called egrep. 180 %Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the 181 %time it took to build a complete finite automaton for the regular expression before it could even start searching. 182 %Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time 183 %to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep. 184 %It took zero setup time and just one additional test in the inner loop. 185 %http://pages.cs.wisc.edu/~mdant/cs520_4.html 186 187 \subsection{Mask Star} 188 189 %Wikipedia 190 Backtracking is a general algorithm for finding all solutions to some computational problem, 191 that incrementally builds candidates to the solutions. 133 192 134 193 \section{Compiler Technology} 
docs/Working/re/remain.tex.backup
r3135 r3138 23 23 24 24 A parallel regular expression matching method is introduced and studied in 25 application to the problem of online pattern matching. 26 27 The method is based on the concept of parallel 28 bitstream technology, in which parallel streams of bits are formed such 25 comparison with software tools designed for efficient online text searching such 26 as the {\em grep} family. 27 28 The method is based on the concept of bitparallel data streams, 29 in which parallel streams of bits are formed such 29 30 that each stream comprises bits in onetoone correspondence with the 30 character code units of a source data stream. 31 32 On processors supporting Wbit addition operations, the method processes W source characters 31 character code units of a source data stream. 32 33 An implementation of the method in the form of a regular expression 34 compiler is discussed. The compiler accepts a regular expression and produces 35 sequences of arbitrarylength bitparallel equations. 36 Bitparallel equations are then transformed 37 into a lowlevel Cbased implementation. 38 39 The Cbased program accepts the text to be searched and 40 signals a match each time the text 41 matches the compiled regular expression. 42 43 44 These lowlevel implementations take advantage of 45 the SIMD (singleinstruction multipledata) capabilities of commodity 46 processors to yield a dramatic speedup over 47 traditional byteatatime approaches. 48 49 On processors supporting Wbit 50 addition operations, the method processes W source characters 33 51 in parallel and performs up to W finite state transitions per clock cycle. 34 52 35 Performance results show a dramatic speedup over traditional and stateoftheart alternatives. 53 Additional parallelism is achieved through the introduction a new parallel 54 scanning primitive termed {\em Match Star}. % which eliminates backtracking in Kleene Closure. 55 56 57 We demonstrate the features and efficiency of our bitparallel pattern 58 matching appoach by searching text data sets for lines matching a regular expression. 59 60 We evaluate the performance of our method against the widely known grep implemenations, 61 {\em Gnu grep}, {\em agrep}, {\em nrgrep}, as well as {\em Google's re2} regular expression engine. 62 63 Performance results show a dramatic speedup over the traditional and stateoftheart alternatives. 64 36 65 37 66 \end{abstract} … … 92 121 regular expression pattern matching techniques and provides insight into the 93 122 efficiency of traditional regular expression software tools. Next, 94 Section~\ref{Data Parallel Bitstreams} describes our data parallel 95 regular expression matching techniques. 96 Section~\ref{Compiler Technology} 123 Section~\ref{Bitparallel Data Streams} describes our parallel regular expression matching techniques. 124 Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. 97 125 Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} 98 126 presents a detailed performance analysis of our data parallel bitstream techniques against … … 104 132 %\input{background.tex} 105 133 106 % Background 107 108 % History 109 110 Historically, the origins of regular expression matching date back to automata theory 134 The origins of regular expression matching date back to automata theory 111 135 and formal language theory developed by Kleene in the 1950s \cite{kleene1951}. 112 113 In 1959, Dana and Scott demonstrated that 114 NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA 115 state corresponds to a set of NFA states. 116 117 Thompson, in 1968, is credited with the first construction to convert regular expressions 136 Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions 118 137 to nondeterministic finite automata (NFA) for regular expression matching. 119 120 Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of121 regular expression implementations that construct automata for pattern matching.122 123 138 Following Thompson's approach, a regular expression of length m is first converted 124 139 to an NFA with O(m) nodes. It is possible to search a text of length n using the 125 NFA directly in O(mn) worst case time. Alternatively, a more eff cient choice140 NFA directly in O(mn) worst case time. Alternatively, a more efficient choice 126 141 is to convert the NFA into a DFA. A DFA has only a single active state and 127 allows to search the text at O(n) worstcase optimal. The problem with this 128 approach is that the DFA may have O($2^m$) states. 129 130 131 \section{Parallel Bitwise Data Streams} 132 \label{Parallel Bitwise Data Streams} 142 allows to search the text at O(n) worstcase optimal. It is well known that the 143 conversion of the NFA to the DFA may result in the state explosion problem. 144 That is the resultant DFA may have O($2^m$) states. 145 146 Thompson's original work marked the beginning of a long line of 147 regular expression implementations that 148 process an input string, characterbycharacter, and that change based on the current state and the 149 character read. Thompson created the first grep utility as a standalone adaptation 150 of the regular expression parser he had written for the UNIX ed utility. In 1976, 151 Aho improved upon Thompson's implementation that with a DFAbased implementation called egrep. 152 153 Bitparallel simulation of automata. 154 155 Suffix automata (BDM/BNDM) 156 157 Skip characters pattern length (occurence frequency and length). 158 159 160 \section{Bitparallel Data Streams} 161 \label{Bitparallel data streams} 162 163 The advantage of the parallel bit stream representation is that we can 164 use the 128bit SIMD registers commonly found on commodity processors 165 (e.g. SSE on Intel) to process 128 byte positions at a time using 166 bitwise logic, shifting and other operations. 167 168 Skip characters register width. 169 170 \subsection{Mask Star} 171 172 %Wikipedia 173 Backtracking is a general algorithm for finding all solutions to some computational problem, 174 that incrementally builds candidates to the solutions. 133 175 134 176 \section{Compiler Technology}
Note: See TracChangeset
for help on using the changeset viewer.