# Changeset 3138 for docs/Working

Timestamp:
May 14, 2013, 2:38:17 PM
Message:

Rough notes.

Location:
docs/Working/re
Files:
2 edited

A parallel regular expression matching method is introduced and studied in comparison with software tools designed for efficient on-line text searching, such as the {\em grep} family. The method is based on the concept of bit-parallel data streams, in which parallel streams of bits are formed such that each stream comprises bits in one-to-one correspondence with the character code units of a source data stream. An implementation of the method in the form of a regular expression compiler is discussed. The compiler accepts a regular expression and produces sequences of arbitrary-length bit-parallel equations. The bit-parallel equations are then transformed into a low-level C-based implementation, which accepts the text to be searched and signals a match each time the text matches the compiled regular expression. These low-level implementations take advantage of the SIMD (single-instruction, multiple-data) capabilities of commodity processors to yield a dramatic speed-up over traditional byte-at-a-time approaches. On processors supporting W-bit addition operations, the method processes W character code units in parallel and performs up to W finite state transitions per clock cycle. Additional parallelism is achieved through the introduction of a new parallel scanning primitive termed {\em Match Star}. % which eliminates backtracking in Kleene Closure.
We demonstrate the features and efficiency of our bit-parallel pattern matching approach by searching text data sets for lines matching a regular expression. We evaluate the performance of our method against the widely known grep implementations {\em GNU grep}, {\em agrep}, and {\em nr-grep}, as well as Google's {\em RE2} regular expression engine. Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
\end{abstract}

regular expression pattern matching techniques and provides insight into the efficiency of traditional regular expression software tools. Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques. Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} presents a detailed performance analysis of our bit-parallel data stream techniques against traditional and state-of-the-art alternatives.

%\input{background.tex}
% Background
% History
The origins of regular expression matching date back to automata theory and formal language theory developed by Kleene in the 1950s \cite{kleene1951}. In 1959, Rabin and Scott demonstrated that NFAs can be simulated using Deterministic Finite Automata (DFAs), in which each DFA state corresponds to a set of NFA states.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions to nondeterministic finite automata (NFAs) for regular expression matching. Thompson's publication marked the beginning of a long line of regular expression implementations that construct automata for pattern matching and process an input string one character at a time, transitioning the pattern-matching state based on the current state and the next character read. Following Thompson's approach, a regular expression of length $m$ is first converted to an NFA with $O(m)$ nodes. It is possible to search a text of length $n$ using the NFA directly in $O(mn)$ worst-case time. Often, a more efficient choice is to convert the NFA into a DFA. A DFA has only a single active state, which allows the text to be searched in $O(n)$ worst-case optimal time. It is well known, however, that the conversion of an NFA to a DFA may result in the state explosion problem: the resultant DFA may have $O(2^m)$ states (for example, expressions of the form $(a|b)^{*}a(a|b)^{k}$ have NFAs with $O(k)$ states but minimal DFAs that grow exponentially in $k$).

Bit-parallel simulation of automata: the ideas presented up to now aim at a good implementation of the automaton, but they must inspect all the text characters. In many cases, however, the regular expression involves sets of relatively long substrings that must appear for the regular expression to match. Suffix automata (BDM/BNDM) exploit this to skip characters; the number of characters that can be skipped is bounded by the pattern length (and depends on occurrence frequency and length).

\section{Bit-parallel Data Streams}
\label{Bit-parallel Data Streams}
The advantage of the parallel bit stream representation is that we can use the 128-bit SIMD registers commonly found on commodity processors (e.g., SSE on Intel) to process 128 byte positions at a time using bitwise logic, shifting and other operations. Whereas the character-skipping approaches above can advance by at most the pattern length, here the number of positions handled per step is determined by the register width.
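As a minimal illustration of this representation, the following sketch forms character-class bitstreams and matches a two-character pattern using only shifts and bitwise logic. It assumes 64-bit words in place of the 128-bit SIMD registers discussed above; the helper names (\texttt{char\_class}, \texttt{advance}) and the example text are purely illustrative and are not part of the toolchain described in this paper.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: one 64-bit word stands in for a 128-bit SIMD register,
   so a single word covers 64 byte positions of the text. */

/* Character-class bitstream: bit i is set iff text[i] == c. */
static uint64_t char_class(const char *text, size_t len, char c) {
    uint64_t cc = 0;
    for (size_t i = 0; i < len && i < 64; i++)
        if (text[i] == c)
            cc |= (uint64_t)1 << i;
    return cc;
}

/* Advance a marker stream by one character position. */
static uint64_t advance(uint64_t m) { return m << 1; }

int main(void) {
    const char *text = "abcabzab";
    size_t len = strlen(text);
    uint64_t cc_a = char_class(text, len, 'a');
    uint64_t cc_b = char_class(text, len, 'b');

    /* Match the pattern "ab" at every position in parallel: markers
       start everywhere and survive only where 'a' then 'b' occur. */
    uint64_t m0 = (len < 64) ? (((uint64_t)1 << len) - 1) : ~(uint64_t)0;
    uint64_t m1 = advance(m0 & cc_a);   /* positions just after an 'a' */
    uint64_t m2 = advance(m1 & cc_b);   /* positions just after "ab"   */

    printf("match-end bitstream: %016llx\n", (unsigned long long)m2);
    return 0;
}
\end{verbatim}

A full implementation would stream the text block by block, carrying marker bits across block boundaries, and would use the full SIMD register width rather than a single 64-bit word.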
\subsection{Software Tools}
%Thompson created the first grep (UNIX grep) as a standalone adaptation
%of the regular expression parser he had written for the UNIX ed utility.
%In 1976, Aho improved upon Thompson's implementation with a DFA-based implementation called egrep.
%Egrep was faster than grep for simple patterns, but for more complex searches it lagged behind because of the
%time it took to build a complete finite automaton for the regular expression before it could even start searching.
%Since grep used a nondeterministic finite automaton, it took less time to build the state machine but more time
%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
%It required zero set-up time and just one additional test in the inner loop.
%http://pages.cs.wisc.edu/~mdant/cs520_4.html

\subsection{Match Star}
%Wikipedia: Backtracking is a general algorithm for finding all solutions to some computational problem that incrementally builds candidates to the solutions.
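The abstract names {\em Match Star} as a parallel scanning primitive that avoids backtracking for Kleene closure. The sketch below shows one way such a primitive can be realized with carry-propagating addition on a single 64-bit block; this particular formulation is an assumption made for illustration, not a statement of the implementation used in this paper.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Sketch of a MatchStar-style scanning step on one 64-bit block.
   m: marker bitstream (current candidate match positions)
   c: character-class bitstream for the repeated class C in "C*"
   result: markers at every position reachable from m by consuming
   zero or more characters of class C, with no backtracking. */
static uint64_t match_star(uint64_t m, uint64_t c) {
    return (((m & c) + c) ^ c) | m;
}

int main(void) {
    /* The class occupies positions 0-2; one marker sits at position 0.
       A single addition carries the marker through the whole run. */
    uint64_t c = 0x7;  /* ...00000111 */
    uint64_t m = 0x1;  /* ...00000001 */
    printf("%016llx\n", (unsigned long long)match_star(m, c)); /* ...00001111 */
    return 0;
}
\end{verbatim}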
\section{Compiler Technology}
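As a hypothetical illustration of the sequences of bit-parallel equations that such a compiler might emit (the example expression \verb!a[0-9]*z!, the marker streams $m_i$, and the operation names $\mathrm{Advance}$ and $\mathrm{MatchStar}$ are assumptions for this sketch, following the primitives discussed above), the compiled form might read:
\[
\begin{array}{l}
m_1 = \mathrm{Advance}(m_0 \wedge \mathit{CC}_{a}) \\
m_2 = \mathrm{MatchStar}(m_1, \mathit{CC}_{[0-9]}) \\
m_3 = \mathrm{Advance}(m_2 \wedge \mathit{CC}_{z})
\end{array}
\]
where $m_0$ marks all candidate starting positions, each $\mathit{CC}$ is a character-class bitstream, and $m_3$ marks positions immediately following a complete match. Such equations translate directly into straight-line bitwise C code of the kind sketched earlier.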