Changeset 3138 for docs/Working/re


Ignore:
Timestamp:
May 14, 2013, 2:38:17 PM (6 years ago)
Author:
ksherdy
Message:

Rough notes.

Location:
docs/Working/re
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/re-main.tex

    r3135 r3138  
    2323
    2424A parallel regular expression matching method is introduced and studied in
    25 application to the problem of online pattern matching.
    26 
    27 The method is based on the concept of parallel
    28 bitstream technology, in which parallel streams of bits are formed such
     25comparison with software tools designed for efficient on-line text searching such
     26as the {\em grep} family.
     27
     28The method is based on the concept of bit-parallel data streams,
     29in which parallel streams of bits are formed such
    2930that each stream comprises bits in one-to-one correspondence with the
    30 character code units of a source data stream.
    31 
    32 On processors supporting W-bit addition operations, the method processes W source characters 
     31character code units of a source data stream.
     32
     33An implementation of the method in the form of a regular expression
     34compiler is discussed. The compiler accepts a regular expression and produces
     35sequences of arbitrary-length bit-parallel equations.
     36Bit-parallel equations are then transformed
     37into a low-level C-based implementation.
     38
     39The C-based program accepts the text to be searched and
     40signals a match each time the text
     41matches the compiled regular expression.
     42
     43
     44These low-level implementations take advantage of
     45the SIMD (single-instruction multiple-data) capabilities of commodity
     46processors to yield a dramatic speed-up over
     47traditional byte-at-a-time approaches.
     48
     49On processors supporting W-bit
     50addition operations, the method processes W source characters 
    3351in parallel and performs up to W finite state transitions per clock cycle.
    3452
    35 Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
     53Additional parallelism is achieved through the introduction a new parallel
     54scanning primitive termed {\em Match Star}. % which eliminates backtracking in Kleene Closure.
     55
     56
     57We demonstrate the features and efficiency of our bit-parallel pattern
     58matching appoach by searching text data sets for lines matching a regular expression.
     59
     60We evaluate the performance of our method against the widely known grep implemenations,
     61{\em Gnu grep}, {\em agrep}, {\em nr-grep}, as well as {\em Google's re2} regular expression engine.
     62
     63Performance results show a dramatic speed-up over the traditional and state-of-the-art alternatives.
     64
    3665
    3766\end{abstract}
     
    92121regular expression pattern matching techniques and provides insight into the
    93122efficiency of traditional regular expression software tools. Next,
    94 Section~\ref{Data Parallel Bitstreams} describes our data parallel
    95 regular expression matching techniques.
    96 Section~\ref{Compiler Technology}
     123Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
     124Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
    97125Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
    98126presents a detailed performance analysis of our data parallel bitstream techniques against
     
    104132%\input{background.tex}
    105133
    106 % Background
    107 
    108 % History
    109 
    110 Historically, the origins of regular expression matching date back to automata theory
     134The origins of regular expression matching date back to automata theory
    111135and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
    112 
    113 In 1959, Dana and Scott demonstrated that
    114 NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA
    115 state corresponds to a set of NFA states.
    116 
    117 Thompson, in 1968, is credited with the first construction to convert regular expressions
     136Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
    118137to nondeterministic finite automata (NFA) for regular expression matching.
    119 
    120 Thompson’s publication \cite{thompson1968} marked the beginning of a long line of
    121 regular expression implementations that construct automata for pattern matching.
    122 
    123138Following Thompson's approach, a regular expression of length m is first converted
    124139to an NFA with O(m) nodes. It is possible to search a text of length n using the
    125 NFA directly in O(mn) worst case time. Alternatively, a more effcient choice
     140NFA directly in O(mn) worst case time. Often, a more efficient choice
    126141is to convert the NFA into a DFA. A DFA has only a single active state and
    127142allows to search the text at O(n) worst-case optimal. It is well known that the
     
    129144That is the resultant DFA may have O($2^m$) states.
    130145
    131 \section{Parallel Bitwise Data Streams}
    132 \label{Parallel Bitwise Data Streams}
     146Thompson's original work marked the beginning of a long line of
     147regular expression implementations that
     148process an input string, character-at-a-time, and that transition patterm matching state
     149based on the current state and the next character read.
     150
     151
     152Bit-parallel simulation of automata.
     153
     154The ideas presented up to now aim at a good implementation of the automa-
     155ton, but they must inspect all the text characters. In many cases, however, the
     156regular expression involves sets of relatively long substrings that must appear
     157for the regular expression to match.
     158
     159
     160Suffix automata (BDM/BNDM)
     161
     162Skip characters pattern length (occurence frequency and length).
     163
     164
     165\section{Bit-parallel Data Streams}
     166\label{Bit-parallel data streams}
     167
     168The advantage of the parallel bit stream representation is that we can
     169use the 128-bit SIMD registers commonly found on commodity processors
     170(e.g. SSE on Intel) to process 128 byte positions at a time using
     171bitwise logic, shifting and other operations.
     172
     173Skip characters register width.
     174
     175\subsection{Software Tools}
     176
     177%Thompson created the first grep (UNIX grep) as a standalone adaptation
     178%of the regular expression parser he had written for the UNIX ed utility.
     179%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
     180%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
     181%time it took to build a complete finite automaton for the regular expression before it could even start searching.
     182%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
     183%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
     184%It took zero set-up time and just one additional test in the inner loop.
     185%http://pages.cs.wisc.edu/~mdant/cs520_4.html
     186
     187\subsection{Mask Star}
     188
     189%Wikipedia
     190Backtracking is a general algorithm for finding all solutions to some computational problem,
     191that incrementally builds candidates to the solutions.
    133192
    134193\section{Compiler Technology}
  • docs/Working/re/re-main.tex.backup

    r3135 r3138  
    2323
    2424A parallel regular expression matching method is introduced and studied in
    25 application to the problem of online pattern matching.
    26 
    27 The method is based on the concept of parallel
    28 bitstream technology, in which parallel streams of bits are formed such
     25comparison with software tools designed for efficient on-line text searching such
     26as the {\em grep} family.
     27
     28The method is based on the concept of bit-parallel data streams,
     29in which parallel streams of bits are formed such
    2930that each stream comprises bits in one-to-one correspondence with the
    30 character code units of a source data stream.
    31 
    32 On processors supporting W-bit addition operations, the method processes W source characters 
     31character code units of a source data stream.
     32
     33An implementation of the method in the form of a regular expression
     34compiler is discussed. The compiler accepts a regular expression and produces
     35sequences of arbitrary-length bit-parallel equations.
     36Bit-parallel equations are then transformed
     37into a low-level C-based implementation.
     38
     39The C-based program accepts the text to be searched and
     40signals a match each time the text
     41matches the compiled regular expression.
     42
     43
     44These low-level implementations take advantage of
     45the SIMD (single-instruction multiple-data) capabilities of commodity
     46processors to yield a dramatic speed-up over
     47traditional byte-at-a-time approaches.
     48
     49On processors supporting W-bit
     50addition operations, the method processes W source characters 
    3351in parallel and performs up to W finite state transitions per clock cycle.
    3452
    35 Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives.
     53Additional parallelism is achieved through the introduction a new parallel
     54scanning primitive termed {\em Match Star}. % which eliminates backtracking in Kleene Closure.
     55
     56
     57We demonstrate the features and efficiency of our bit-parallel pattern
     58matching appoach by searching text data sets for lines matching a regular expression.
     59
     60We evaluate the performance of our method against the widely known grep implemenations,
     61{\em Gnu grep}, {\em agrep}, {\em nr-grep}, as well as {\em Google's re2} regular expression engine.
     62
     63Performance results show a dramatic speed-up over the traditional and state-of-the-art alternatives.
     64
    3665
    3766\end{abstract}
     
    92121regular expression pattern matching techniques and provides insight into the
    93122efficiency of traditional regular expression software tools. Next,
    94 Section~\ref{Data Parallel Bitstreams} describes our data parallel
    95 regular expression matching techniques.
    96 Section~\ref{Compiler Technology}
     123Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
     124Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
    97125Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
    98126presents a detailed performance analysis of our data parallel bitstream techniques against
     
    104132%\input{background.tex}
    105133
    106 % Background
    107 
    108 % History
    109 
    110 Historically, the origins of regular expression matching date back to automata theory
     134The origins of regular expression matching date back to automata theory
    111135and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
    112 
    113 In 1959, Dana and Scott demonstrated that
    114 NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA
    115 state corresponds to a set of NFA states.
    116 
    117 Thompson, in 1968, is credited with the first construction to convert regular expressions
     136Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
    118137to nondeterministic finite automata (NFA) for regular expression matching.
    119 
    120 Thompson’s publication \cite{thompson1968} marked the beginning of a long line of
    121 regular expression implementations that construct automata for pattern matching.
    122 
    123138Following Thompson's approach, a regular expression of length m is first converted
    124139to an NFA with O(m) nodes. It is possible to search a text of length n using the
    125 NFA directly in O(mn) worst case time. Alternatively, a more effcient choice
     140NFA directly in O(mn) worst case time. Alternatively, a more efficient choice
    126141is to convert the NFA into a DFA. A DFA has only a single active state and
    127 allows to search the text at O(n) worst-case optimal. The problem with this
    128 approach is that the DFA may have O($2^m$) states.
    129 
    130 
    131 \section{Parallel Bitwise Data Streams}
    132 \label{Parallel Bitwise Data Streams}
     142allows to search the text at O(n) worst-case optimal. It is well known that the
     143conversion of the NFA to the DFA may result in the state explosion problem.
     144That is the resultant DFA may have O($2^m$) states.
     145
     146Thompson's original work marked the beginning of a long line of
     147regular expression implementations that
     148process an input string, character-by-character, and that change based on the current state and the
     149character read. Thompson created the first grep utility as a standalone adaptation
     150of the regular expression parser he had written for the UNIX ed utility. In 1976,
     151Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
     152
     153Bit-parallel simulation of automata.
     154
     155Suffix automata (BDM/BNDM)
     156
     157Skip characters pattern length (occurence frequency and length).
     158
     159
     160\section{Bit-parallel Data Streams}
     161\label{Bit-parallel data streams}
     162
     163The advantage of the parallel bit stream representation is that we can
     164use the 128-bit SIMD registers commonly found on commodity processors
     165(e.g. SSE on Intel) to process 128 byte positions at a time using
     166bitwise logic, shifting and other operations.
     167
     168Skip characters register width.
     169
     170\subsection{Mask Star}
     171
     172%Wikipedia
     173Backtracking is a general algorithm for finding all solutions to some computational problem,
     174that incrementally builds candidates to the solutions.
    133175
    134176\section{Compiler Technology}
Note: See TracChangeset for help on using the changeset viewer.