# Changeset 3124

Ignore:
Timestamp:
May 10, 2013, 1:10:27 PM (6 years ago)
Message:

Rough notes.

Location:
docs/Working/re
Files:
2 edited

### Legend:

Unmodified
 r3123 Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives. \end{abstract} \label{Introduction} %\input{introduction.tex} Regular expresssion matching is an extensively studied problem with a multitude of algorithms and software tools developed to the demands of particular problem contexts. Given a text T$_{1..n}$ of n characters and a pattern P, the pattern matching problem can be The pattern P can be just a simple string, but it can also be, for example, a regular expression. In this paper our focus is regular expression matching. A regular expression, or pattern, is an expression that specifies a set of strings. A regular expression is composed of (i) basic strings and (ii) A regular expression is composed of (i) simple strings (ii) the empty or (ii) union, concatenation and Kleene closure of other regular expressions. To avoid parentheses it is assumed that the Kleene star has the highest priority, next concatenation and then alternation, however, most formalisms provides grouping operators to allow the definition of scope and operator precedence. Readers unfamiliar with the concept of regular expression matching are referred classical texts such as \cite{aho2007}. \subsection{Grep} Regular expression matching is commonly performed using a variety of publically available software tools. Namely, the UNIX grep family, the GNU grep family, agrep, cgrep, sgrep, nrgrep, and Perl regular publically available software tools. The most prominent, UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular expressions \cite{Abou-assaleh04surveyof}. % Grep % Unix grep % Gnu grep % agrep % nrgrep Amongst these Gnu grep, agrep, and nrgrep are widely known and considered as the fastest regular expression matching tools in practice \cite{}. Of particular interest are well-studied and performance oriented Gnu grep, agrep, and nrgrep. Of particular interest to this study are the performance oriented Gnu grep, agrep, and nrgrep. As such, we compare the performance of our parallel \bitstream{} techniques against % motivation / previous work Although the finite state machine methods used in the scanning and parsing of text streams is considered to be the hardest of the â13 dwarvesâ to parallelize [1], parallel bitstream technology shows considerable promise for these types of applications [3, 4]. In this approach, character streams are processed W positions at a time using the W-bit SIMD registers commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by first slicing the byte streams into eight separate basis bitstreams, one for each bit position within the byte. These basis bitstreams are then combined with bitwise logic and shifting operations to compute further parallel bit streams of interest. We further increase the parallelism in our methods by introducing a new parallel scanning primitive which we have coined 'Match Star' that returns all matches in a single operation and eliminates the need for back tracking ... (ELABORATE) --- STATE the content of the next sections. --- We compare the performance of our parallel \bitstream{} techniques against various grep concentrate on the simpler case of reporting initial or final occurrence positions. \section{Background} \label{Background} %\input{background.tex} % Background % History Regular expresssion matching is an extensively studied problem with a multitude of algorithms and software tools developed to the demands of particular problem contexts. Historically, the origins of regular expression matching date back to automata theory may have O(2^m) states. Overall, the general process is first to build a NFA from the regular expression, convert the NFA into a DFA, minimize the number of states in the DFA, and finally use the DFA to scan the text. In general, the general process is first to build a NFA from the regular expression and simulate the NFA on text input, or alternatively to convert the NFA into a DFA, optionally minimize the number of states in the DFA, and finally simulate the DFA on text input. \section{Background} \label{Background} %\input{background.tex} \section{Methodology}
 r3123 Performance results show a dramatic speed-up over traditional and state-of-the-art alternatives. \end{abstract} \label{Introduction} %\input{introduction.tex} % parallel bitstream technology, parallelization, regular expressions % motivation / previous work Although the finite state machine methods used in the scanning and parsing of text streams is considered to be the hardest of the â13 dwarvesâ to parallelize [1], parallel bitstream technology shows considerable promise for these types of applications [3, 4]. In this approach, character streams are processed W positions at a time using the W-bit SIMD registers commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by first slicing the byte streams into eight separate basis bitstreams, one for each bit position within the byte. These basis bitstreams are then combined with bitwise logic and shifting operations to compute further parallel bit streams of interest. We further increase the parallelism in our methods by introducing a new parallel scanning primitive which we have coined 'Match Star' that returns all matches in a single operation and eliminates the need for back tracking ... (ELABORATE) Given a text T$_{1..n}$ of n characters and a pattern P, the pattern matching problem can be The pattern P can be just a simple string, but it can also be, for example, a regular expression. In this paper our focus is regular expression matching. A regular expression, or pattern, is an expression that specifies a set of strings. A regular expression is composed of (i) basic strings and (ii) A regular expression is composed of (i) simple strings (ii) the empty or (ii) union, concatenation and Kleene closure of other regular expressions. To avoid parentheses it is assumed that the Kleene star has the highest priority, next concatenation and then alternation, however, most formalisms provides grouping operators to allow the definition of scope and operator precedence. Readers unfamiliar with the concept of regular expression matching are referred classical texts such as \cite{aho2007}. \subsection{Grep} Regular expression matching is commonly performed using a variety of publically available software tools. Namely, the UNIX grep family, the GNU grep family, agrep, cgrep, sgrep, nrgrep, and Perl regular publically available software tools. The most prominent, UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular expressions \cite{Abou-assaleh04surveyof}. % Grep Amongst these Gnu grep, agrep, and nrgrep are widely known and considered as the fastest regular expression matching tools in practice \cite{}. Of particular interest to this study are the performance oriented Gnu grep, agrep, and nrgrep. % Unix grep % Gnu grep % nrgrep Of particular interest are well-studied and performance oriented Gnu grep, agrep, and nrgrep. Regular expresssion matching is an extensively studied problem with a multitude of algorithms and software tools developed to the demands of particular problem contexts. As such, we compare the performance of our parallel \bitstream{} techniques against reporting initial or final occurrence positions. \section{Background} \label{Background} %\input{background.tex} % Background % History Regular expresssion matching is an extensively studied problem with a multitude of algorithms and software tools developed to the demands of particular problem contexts. Historically, the origins of regular expression matching date back to automata theory and formal language theory developed by Kleene in the 1950s \cite{kleene1951representation}. The traditional technique [16] to search a regular expression of length m in a text of length n is to first convert the expression into a non-deterministic automaton (NFA) with O(m) nodes. In 1959, Dana and Scott demonstrated that NFAs can be simulated using Deterministic Finite Automata (DFA) in which each DFA state corresponds to a set of NFA states. Thompson, in 1968, is credited with the first construction to convert regular expressions to nondeterministic finite automata (NFA) for regular expression matching. Thompsonâs publication \cite{thompson1968} marked the beginning of a long line of regular expression implementations that construct automata for pattern matching. The general process is first to build a NFA. Automaton (NFA) from the regular expression, then convert the NFA into a Deterministic Finite Automaton (DFA), and finally use the DFA to scan the text. It is possible to search the text using the NFA directly in O(mn) worst case time. The cost comes from the fact that The traditional technique [16] to search a regular expression of length m in a text of length n is to first convert the expression into a non-deterministic automaton (NFA) with O(m) nodes. It is possible to search the text using the NFA directly in O(mn) worst case time. The cost comes from the fact that more than one state of the NFA may be active at each step, and therefore all may need to be updated. A more effcient choice is to convert the NFA into a deterministic finite automaton (DFA). A DFA has only a single active state and allows to search the text at O(n) worst-case optimal. The problem with this approach is that the DFA may have O(2^m) states. DFA. A DFA has only a single active state and allows to search the text at O(n) worst-case optimal. The problem with this approach is that the DFA may have O(2^m) states. In general, the general process is first to build a NFA from the regular expression and simulate the NFA on text input, or alternatively to convert the NFA into a DFA, optionally minimize the number of states in the DFA, and finally simulate the DFA on text input. \section{Background} \label{Background} %\input{background.tex} \section{Methodology}