Changeset 3154


Ignore:
Timestamp:
May 18, 2013, 12:45:44 AM (6 years ago)
Author:
ksherdy
Message:

Added references. Cleaned up background and abstract.

Location:
docs/Working/re
Files:
8 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/re-main.aux

    r3149 r3154  
    1212\citation{thompson1968}
    1313\citation{boyer1977fast}
    14 \citation{navarro2002flexible}
     14\citation{Navarro02patternmatching}
    1515\@writefile{toc}{\contentsline {section}{\numberline {2}Basic Concepts}{3}}
    1616\newlabel{Basic Concepts}{{2}{3}}
     
    1919\@writefile{toc}{\contentsline {section}{\numberline {3}Background}{3}}
    2020\newlabel{Background}{{3}{3}}
    21 \@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Classical Methods}{3}}
    22 \@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Regular Expression and Finite Automata}{3}}
     21\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Regular Expressions and Finite Automata}{3}}
     22\citation{Baeza-yates_anew}
    2323\citation{wu1992fast}
    24 \citation{navarro1998bit}
    2524\citation{navarro1998bit}
    2625\bibstyle{acm}
     
    2827\bibcite{abou-assaleh2004}{1}
    2928\bibcite{aho2007}{2}
    30 \bibcite{boyer1977fast}{3}
    31 \bibcite{cameron2009parallel}{4}
    32 \@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Bit-parallel Simulation of Automata}{4}}
    33 \@writefile{toc}{\contentsline {subsection}{\numberline {3.4}Software Tools}{4}}
     29\bibcite{Baeza-yates_anew}{3}
     30\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Bit-parallel Simulation of Automata}{4}}
     31\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Software Tools}{4}}
    3432\@writefile{toc}{\contentsline {section}{\numberline {4}Bit-parallel Data Streams}{4}}
    3533\newlabel{Bit-parallel Data Streams}{{4}{4}}
     
    4341\@writefile{toc}{\contentsline {section}{\numberline {8}Conclusion}{4}}
    4442\newlabel{Conclusion}{{8}{4}}
    45 \bibcite{cameron2011parallel}{5}
    46 \bibcite{cameron2008high}{6}
    47 \bibcite{kleene1951}{7}
    48 \bibcite{lin2012parabix}{8}
    49 \bibcite{navarro2000}{9}
    50 \bibcite{navarro1998bit}{10}
    51 \bibcite{navarro2002flexible}{11}
    52 \bibcite{thompson1968}{12}
    53 \bibcite{wu1992fast}{13}
     43\bibcite{boyer1977fast}{4}
     44\bibcite{cameron2009parallel}{5}
     45\bibcite{cameron2011parallel}{6}
     46\bibcite{cameron2008high}{7}
     47\bibcite{kleene1951}{8}
     48\bibcite{lin2012parabix}{9}
     49\bibcite{navarro2000}{10}
     50\bibcite{Navarro02patternmatching}{11}
     51\bibcite{navarro1998bit}{12}
     52\bibcite{thompson1968}{13}
     53\bibcite{wu1992fast}{14}
  • docs/Working/re/re-main.bbl

    r3149 r3154  
    1010\newblock {\em Compilers: principles, techniques, and tools}.
    1111\newblock Addison Wesley, 2007.
     12
     13\bibitem{Baeza-yates_anew}
     14{\sc Baeza-yates, R.~A., Encalada, B., and Gonnet, G.~H.}
     15\newblock A new approach to text searching.
    1216
    1317\bibitem{boyer1977fast}
     
    5256\newblock {\em Software Practice and Experience (SPE 31\/} (2000), 2001.
    5357
     58\bibitem{Navarro02patternmatching}
     59{\sc Navarro, G.}
     60\newblock Pattern matching.
     61\newblock {\em Journal of Applied Statistics 31\/} (2002).
     62
    5463\bibitem{navarro1998bit}
    5564{\sc Navarro, G., and Raffinot, M.}
     
    5867\newblock In {\em Combinatorial Pattern Matching\/} (1998), Springer,
    5968  pp.~14--33.
    60 
    61 \bibitem{navarro2002flexible}
    62 {\sc Navarro, G., and Raffinot, M.}
    63 \newblock {\em Flexible pattern matching in strings: practical on-line search
    64   algorithms for texts and biological sequences}.
    65 \newblock Cambridge University Press, 2002.
    6669
    6770\bibitem{thompson1968}
  • docs/Working/re/re-main.blg

    r3149 r3154  
    55Warning--empty institution in abou-assaleh2004
    66Warning--empty journal in kleene1951
    7 You've used 13 entries,
     7You've used 14 entries,
    88            2253 wiz_defined-function locations,
    9             606 strings with 5962 characters,
    10 and the built_in function-call counts, 4215 in all, are:
    11 = -- 398
    12 > -- 191
     9            611 strings with 6029 characters,
     10and the built_in function-call counts, 4431 in all, are:
     11= -- 416
     12> -- 203
    1313< -- 0
    14 + -- 81
    15 - -- 66
    16 * -- 305
    17 := -- 692
    18 add.period$ -- 39
    19 call.type$ -- 13
    20 change.case$ -- 71
     14+ -- 87
     15- -- 70
     16* -- 320
     17:= -- 727
     18add.period$ -- 41
     19call.type$ -- 14
     20change.case$ -- 77
    2121chr.to.int$ -- 0
    22 cite$ -- 15
    23 duplicate$ -- 176
    24 empty$ -- 353
    25 format.name$ -- 66
    26 if$ -- 885
     22cite$ -- 16
     23duplicate$ -- 180
     24empty$ -- 372
     25format.name$ -- 70
     26if$ -- 928
    2727int.to.chr$ -- 0
    28 int.to.str$ -- 13
    29 missing$ -- 14
    30 newline$ -- 67
    31 num.names$ -- 26
    32 pop$ -- 89
     28int.to.str$ -- 14
     29missing$ -- 13
     30newline$ -- 71
     31num.names$ -- 28
     32pop$ -- 93
    3333preamble$ -- 1
    34 purify$ -- 59
     34purify$ -- 63
    3535quote$ -- 0
    36 skip$ -- 102
     36skip$ -- 111
    3737stack$ -- 0
    38 substring$ -- 210
    39 swap$ -- 42
     38substring$ -- 216
     39swap$ -- 44
    4040text.length$ -- 0
    4141text.prefix$ -- 0
    4242top$ -- 0
    43 type$ -- 48
     43type$ -- 54
    4444warning$ -- 2
    45 while$ -- 42
    46 width$ -- 15
    47 write$ -- 134
     45while$ -- 44
     46width$ -- 16
     47write$ -- 140
    4848(There were 2 warnings)
  • docs/Working/re/re-main.log

    r3149 r3154  
    1 This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) (format=pdflatex 2013.1.29)  16 MAY 2013 17:20
     1This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) (format=pdflatex 2013.1.29)  18 MAY 2013 00:16
    22entering extended mode
    33 %&-line parsing enabled.
     
    274274
    275275{/var/lib/texmf/fonts/map/pdftex/updmap/pdftex.map}]
    276 LaTeX Font Info:    Try loading font information for OMS+cmr on input line 123.
     276LaTeX Font Info:    Try loading font information for OMS+cmr on input line 119.
    277277
    278278
     
    281281)
    282282LaTeX Font Info:    Font shape `OMS/cmr/m/n' in size <10> not available
    283 (Font)              Font shape `OMS/cmsy/m/n' tried instead on input line 123.
     283(Font)              Font shape `OMS/cmsy/m/n' tried instead on input line 119.
    284284 [2] [3] (./re-main.bbl
    285285[4]) [5] (./re-main.aux) )
    286286Here is how much of TeX's memory you used:
    287  487 strings out of 493848
    288  5012 string characters out of 1152823
     287 488 strings out of 493848
     288 5035 string characters out of 1152823
    289289 55050 words of memory out of 3000000
    290  3812 multiletter control sequences out of 15000+50000
     290 3813 multiletter control sequences out of 15000+50000
    291291 9186 words of font info for 32 fonts, out of 3000000 for 9000
    292292 714 hyphenation exceptions out of 8191
    293  23i,6n,19p,193b,209s stack positions out of 5000i,500n,10000p,200000b,50000s
     293 23i,6n,19p,193b,207s stack positions out of 5000i,500n,10000p,200000b,50000s
    294294</usr/share/texmf-texlive/fonts/type1/public/amsfonts
    295295/cm/cmbx12.pfb></usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmbx9.p
     
    304304e/fonts/type1/public/amsfonts/cm/cmti10.pfb></usr/share/texmf-texlive/fonts/typ
    305305e1/public/amsfonts/cm/cmti9.pfb>
    306 Output written on re-main.pdf (5 pages, 191559 bytes).
     306Output written on re-main.pdf (5 pages, 191719 bytes).
    307307PDF statistics:
    308308 70 PDF objects out of 1000 (max. 8388607)
  • docs/Working/re/re-main.tex

    r3149 r3154  
    2222\begin{abstract}
    2323
    24 An parallel regular expression matching pattern method is
    25 introduced and compared with the state of the art in software tools designed for
    26 efficient on-line search. %such as the {\em grep} family pattern matching tools.
     24A parallel regular expression matching pattern method is introduced and compared
     25with existing approaches for efficient on-line search.
    2726The method is based on the concept of bit-parallel data streams,
    2827in which parallel streams of bits are formed such
     
    3029character code units of a source data stream.
    3130
    32 An implementation of the method in the form of a regular expression
     31An implementation of the techniques in the form of a regular expression
    3332compiler is discussed. The compiler accepts a regular expression and
    34 forms unbounded bit-parallel data stream operations. Bit-parallel operations
     33forms unbounded bit-parallel data stream statements. Bit-parallel operations
    3534are then transformed into a low-level C-based implementation for compilation into native
    36 pattern matching applications. These low-level C-based implementations take advantage of
     35pattern matching application code. These low-level C-based implementations take advantage of
    3736the SIMD (single-instruction multiple-data) capabilities of commodity
    3837processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
    3938On processors supporting W-bit addition operations, the method processes W source characters
    4039in parallel and performs up to W finite state transitions per clock cycle.
    41 
    42 We introduce a new bit-parallel scanning primitive, {\em Match Star}.
    43 which performs parallel Kleene closure over character classes
    44 and without eliminates backtracking. % expand and rephrase description of Match Star
     40We further improve our methods through the introduce a new bit-parallel scanning primitive, {\em Match Star}.
     41which performs a parallel Kleene closure operation over character classes
     42and without backtracking. % expand and rephrase description of Match Star
    4543
    4644We evaluate the performance of our method in comparison with several widely known {\em grep}
    47 family implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep},
    48 and regular expression engines such as {\em Google's re2}.
     45implemenations, {\em Gnu grep}, {\em agrep}, {\em nr-grep}.
    4946Performance results are analyzed using the performance monitoring counters
    50 of commodity hardware. Overall, our results demonstrate a dramatic speed-up
    51 over publically available alternatives.
     47of commodity hardware. Overall, our results demonstrate dramatic speed-ups
     48over other publically available software.
    5249
    5350\end{abstract}
     
    5855
    5956Regular expresssion matching is an extensively studied problem with application to
    60 text processing and bioinformatics and with numerous algorithms
    61 and software tools developed to the address the particular
    62 processing demands. % reword
     57various problem domains, for example, text-processing and bioinformatics. A multitude of algorithms
     58and software tools exist to address the specific demands of the various problem domains. % reword
    6359
    6460The pattern matching problem can be
     
    7874classical texts such as \cite{aho2007}.
    7975
    80 Regular expression matching is commonly performed using a wide variety of
    81 publically available software tools for on-line pattern matching. For instance,
    82 UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
     76Regular expression matching is typically performed using a wide variety of
     77publically available software tools designed for efficient on-line pattern matching.
     78For example, UNIX grep, Gnu grep, agrep, cgrep, nr-grep, and Perl regular
    8379expressions \cite{abou-assaleh2004}. Amongst these tools Gnu grep (egrep),
    84 agrep, and nrgrep are widely known and considered as
    85 the fastest regular expression matching tools in practice \cite{navarro2000}.
    86 and are of particular interest to this study.
    87 
    88 % simple patterns, extended patterns, regular expressions
    89 
    90 % motivation / previous work
    91 Although tradi finite state machine methods used in the scanning and parsing of
     80agrep, and nr-grep are widely considered the fastest regular expression
     81matching tools in practice \cite{navarro2000}.
     82
     83% source: Parallel Scanning with BitStream Addition: An XML Case Study
     84Although traditional finite state machine methods used in the scanning and parsing of
    9285text streams is considered to be the hardest of the “13 dwarves” to parallelize
    93 [1], parallel bitstream technology shows considerable promise for these types of
    94 applications [3, 4]. In this approach, character streams are processed W positions
    95 at a time using the W-bit SIMD registers commonly found on commodity processors
    96 (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
    97 first slicing the byte streams into eight separate basis bitstreams, one for each bit
     86[1], bit-parallel data stream technology shows considerable promise as a general
     87approach to regular expression matching. In this approach, character
     88streams are processed W positions at a time using the W-bit SIMD registers
     89commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips).
     90This is achieved by first slicing the byte streams into eight separate basis bitstreams, one for each bit
    9891position within the byte. These basis bitstreams are then combined with bitwise
    9992logic and shifting operations to compute further parallel bit streams of interest.
    10093
    101 We further increase the parallelism in our methods by introducing a new parallel
    102 scanning primitive which we have coined Match Star. Match Star returns all matches
    103 in a single operation and eliminates backtracking
    104 when a partially successful search path fails.
     94We further increase the performance of our approach with the introduction of a new parallel
     95scanning primitive coined {\em Match Star}. % expand and reword desc. of Match Star
    10596
    10697The remainder of this paper is organized as follows.
    10798Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
    108 Section~\ref{Background} presents background material on classical and state-of-the-art approaches
    109 to high performance regular expression matching. In addition, this section provides insight into the
    110 efficiency of traditional on-line pattern matching tools. Next,
    111 Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
     99Section~\ref{Background} presents background material on classical automata-based approaches 
     100to regular expression matching and describes the {\em grep} family utilities.
     101Next, section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
    112102Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
    113103Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
    114104presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
    115 Gnu grep, agrep, and nr-grep, and re2. Section~\ref{Conclusion} concludes the paper.
     105several software tools. Section~\ref{Conclusion} concludes the paper.
    116106
    117107The fundamental contribution of this paper is fully general approach to regular expression matching
    118 using bit-parallel data streams. The algorithmic aspects of this paper build upon
    119 the fundamental concepts of our previous work
    120 in some cases \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
    121  Individual contributions include:
     108using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
     109the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
     110Specific contributions include:
    122111\begin{itemize}
    123112\item compilation of regular expressions into unbounded bit-parallel data stream equations;
    124113\item documentation of character classes compilation into bit-parallel character class data streams;
    125 % \item methods to select optimal subpatterns;
    126 \item bit-parallel and backtrack-free scanning primitive termed Match Star;
    127 \item bit-parallel data stream support for unicode.
     114\item Match Star parallel scanning primitive;
     115\item efficient support for unicode characters.
    128116\end{itemize}
    129117
    130118\section{Basic Concepts}
    131119\label{Basic Concepts}
    132 We define the notations and basic concepts used throughout this paper.
     120In this section, we define the notation and basic concepts used throughout this paper.
    133121
    134122\subsection{Notation}
    135123We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$
    136124be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
    137 be a text of length n defined over a finite alphabet $sigma$ of size $alpha$.
    138 The task of regular expression matching
    139 is to find all the text positions of T that follow an occurrence of P. We use
    140 C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
     125be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$.
     126The goal of simple pattern expression matching is to find all the text positions of T that follow
     127an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
     128
     129We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
    141130represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent
    142131logical left shift, and logical right shift, respectively and are further modulated by
     
    146135\subsection{Regular Expressions}
    147136
    148 % TODO - Define in terms of recursive defintion, include character classes, bounded (repeated) repetition
     137% Define regular expressions (a recursive def), character classes, bounded repetition
    149138
    150139\section{Background}
     
    152141%\input{background.tex}
    153142
    154 \subsection{Classical Methods}
    155 
    156 \subsection{Regular Expression and Finite Automata}
     143\subsection{Regular Expressions and Finite Automata}
    157144
    158145The origins of regular expression matching date back to automata theory
     
    169156
    170157Thompson's original work marked the beginning of a long line of
    171 regular expression implementations that
    172 process an input string, character-at-a-time, and that transition patterm matching state
    173 based on the current state and the next character read.
    174 
    175 The Boyer-Moore family of algorithms \cite{boyer1977fast} , Horspool, ... skip characters
    176 
    177 Suffix automata (BDM)
    178 
    179 The ideas presented up to now aim at a good implementation of the automa-
    180 ton, but they must inspect all the text characters. In many cases, however, the
    181 regular expression involves sets of relatively long substrings that must appear
    182 for the regular expression to match.
     158regular expression pattern matching methods that
     159process an input string, character-at-a-time,
     160and that transition from state to state
     161according to the current state and the next input character.
     162
     163Whereas traditional automata techniques acheive
     164O(n) worst-case optimal efficiency, simple string matching algorithms,
     165such as the Boyer-Moore family of algorithms, skip input characters
     166to achieve sublinear times in the average case \cite{boyer1977fast}.
     167
     168Boyer-Moore methods, begin comparison from the end of the pattern instead of the
     169beginning and precompute skip information
     170to determine how far ahead a pattern search can skip in the input whenever
     171a non-matching character is encountered. Generally, Boyer-Moore family
     172algorithms improve faster as the pattern being searched for becomes longer.
     173In many cases, the techniques used to skip characters in simple string matching
     174approaches can be extended to regular expression matching.
     175Widely known techniques used to facilitate character skipping in regular
     176expression matching include necessary strings and backward suffix matching
     177inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
    183178
    184179\subsection{Bit-parallel Simulation of Automata}
    185180
    186 Define bit-parallelism \cite{navarro2002flexible}
    187 
    188 Shift-Or \cite{wu1992fast}
    189 
    190 Backward Dawg Matching (BDM) \cite{navarro1998bit}
     181bit-parallelism \cite{Baeza-yates_anew}
     182
     183Shift-Or / Shift-And \cite{wu1992fast}
    191184
    192185Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
    193186
    194 Ability to skip characters pattern length (occurence frequency and length).
    195 
    196 \subsection{Software Tools}
     187i.e. bit parallel simulation of automata
     188
     189\subsection{Software Tools} % Methodology section?
     190
     191%This subsection relates the performance of well known tools (Gnu grep, agrep, nr-grep)
     192%to the above pattern matching algorithms.
     193
     194%Source: http://pages.cs.wisc.edu/~mdant/cs520_4.html
    197195
    198196%Thompson created the first grep (UNIX grep) as a standalone adaptation
     
    204202%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
    205203%It took zero set-up time and just one additional test in the inner loop.
    206 %http://pages.cs.wisc.edu/~mdant/cs520_4.html
    207 
    208 %Given a regular expression R and a test T the regular expression matching
    209 %problem finds all ending position of substrings in Q that matches a string in
    210 %the language denoted by R.
    211 %The behaviour of Gnu grep, agrep, and nr-grep are differ ...
     204
     205%Source: Chapter 7 Flexible Pattern Matching \cite{navarro2000}. \cite{navarro2000}
     206
    212207%Gnu grep
    213208%agrep
     
    215210%re2
    216211
    217 
    218212\section{Bit-parallel Data Streams}
    219213\label{Bit-parallel Data Streams}
     214
     215% Contrast 'bit-parallel data streams' with the bit-parallel simulation of automata.
    220216
    221217The bit-parallel data streams use the wide
     
    225221
    226222A signficant advantage of the bit-parallel
    227 data stream method over other
     223data stream methods over other
    228224pattern matching methods that rely on
    229225bit-parallel automata simulation
    230 is the potential to skip full register width
    231 number of characters in low occurence frequency text. % reword
    232 
    233 
    234 Skip characters register width.
     226is the potential to skip register width
     227characters. % reword
     228
     229Skip SIMD register width.
    235230
    236231\subsection{Match Star}
    237232
    238 %Wikipedia
     233%Source: http://en.wikipedia.org/wiki/Backtracking
    239234Backtracking is a general algorithm for finding all solutions to some computational problem,
    240235that incrementally builds candidates to the solutions.
     
    242237\section{Compiler Technology}
    243238\label{Compiler Technology}
     239%\input{compilertechnology.tex}
    244240
    245241\section{Methodology}
     
    247243%\input{methodology.tex}
    248244
    249 We compare the performance of our bit-parallel data stream techniques against
     245We compare the performance of our bit-parallel data stream techniques against
     246several fast regular expression matching implemenations,
    250247Gnu grep, agrep, nr-grep, and re2.
    251 
    252 
    253248
    254249\section{Experimental Results}
  • docs/Working/re/re-main.tex.backup

    r3149 r3154  
    4242We introduce a new bit-parallel scanning primitive, {\em Match Star}.
    4343which performs parallel Kleene closure over character classes
    44 and without eliminates backtracking. % expand and rephrase description of Match Star
     44and without backtracking. % expand and rephrase description of Match Star
    4545
    4646We evaluate the performance of our method in comparison with several widely known {\em grep}
     
    9999logic and shifting operations to compute further parallel bit streams of interest.
    100100
    101 We further increase the parallelism in our methods by introducing a new parallel
    102 scanning primitive which we have coined Match Star. Match Star returns all matches
    103 in a single operation and eliminates backtracking
    104 when a partially successful search path fails.
     101We further increase the potential parallelism in our approach by introducing a new parallel
     102scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star
    105103
    106104The remainder of this paper is organized as follows.
    107105Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
    108 Section~\ref{Background} presents background material on classical and state-of-the-art approaches
    109 to high performance regular expression matching. In addition, this section provides insight into the
    110 efficiency of traditional on-line pattern matching tools. Next,
    111 Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
     106Section~\ref{Background} presents background material on classical automata-based approaches
     107to regular expression matching as well as describes the efficient {\em grep} family utilities
     108and regular expression pattern matching software libraries.
     109Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
    112110Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
    113111Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
    114112presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
    115 Gnu grep, agrep, and nr-grep, and re2. Section~\ref{Conclusion} concludes the paper.
     113several software tools. Section~\ref{Conclusion} concludes the paper.
    116114
    117115The fundamental contribution of this paper is fully general approach to regular expression matching
    118 using bit-parallel data streams. The algorithmic aspects of this paper build upon
    119 the fundamental concepts of our previous work
    120 in some cases \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
    121  Individual contributions include:
     116using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
     117the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
     118Specific contributions include:
    122119\begin{itemize}
    123120\item compilation of regular expressions into unbounded bit-parallel data stream equations;
    124121\item documentation of character classes compilation into bit-parallel character class data streams;
    125 % \item methods to select optimal subpatterns;
    126 \item bit-parallel and backtrack-free scanning primitive termed Match Star;
    127 \item bit-parallel data stream support for unicode.
     122\item the Match Star parallel scanning primitive;
     123\item efficient support for unicode characters.
    128124\end{itemize}
    129125
    130126\section{Basic Concepts}
    131127\label{Basic Concepts}
    132 We define the notations and basic concepts used throughout this paper.
     128In this section, we define the notation and basic concepts used throughout this paper.
    133129
    134130\subsection{Notation}
    135131We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$
    136132be a pattern of length m and $T=t_{1}t_{2}\ldots{}t_{n}$
    137 be a text of length n defined over a finite alphabet $sigma$ of size $alpha$.
    138 The task of regular expression matching
    139 is to find all the text positions of T that follow an occurrence of P. We use
    140 C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$, $\ll{}$, and $\lgg{}$
    141 represent bitwise NOT, OR, AND, XOR, k-bit left shift and k-bit right shift, respectively.
     133be a text of length n both defined over a finite alphabet $sigma$ of size $alpha$.
     134The goal of simple pattern expression matching is to find all the text positions of T that follow
     135an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
     136
     137We use C notations to represent bitwise operations $\lnot{}$, $\lor{}$, $\land{}$, $\oplus{}$
     138represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}k$, and $\gg{}$ represent
     139logical left shift, and logical right shift, respectively and are further modulated by
     140the number of bit positions a given value shall be shifted by, for example ``shift left by n''.
     141Vacant bit-positions are filled in with zeroes.
    142142
    143143\subsection{Regular Expressions}
    144144
    145 % TODO
     145% Define regular expressions (a recursive def), character classes, bounded repetition
    146146
    147147\section{Background}
     
    166166
    167167Thompson's original work marked the beginning of a long line of
    168 regular expression implementations that
    169 process an input string, character-at-a-time, and that transition patterm matching state
    170 based on the current state and the next character read.
    171 
    172 The Boyer-Moore family of algorithms \cite{boyer1977fast} , Horspool, ... skip characters
    173 
    174 Suffix automata (BDM)
    175 
    176 The ideas presented up to now aim at a good implementation of the automa-
    177 ton, but they must inspect all the text characters. In many cases, however, the
    178 regular expression involves sets of relatively long substrings that must appear
    179 for the regular expression to match.
     168regular expression pattern matching methods that
     169process an input string, character-at-a-time,
     170and that transition from state to state
     171according to the current state and the next input character.
     172
     173Whereas traditional automata techniques acheive
     174O(n) worst-case optimal efficiency, simple string matching algorithms,
     175such as the Boyer-Moore family of algorithms, skip input characters
     176to achieve sublinear times in the average case \cite{boyer1977fast}.
     177
     178Boyer-Moore methods, begin comparison from the end of the pattern instead of the
     179beginning and precompute skip information
     180to determine how far ahead a pattern search can skip in the input whenever
     181a non-matching character is encountered. Generally, Boyer-Moore family
     182algorithms improve faster as the pattern being searched for becomes longer.
     183In many cases, the techniques used to skip characters in simple string matching
     184approaches can be extended to regular expression matching.
     185Widely known techniques used to facilitate character skipping in regular
     186expression matching include necessary strings and backward suffix matching
     187inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
     188
    180189
    181190\subsection{Bit-parallel Simulation of Automata}
    182191
    183 Define bit-parallelism \cite{navarro2002flexible}
    184 
    185 Shift-Or \cite{}
    186 
    187 Backward Dawg Matching (BDM) \cite{navarro1998bit}
     192Define bit-parallelism \cite{Baeza-yates_anew}
     193
     194Shift-Or / Shift-And \cite{wu1992fast}
    188195
    189196Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) \cite{navarro1998bit} algorithm)
    190 
    191 Skip characters pattern length (occurence frequency and length).
    192 
    193 
    194 \section{Bit-parallel Data Streams}
    195 \label{Bit-parallel Data Streams}
    196 
    197 The bit-parallel data streams use the wide
    198 SIMD registers commonly found on commodity processors
    199 to process  byte positions at a time using
    200 bitwise logic, shifting and other operations.
    201 
    202 A signficant advantage of the bit-parallel
    203 data stream method over other
    204 pattern matching methods that rely on
    205 bit-parallel automata simulation
    206 is the potential to skip full register width
    207 number of characters in low occurence frequency text. % reword
    208 
    209 
    210 Skip characters register width.
    211197
    212198\subsection{Software Tools}
     
    231217%re2
    232218
     219
     220\section{Bit-parallel Data Streams}
     221\label{Bit-parallel Data Streams}
     222
     223The bit-parallel data streams use the wide
     224SIMD registers commonly found on commodity processors
     225to process  byte positions at a time using
     226bitwise logic, shifting and other operations.
     227
     228A signficant advantage of the bit-parallel
     229data stream method over other
     230pattern matching methods that rely on
     231bit-parallel automata simulation
     232is the potential to skip full register width
     233number of characters in low occurence frequency text. % reword
     234
     235
     236Skip characters register width.
     237
    233238\subsection{Match Star}
    234239
  • docs/Working/re/reference.bib

    r3149 r3154  
    5353  year={2002},
    5454  publisher={Cambridge University Press}
     55}
     56
     57@ARTICLE{Navarro02patternmatching,
     58    author = {Gonzalo Navarro},
     59    title = {Pattern Matching},
     60    journal = {Journal of Applied Statistics},
     61    year = {2002},
     62    volume = {31}
     63}
     64
     65@MISC{Baeza-yates_anew,
     66    author = {Ricardo A. Baeza-yates and Blanco Encalada and Gaston H. Gonnet},
     67    title = {A New Approach to Text Searching},
     68    year = {}
    5569}
    5670
Note: See TracChangeset for help on using the changeset viewer.