Changeset 3862

Jun 9, 2014, 11:17:12 PM (5 years ago)

Updates - mostly addressing transposition overhead and analysis of SSE2 results

4 edited


  • docs/Working/re/avx2.tex

    r3659 r3862  
    143143Nevertheless, the overall results on our AVX2 machine were quite encouraging,
    144144demonstrating very good scalability of the bitwise data-parallel approach.
     145Significantly, the @ regular expression is matched at 0.63 cycles/byte
     146using our AVX2 implementation indicating a significant reduction
     147in the overhead cost of the Parabix transform.
  • docs/Working/re/re-main.tex

    r3667 r3862  
    6565combines the advantages of the Shift-Or approach
    6666with the ability to skip characters. %character skipping property of BDM algorithms.
    67 The nrgrep tool is based on the BNDM algorithm. Prior to the bitwise
    68 data parallel approach presented herein, nrgrep
    69 was the fastest grep tool
    70 for matching complex patterns, and achieved similar performance
     67The nrgrep tool is based on the BNDM algorithm. It is generally
     69the fastest grep tool
     70for matching complex patterns, and achieves similar performance
    7171to the fastest existing string
    7272matching tools for simple patterns \cite{navarro2000}.
    9090string matching algorithm \cite{aho1975} is well-suited for parallel
    9191implementation on multicore CPUs, GPGPUs and the Cell BE.
    92 On each hardware, both thread-level parallelism (cores) and data-level parallelism
     92On each system, both thread-level parallelism (cores) and data-level parallelism
    9393(SIMD units) were leveraged for performance.
    9494Salapura et al \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
    1081084 Gbps on the Cell BE.
    109109% GPU
    110 In more recent work, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
     110On GPGPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
    111111implementation of the AC algorithm for
    112112accelerating string matching on GPGPUs. Lin et al., proposed
    114114algorithm to accelerate pattern matching on GPGPU hardware and
    115115achieved 143 Gbps raw data throughput,
    116 although system throughput was limited to 15 Gbps \cite{lin2013accelerating}.
     116although system throughput was limited to 15 Gbps
     118Most recently, Mytkowicz et al have developed a method for combining
     119SIMD parallellism and data parallelism
    118121Whereas the existing approaches to parallelization have been
    624627regular expressions is dependent on the number of
    625628long-stream addition operations and the total number of matches
    626 of a given input.
     629of a given input.   Perhaps surprisingly, the overhead of the Parabix
     630transformation was not a dominant factor, coming in at 0.08 ms/MB.
  • docs/Working/re/reference.bib

    r3664 r3862  
    365365  organization={ACM}
     369  author    = {Todd Mytkowicz and
     370               Madanlal Musuvathi and
     371               Wolfram Schulte},
     372  title     = {Data-parallel finite-state machines},
     373  booktitle = {ASPLOS},
     374  year      = {2014},
     375  pages     = {529-542},
     376  ee        = {},
     377  crossref  = {DBLP:conf/asplos/2014},
     378  bibsource = {DBLP,}
     382  author    = {Zhijia Zhao and
     383               Bo Wu and
     384               Xipeng Shen},
     385  title     = {Challenging the "embarrassingly sequential": parallelizing
     386               finite state machine-based computations through principled
     387               speculation},
     388  booktitle = {ASPLOS},
     389  year      = {2014},
     390  pages     = {543-558},
     391  ee        = {},
     392  crossref  = {DBLP:conf/asplos/2014},
     393  bibsource = {DBLP,}
     397  editor    = {Rajeev Balasubramonian and
     398               Al Davis and
     399               Sarita V. Adve},
     400  title     = {Architectural Support for Programming Languages and Operating
     401               Systems, ASPLOS '14, Salt Lake City, UT, USA, March 1-5,
     402               2014},
     403  booktitle = {ASPLOS},
     404  publisher = {ACM},
     405  year      = {2014},
     406  isbn      = {978-1-4503-2305-5},
     407  ee        = {},
     408  bibsource = {DBLP,}
  • docs/Working/re/sse2.tex

    r3666 r3862  
    6161running the 64-bit version of Ubuntu 12.04 (Linux).
     63Our performance evaluation focuses on the running time of
     64the regular expression matching process itself, excluding the
     65preprocessing time for regular expression compilation.   
     67the overhead of the Parabix transform to bit stream form is
     68included in our reported results.
    6471\paragraph{Test Expressions.}
    6673against the five regular expressions shown
    6774in Table \ref{RegularExpressions}. 
    68 @ matches the ampersand character.
     75@ matches the at-sign character.
    6976This expression demonstrates the overhead involved
    7077in matching the simplest possible regular expression, a single character.
    119126each of the grep implementations,
    120127with relative performance reported in CPU cycles per byte.
    121 Both the bitstreams and the nrgrep implementation
    122 outperformed the DFA-based gre2p, with the bitstreams
    123 implementation consistently an order of magnitude faster.
    124 For the Date expression, nrgrep outperformed the bitstreams grep
    125 implementation.  In this case, nrgrep
    126 takes advantage of character skipping. That is,
    127 each time nrgrep encounters a character
    128 that cannot appear in a date
    129 it jumps six character positions rather
    130 than searching every character in the input text.   
    131 For the more complicated expressions nrgrep loses
    132 this performance advantage.
    134 The bitstreams implementation demonstrated stable performance
    135 across each of the test expressions, with increases in
    136 regular expression complexity producing
    137 relatively small increases in CPU cycle counts.
    138 The largest ratio in performance for the bitstreams
    139 implementation was 2 to 1. In comparison,
    140 gre2p exhibited a ratio of 4 to 1,
    141 whereas nrgrep exhibited a ratio of 10 to 1.
     129The performance in matching the @ regular expression establishes
     130the base line cost for regular expression processing.   All programs
     131report 15,788 matching lines of the 1,086,077 lines in the file.
     132The Parabix SSE2 implementation is clearly the fastest in this case with
     133a cost of 0.95 CPU cycles per byte.    The bulk of this represents
     134the overhead of the Parabix transform, the bitwise logic to
     135calculate the single {\tt [@]} character class stream is relatively
     136trivial.   It is interesting to note that
     137this example does not represent a baseline cost for either
     138nrgrep or gre2p, each of these benefit from character skipping
     139optimizations in their implementations.
     141Our results for the matching the Date expression to find the
     142668 lines containing dates show an increase
     143from 0.95 to 1.22 cycles per byte, corresponding to the additional
     144logic for the regular expression matching steps according to our
     145algorithm.    For this relatively simple expression, however, nrgrep
     146outperforms our implementation by taking significant advantage
     147of character skipping.   Each time that nrgrep encounters a character
     148that cannot appear in a date it jumps six character positions rather
     149than searching every character in the input text.  gre2p also shows
     150a significant benefit from the character skipping optimization.
     153The results for the Email expression illustrate the relative
     154advantage of the Parabix method when the expression to be matched
     155does not permit character skipping in the NFA- or DFA-based
     156implementations.   In this example, our implementation outperforms
     157nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
     158lines matching the Email regex.
     162The results for HexBytes expression illustrate Parabix performance
     163involving a Kleene-+ operator compiled to a while loop. 
     164Our implementation uses just 1.49 cycles per byte to find the
     165130,243 matching lines.    Performance is again dramatically better
     166than that of nrgrep or gre2p.
Note: See TracChangeset for help on using the changeset viewer.