Changeset 3665 for docs/Working


Ignore:
Timestamp:
Mar 4, 2014, 4:34:49 PM
Author:
ksherdy
Message:

Minor edits, spelling, grammar, consistency.

Location:
docs/Working/re
Files:
4 edited

  • docs/Working/re/analysis.tex

    r3641 r3665  
    1111The bitstream method starts with a preprocessing step: the compilation
    1212of the
    13 regular expression using the parabix toolchain.
     13regular expression using the Parabix toolchain.
    1414Compilation is an offline process whose time is not counted in our
    1515performance
    16 measures, as parabix is a experimental research compiler that is not
     16measures, as Parabix is an experimental research compiler that is not
    1717optimized.
    1818This leads to a bias in our results, as our timings for nrgrep and grep
     
    2121inputs, so that the text-scanning costs dominate the preprocessing costs.
    2222We furthermore believe that, if a special-purpose optimized compiler for
    23 regular expressions were built, that its inclusion in bitstream grep
     23regular expressions were built, its inclusion in bitstreams grep
    2424would not
    2525substantially increase the running time, particularly for large input
     
    5858and $\log \sigma$ at least 7.
    5959Thus, $m \log \sigma$ will dominate $\log w$
    60 with current and forseeable technology--we do not expect to see $\log w$
     60with current and foreseeable technology--we do not expect to see $\log w$
    6161skyrocket.
    6262So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per
     
    187187The running time is predicted to be a small constant times the
    188188$\frac{m}{k} \geq \frac{m}{4}$ lookups.
    189 The small constant, which we will underapproximate with 4 cycles, is for the
     189The small constant, which we will under-approximate with 4 cycles, is for the
    190190table addressing computation, combining the lookups with boolean OR, and final state
    191191detection and handling.
  • docs/Working/re/conclusion.tex

    r3648 r3665  
    88found in Perl-compatible backtracking implementations.
    99Taking advantage of the SIMD features available on commodity
    10 processors, its implementation in a grep offers consistently
     10processors, its implementation in grep offers consistently
    1111good performance in contrast to available alternatives. 
    1212For moderately complex expressions, 10X or better
     
    4242An important area of future work is to develop and assess
    4343multicore versions of the algorithm to handle regular expression
    44 matching problems involving larger rulesets than are
     44matching problems involving larger rule sets than are
    4545typically encountered in the grep problem.  Such implementations
    4646could have useful application in tokenization and network
  • docs/Working/re/re-main.tex

    r3661 r3665  
    5050presented a new approach to string search based on bit-level parallelism.
    5151This technique takes advantage of the intrinsic parallelism of bitwise operations
    52 within a computer word.
    53 Given a $w$-bit word, the number of operations that a string search algorithms
     52within a computer word. Thus,
     53given a $w$-bit word, the number of operations that a string search algorithm
    5454performs can be reduced by a factor of $w$.
    55 Using this fact, the Shift-Or algorithm simulates an NFA using
     55Building on this observation, the Shift-Or algorithm simulates an NFA using
    5656bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
    5757A disadvantage of the Shift-Or approach
    5858is an inability to skip input characters.
    5959Simple string matching algorithms,
    60 such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical} skip input characters
     60such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters
    6161to achieve sublinear times in the average case.
    6262% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
    6363% based on suffix automata are able to skip characters.
    6464The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
    65 combines the bit-parallel advantages of the Shift-Or approach
     65combines the advantages of the Shift-Or approach
    6666with the ability to skip characters. %character skipping property of BDM algorithms.
    67 The nrgrep pattern matching tool,
    68 is based on the BNDM algorithm. Prior to the bitwise
     67The nrgrep tool is based on the BNDM algorithm. Prior to the bitwise
    6968data parallel approach presented herein, nrgrep
    70 was by far the fastest grep tool
     69was the fastest grep tool
    7170for matching complex patterns, and achieved similar performance
    72 to that of the fastest existing string
     71to the fastest existing string
    7372matching tools for simple patterns \cite{navarro2000}.
    7473
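As context for the Shift-Or discussion above, the recurrence can be sketched in plain C for a single 64-bit word. This is a minimal illustration of the technique, not code from nrgrep or the Parabix toolchain:

```c
#include <stdint.h>
#include <string.h>

/* Single-word Shift-Or search (illustrative sketch).
 * State bit i is 0 iff the pattern prefix of length i+1 ends at the
 * current text position; pattern length is limited to w = 64 bits.
 * Returns the position of the leftmost match, or -1 if none. */
static int shift_or_search(const char *text, const char *pat)
{
    size_t m = strlen(pat);
    if (m == 0 || m > 64)
        return -1;

    uint64_t B[256];                      /* per-character mismatch masks */
    for (int i = 0; i < 256; i++)
        B[i] = ~0ULL;
    for (size_t i = 0; i < m; i++)
        B[(unsigned char)pat[i]] &= ~(1ULL << i);

    uint64_t D = ~0ULL;                   /* no active prefixes yet */
    for (size_t j = 0; text[j] != '\0'; j++) {
        D = (D << 1) | B[(unsigned char)text[j]];
        if ((D & (1ULL << (m - 1))) == 0) /* full pattern matched here */
            return (int)(j - m + 1);
    }
    return -1;
}
```

Each input character costs one shift, one table lookup, and one OR regardless of pattern length (up to $w$), which is the bit-level parallelism referred to above; the inability to skip characters is also visible, since every text position is examined.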
     
    89882.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
    9089Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
    91 string matching algorithm \cite{aho1975} is well suited for parallel
     90string matching algorithm \cite{aho1975} is well-suited for parallel
    9291implementation on multicore CPUs, GPGPUs and the Cell BE.
    9392On each platform, both thread-level parallelism (cores) and data-level parallelism
     
    9594Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
    9695in business analytics applications and leveraged the SIMD hardware available
    97 on multi-core processors to acheive a speedup of greater than 1.8 over a
      96on multi-core processors to achieve a speedup greater than 1.8 over a
    9897range of data sizes of interest.
    9998%Cell
     
    105104implementation that delivered a throughput of 40
    106105Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
    107 Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
     106Gbps for a larger dictionary containing thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
    108107presented a string matching implementation for automata that achieved
    1091084 Gbps on the Cell BE.
     
    209208set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
    210209The alternation operator \verb:|: allows a pattern to be
    211 defined to have to alternative forms, thus \verb:cat|dog:
     210defined to have two alternative forms, thus \verb:cat|dog:
    212211matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
    213212takes precedence over alternation, but parentheses may be
     
    217216Repetition operators may be appended to a pattern to specify
    218217a variable number of occurrences of that pattern. 
    219 The Kleene \verb:*: specifies zero-or-more occurrences
    220 matching the previous pattern, while \verb:+: specifies one-or
     218The Kleene star operator \verb:*: specifies zero-or-more occurrences
     219matching the previous pattern, while Kleene plus \verb:+: specifies one-or-more
    221220occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
    222221both specify the same set: strings of at least one lower-case
     
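The operators described above can be exercised through any regular expression engine; as a small illustration, here is the POSIX extended-RE interface (`regcomp`/`regexec`), which is independent of the tools evaluated in this paper:

```c
#include <regex.h>
#include <stddef.h>

/* Illustration of alternation and the Kleene operators using POSIX
 * extended regular expressions.  Returns 1 if pattern matches
 * somewhere in text, 0 otherwise (a bad pattern counts as no match). */
static int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

For example, `cat|dog` matches any text containing either word, and the anchored patterns `^[a-z]+$` and `^[a-z][a-z]*$` accept exactly the same strings, as noted above.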
    259258implementation may divide an input stream into blocks  and process
    260259the blocks sequentially.   Within each block  all elements of the
    261 input stream are processed together, relying the availability of
     260input stream are processed together, relying on the availability of
    262261bitwise logic and addition scaled to the block size.   On commodity
    263262Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
     
    452451\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
    453452in two vectors to produce a result vector of $f$ 64-bit fields.
    454 \item  \verb#simd<64>::eq(X, -1)# :  comparison of the 64-bit fields
     453\item  \verb#simd<64>::eq(X, -1)#:  comparison of the 64-bit fields
    455454of \verb:X: each with the constant value -1 (all bits 1), producing
    456455an $f$-bit mask value,
    457 \item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit
     456\item  \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
    458457field into a single compressed $f$-bit mask value, and
    459458\item normal bitwise logic operations on $f$-bit masks, and
    460 \item  \verb#simd<64>::spread(x)# : distributing the bits of
     459\item  \verb#simd<64>::spread(X)#: distributing the bits of
    461460an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
    462461\end{itemize}
    463462
    464463In this model, the \verb#hsimd<64>::mask(X)# and
    465 \verb#simd<64>::spread(x)# model the minimum
     464\verb#simd<64>::spread(X)# model the minimum
    466465communication requirements between the parallel processing units
    467466(SIMD lanes or SIMT processors).    In essence, we just need
     
    493492
    494493\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
    495 an incoming carry bit.  This is the {\em bubble mask}.
     494an incoming carry bit.  This is called the {\em bubble mask}.
    496495\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]
    497496
     
    505504with the next digit.  At the incoming position, the carry will
    506505increment the 64-bit digit.   However, if this digit is all ones (as
    507 signalled by the corresponding bit of bubble mask {\tt b}, then the addition
     506signaled by the corresponding bit of the bubble mask {\tt b}), then the addition
    508507will generate another carry.  In fact, if there is a sequence of
    509508digits that are all ones, then the carry must bubble through
     
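The long-stream addition scheme above can be modeled in scalar C, with ordinary integers standing in for the \verb#simd<64># and \verb#hsimd<64># primitives on a vector of $f = 4$ fields. The bubble-propagation step here uses a MatchStar-style $f$-bit addition; this is a sketch of one way to resolve cascading carries under the stated model, not the exact implementation:

```c
#include <stdint.h>

#define F 4  /* number of 64-bit fields in the simulated vector */

/* Model of long-stream addition: add two F*64-bit numbers held as
 * little-endian arrays of 64-bit digits, using only per-field adds
 * plus f-bit mask logic (standing in for mask/spread/eq). */
static void long_add(const uint64_t x[F], const uint64_t y[F], uint64_t r[F])
{
    unsigned c = 0, b = 0;                /* f-bit carry and bubble masks */
    for (int i = 0; i < F; i++) {
        r[i] = x[i] + y[i];               /* simd<64>::add(X, Y) */
        if (r[i] < x[i]) c |= 1u << i;    /* field i produced a carry-out */
        if (r[i] == ~0ULL) b |= 1u << i;  /* simd<64>::eq(R, -1): bubble mask */
    }
    unsigned s = (c << 1) & ((1u << F) - 1);  /* carry enters the next field */
    unsigned incr = (((s & b) + b) ^ b) | s;  /* bubble through all-ones runs */
    for (int i = 0; i < F; i++)           /* simd<64>::spread + increment */
        if (incr & (1u << i)) r[i] += 1;
}
```

The single $f$-bit addition in the `incr` computation is what lets a carry ripple through an arbitrarily long run of all-ones digits in constant time; with $f = 64$ this models the 4096-bit additions discussed below.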
    55855764 groups each performing 4096-bit long additions in a two-level structure.
    559558However, whether there are reasonable architectures that can support fine-grained
    560 SIMT style coordinate at this level is an open question.
     559SIMT-style coordination at this level is an open question.
    561560
    562561Using the methods outlined, it is quite conceivable that instruction
     
    585584We arranged for 64 work groups each having 64 threads.
    586585The work group size and number of work groups are chosen
    587 to provide the best occupancy calculated by AMD App Profiler.
     586to provide the best occupancy as calculated by the AMD App Profiler.
    588587Input files are divided in a data-parallel fashion among
    589588the 64 work groups.  Each work group carries out the regular
     
    603602On the AMD Fusion systems, the input buffer is allocated in
    604603pinned memory to take advantage of the zero-copy memory regions
    605 where data can be read directly into this region by CPU
    606 and also accessed by GPGPU for further processing. Therefore,
    607 the expensive data transferring time that needed by traditional
     604where data can be read directly into this region by the CPU
     605and also accessed by the GPGPU for further processing. Therefore,
     606the expensive data transfer time needed by traditional
    608607discrete GPGPUs is hidden and we compare only the kernel execution
    609608time with our SSE2 and AVX implementations as shown in Figure