Changeset 3645 for docs/Working/re


Timestamp:
Feb 24, 2014, 11:11:19 AM
Author:
ksherdy
Message:

Improved introduction section.

Location:
docs/Working/re
Files:
3 edited

  • docs/Working/re/abstract.tex

    r3623 r3645  
    1414of Intel AVX2 or future 512-bit extensions.   Our AVX2 implementation
    1515showed dramatic reduction in instruction count and significant
    16 improvement in speed.   Our GPU implementations show substantial
     16improvement in speed.   Our GPGPU implementations show substantial
    1717further acceleration. 
  • docs/Working/re/re-main.tex

    r3642 r3645  
    4141is to convert an NFA into a DFA. A DFA has a single active state at any time
    4242throughout the matching process and
    43 hence it is possible to search a text at of length $n$ in $O(n)$ time
    44 \footnote{It is well known that the conversion of an NFA to an equivalent DFA may result
     43hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well
     44known that the conversion of an NFA to an equivalent DFA may result
    4545in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
    4646
    47 A significant portion of the research in fast regular expression matching can be
     47A significant proportion of the research in fast regular expression matching can be
    4848regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
    4949In \cite{baeza1992new}, Baeza-Yates and Gonnet
    50 presented a new approach to string search based on bit-parallelism.
     50presented a new approach to string search based on bit-level parallelism.
    5151This technique takes advantage of the intrinsic parallelism of bitwise operations
    5252within a computer word.
     
    5454performs can be reduced by a factor $w$.
    5555Using this fact, the Shift-Or algorithm simulates an NFA using
    56 bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
    57 A disadvantage of the bit-parallel Shift-Or pattern matching approach
     56bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
     57A disadvantage of the Shift-Or approach
    5858is an inability to skip input characters.
    5959Simple string matching algorithms,
    6060such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical} skip input characters
    6161to achieve sublinear times in the average case.
    62 Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
    63 based on suffix automata are able to skip characters.
     62% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
     63% based on suffix automata are able to skip characters.
    6464The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
    6565combines the bit-parallel advantages of the Shift-Or approach
    66 with the character skipping property of BDM algorithms. The nrgrep pattern matching tool,
    67 is built over the BNDM algorithm. Prior to the data-parallel approach to simultaneous
    68 processing of data stream elements as discussed herein, nrgrep
     66with the ability to skip characters. %character skipping property of BDM algorithms.
     67The nrgrep pattern matching tool
     68is based on the BNDM algorithm. Prior to the bitwise
     69data parallel approach presented herein, nrgrep
    6970was by far the fastest grep tool
    70 for matching complex patterns and achieved similar performance
     71for matching complex patterns, and achieved similar performance
    7172to that of the fastest existing string
    7273matching tools for simple patterns \cite{navarro2000}.
    7374
    74 There has been considerable
     75There has been considerable recent
    7576interest in accelerating regular expression matching
    7677on parallel hardware
    77 such as multi-core processors (CPUs), graphics processing units (GPUs),
    78 field-programmable gate arrays (FPGAs), and specialized architectures such as
     78such as multicore processors (CPUs),
     79general purpose graphics processing units (GPGPUs),
     80field-programmable gate arrays (FPGAs),
     81and specialized architectures such as
    7982the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
    8083%CPU
     
    8790Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
    8891string matching algorithm \cite{aho1975} is well suited for parallel
    89 implementation on multi-core CPUs, GPUs and the Cell BE.
     92implementation on multicore CPUs, GPGPUs and the Cell BE.
    9093On each platform, both thread-level parallelism (cores) and data-level parallelism
    9194(SIMD units) were leveraged for performance.
     
    108111In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
    109112implementation of the AC algorithm for
    110 accelerating string matching on GPUs. Lin et al., proposed
     113accelerating string matching on GPGPUs. Lin et al. proposed
    111114the Parallel Failureless Aho-Corasick (PFAC)
    112 algorithm to accelerate pattern matching on GPU hardware and
     115algorithm to accelerate pattern matching on GPGPU hardware and
    113116achieved 143 Gbps throughput, 14.74 times faster
    114117than the AC algorithm performed on a four core
     
    118121based on adapting traditional sequential algorithms to emergent
    119122parallel architectures, we introduce both a new algorithmic
    120 approach and its implementation on SIMD and GPU architectures.
     123approach and its implementation on SIMD and GPGPU architectures.
    121124This approach relies on a bitwise data parallel view of text
    122125streams as well as a surprising use of addition to match
     
    134137in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
    135138for example, to perform
    136 synchronized 4096-bit addition on GPU warps of 64 threads.
     139synchronized 4096-bit addition on GPGPU warps of 64 threads.
    137140
    138141There is also a strong keyword match between the bit-parallel
     
    156159notation and the grep problem.
    157160Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
    158 using a model of arbitrary-length bit-parallel data streams.
     161primitive using a model of arbitrary-length bit-parallel data streams.
    159162Section \ref{sec:blockwise} discusses the block-by-block
    160163implementation of our techniques including the long stream
     
    274277We also have adapted our long-stream addition technique
    275278to perform 4096-bit additions using 64 threads working in lock-step
    276 SIMT fashion.  A similar technique is known to the GPU programming
     279SIMT fashion.  A similar technique is known to the GPGPU programming
    277280community\cite{}.   
    278281 
     
    430433chain is to incorporate the technique of long-stream addition described below.
    431434Otherwise, we were able to use Pablo directly in compiling our
    432 SSE2 and AVX2 implementations.   Our GPU implementation required
     435SSE2 and AVX2 implementations.   Our GPGPU implementation required
    433436some scripting to modify the output of the Pablo compiler for our
    434437purpose.
     
    552555
    553556As described subsequently, we use a two-level long-stream addition technique
    554 in both our AVX2 and GPU implementations.  In principle, one can extend
     557in both our AVX2 and GPGPU implementations.  In principle, one can extend
    555558the technique to additional levels.  Using 64-bit adders throughout,
    556559$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
     
    562565Using the methods outlined, it is quite conceivable that instruction
    563566set extensions to support long-stream addition could be added for
    564 future SIMD and GPU processors.   Given the fundamental nature
     567future SIMD and GPGPU processors.   Given the fundamental nature
    565568of addition as a primitive and its particular application to regular
    566569expression matching as shown herein, it seems reasonable to expect
     
    578581
    579582
    580 \section{GPU Implementation}\label{sec:GPU}
     583\section{GPGPU Implementation}\label{sec:GPU}
    581584
    582585To further assess the scalability of our regular expression matching
     
    589592the 64 work groups.  Each work group carries out the regular
    590593expression matching operations 4096 bytes at a time using SIMT
    591 processing.   Although the GPU
     594processing.   Although the GPGPU
    592595does not directly support the mask and spread operations required
    593596by our long-stream addition model,
     
    597600synchronized updates with the other threads using a six-step
    598601parallel-prefix style process.  Others have implemented
    599 long-stream addition on the GPU using similar techniques,
     602long-stream addition on the GPGPU using similar techniques,
    600603as noted previously.
    601604
     
    604607pinned memory to take advantage of the zero-copy memory regions
    605608where data can be read directly into this region by CPU
    606 and also accessed by GPU for further processing. Therefore,
     609and also accessed by GPGPU for further processing. Therefore,
    607610the expensive data transfer time required by traditional
    608 discrete GPUs is hidden and we compare only the kernel execution
     611discrete GPGPUs is hidden and we compare only the kernel execution
    609612time with our SSE2 and AVX implementations as shown in Figure
    610 \ref{fig:SSE-AVX-GPU}. The GPU version gives 30\% to 55\% performance
     613\ref{fig:SSE-AVX-GPU}. The GPGPU version gives 30\% to 55\% performance
    611614improvement over the SSE version and 10\% to 40\% performance
    612615improvement over the AVX version. Although we intended to process
     
    617620and not all the work groups can be scheduled at the same time.
    618621The second reason is that the long-stream addition implemented
    619 on GPU is more expensive than the implementations on SSE or AVX.
     622on the GPGPU is more expensive than the implementations on SSE or AVX.
    620623Another important reason is the control flow. When a possible
    621624match is found in one thread, the rest of the threads in the