# Changeset 3645 for docs

Timestamp:
Feb 24, 2014, 11:11:19 AM
Message:

Improved introduction section.

Location:
docs/Working/re
Files:
3 edited

is to convert an NFA into a DFA. A DFA has a single active state at any time throughout the matching process and hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result in state explosion. That is, the number of resultant DFA states may increase exponentially.}. A significant proportion of the research in fast regular expression matching can be regarded as ``the quest for efficient automata'' \cite{navarro98fastand}. In \cite{baeza1992new}, Baeza-Yates and Gonnet presented a new approach to string search based on bit-level parallelism. This technique takes advantage of the intrinsic parallelism of bitwise operations within a computer word. Given a $w$-bit word, the number of operations that a string search algorithm performs can be reduced by a factor $w$. Using this fact, the Shift-Or algorithm simulates an NFA using bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}. A disadvantage of the Shift-Or approach is an inability to skip input characters. Simple string matching algorithms, such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters to achieve sublinear times in the average case.
% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
% based on suffix automata are able to skip characters.
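The Shift-Or simulation mentioned above is compact enough to sketch directly. The following is a minimal Python illustration of the classical algorithm (not code from this changeset); Python's arbitrary-precision integers stand in for the machine word:

```python
def shift_or_search(pattern, text):
    """Report start positions of matches using the Shift-Or algorithm.

    The NFA state is packed into one integer: bit i is 0 iff the first
    i+1 pattern characters match a suffix of the text scanned so far.
    """
    m = len(pattern)
    full = (1 << m) - 1
    # B[c] has a 0 bit at position i iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, full) & ~(1 << i)
    D = full  # all states inactive
    matches = []
    for pos, c in enumerate(text):
        D = ((D << 1) | B.get(c, full)) & full
        if D & (1 << (m - 1)) == 0:  # final state active: match ends here
            matches.append(pos - m + 1)
    return matches
```

Each input character costs one shift and one OR regardless of how many NFA states are active, which is the source of the $O(\frac{nm}{w})$ bound.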
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} combines the bit-parallel advantages of the Shift-Or approach with the ability to skip characters. The nrgrep pattern matching tool is based on the BNDM algorithm. Prior to the bitwise data parallel approach presented herein, nrgrep was by far the fastest grep tool for matching complex patterns, and achieved similar performance to that of the fastest existing string matching tools for simple patterns \cite{navarro2000}. There has been considerable recent interest in accelerating regular expression matching on parallel hardware such as multicore processors (CPUs), general purpose graphics processing units (GPGPUs), field-programmable gate arrays (FPGAs), and specialized architectures such as the Cell Broadband Engine (Cell BE).
% FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC) string matching algorithm \cite{aho1975} is well suited for parallel implementation on multicore CPUs, GPGPUs and the Cell BE. On each hardware, both thread-level parallelism (cores) and data-level parallelism (SIMD units) were leveraged for performance. In more recent work, Tumeo et al.
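The character-skipping behaviour that distinguishes BNDM from Shift-Or can likewise be sketched. This is the textbook formulation of the algorithm, not nrgrep's implementation:

```python
def bndm_search(pattern, text):
    """Report start positions of matches using the BNDM algorithm.

    Each window is scanned right to left; bit i of B[c] is set iff c
    occurs at pattern position m-1-i, so D tracks which pattern factors
    are still alive and lets the search skip ahead when none survive.
    """
    m, n = len(pattern), len(text)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    matches = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = (1 << m) - 1
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):
                if j > 0:
                    last = j            # a pattern prefix ends here
                else:
                    matches.append(pos)  # whole pattern matched
                    break
            D = (D << 1) & ((1 << m) - 1)
        pos += last  # skip: shift window past positions ruled out
    return matches
```

The window advances by `last`, the start of the longest pattern prefix seen, so windows containing no pattern factor are skipped almost entirely.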
\cite{tumeo2010efficient} presented a chunk-based implementation of the AC algorithm for accelerating string matching on GPGPUs. Lin et al. proposed the Parallel Failureless Aho-Corasick (PFAC) algorithm to accelerate pattern matching on GPGPU hardware and achieved 143 Gbps throughput, 14.74 times faster than the AC algorithm performed on a four-core CPU. In contrast to approaches based on adapting traditional sequential algorithms to emergent parallel architectures, we introduce both a new algorithmic approach and its implementation on SIMD and GPGPU architectures. This approach relies on a bitwise data parallel view of text streams as well as a surprising use of addition to match in $\lceil\log_{64}{n}\rceil$ steps. We ultimately apply this technique, for example, to perform synchronized 4096-bit addition on GPGPU warps of 64 threads. There is also a strong keyword match between the bit-parallel notation and the grep problem. Section \ref{sec:bitwise} presents our basic algorithm and MatchStar primitive using a model of arbitrary-length bit-parallel data streams. Section \ref{sec:blockwise} discusses the block-by-block implementation of our techniques, including the long-stream addition technique. We also have adapted our long-stream addition technique to perform 4096-bit additions using 64 threads working in lock-step SIMT fashion. A similar technique is known to the GPGPU programming community\cite{}. The main change to our compilation chain is to incorporate the technique of long-stream addition described below. Otherwise, we were able to use Pablo directly in compiling our SSE2 and AVX2 implementations.
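The MatchStar primitive named above is not defined in this excerpt; the commonly published formulation uses a single addition to propagate match markers through a run of character-class bits, and can be sketched as follows (the exact definition here is an assumption on our part):

```python
def match_star(markers, char_class):
    """MatchStar: advance each marker bit through a run of
    character-class bits using carry propagation from addition.

    Bits are in little-endian stream order: bit i corresponds to
    text position i. Markers inside a class run are carried to the
    position just past the run; markers elsewhere pass through.
    """
    return (((markers & char_class) + char_class) ^ char_class) | markers
```

For example, with a marker at position 0 and a class stream marking a run of three characters, the single addition carries the marker through the whole run at once, rather than one step per character.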
Our GPGPU implementation required some scripting to modify the output of the Pablo compiler for our purpose. As described subsequently, we use a two-level long-stream addition technique in both our AVX2 and GPGPU implementations. In principle, one can extend the technique to additional levels. Using 64-bit adders throughout, $\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition. Using the methods outlined, it is quite conceivable that instruction set extensions to support long-stream addition could be added for future SIMD and GPGPU processors. Given the fundamental nature of addition as a primitive and its particular application to regular expression matching as shown herein, it seems reasonable to expect such extensions.

\section{GPGPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching implementation, we distribute the computation across 64 work groups. Each work group carries out the regular expression matching operations 4096 bytes at a time using SIMT processing. Although the GPGPU does not directly support the mask and spread operations required by our long-stream addition model, each thread can perform synchronized updates with the other threads using a six-step parallel-prefix style process. Others have implemented long-stream addition on the GPGPU using similar techniques, as noted previously. We use pinned memory to take advantage of the zero-copy memory regions where data can be read directly into this region by the CPU and also accessed by the GPGPU for further processing.
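The long-stream addition scheme described above can be modeled in a few lines. The sketch below works on 64-bit chunks in Python; the sequential chunk loop stands in for what SIMD lanes or GPGPU threads would do in parallel, and while the carry/bubble classification follows the description here, the code itself is our illustration, not the paper's kernel:

```python
MASK64 = (1 << 64) - 1

def long_stream_add(a_chunks, b_chunks):
    """Add two numbers given as little-endian lists of 64-bit chunks.

    Instead of rippling carries chunk by chunk, each chunk records
    whether it generates a carry (its sum overflowed) or would
    propagate one (its sum is all ones, a "bubble"); a single
    MatchStar-style step then resolves every carry chain at once.
    """
    n = len(a_chunks)
    sums = []
    carries = bubbles = 0
    for i in range(n):
        s = (a_chunks[i] + b_chunks[i]) & MASK64
        sums.append(s)
        if s < a_chunks[i]:   # carry generated out of chunk i
            carries |= 1 << i
        if s == MASK64:       # chunk i would propagate an incoming carry
            bubbles |= 1 << i
    mask_n = (1 << n) - 1
    m = (carries << 1) & mask_n  # carry-out of chunk i feeds chunk i+1
    increments = ((((m & bubbles) + bubbles) ^ bubbles) | m) & mask_n
    return [(sums[i] + ((increments >> i) & 1)) & MASK64 for i in range(n)]
```

The per-chunk sums are independent; only the carry-resolution step couples the chunks, which is what makes the scheme attractive for lanes or threads that can communicate only occasionally.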
Therefore, the expensive data transfer time needed by traditional discrete GPGPUs is hidden, and we compare only the kernel execution time with our SSE2 and AVX implementations, as shown in Figure \ref{fig:SSE-AVX-GPU}. The GPGPU version gives 30\% to 55\% performance improvement over the SSE version and 10\% to 40\% performance improvement over the AVX version. Although we intended to process all 64 work groups concurrently, one reason for the remaining gap is that not all the work groups can be scheduled at the same time. The second reason is that the long-stream addition implemented on the GPGPU is more expensive than the implementations on SSE or AVX. Another important reason is the control flow. When a possible match is found in one thread, the rest of the threads in the