# Changeset 3665 for docs

Timestamp:
Mar 4, 2014, 4:34:49 PM
Message:

Minor edits, spelling, grammar, consistency.

Location:
docs/Working/re
Files:
4 edited

r3641

The bitstream method starts with a preprocessing step: the compilation of the regular expression using the Parabix toolchain. Compilation is an offline process whose time is not counted in our performance measures, as Parabix is an experimental research compiler that is not optimized. This leads to a bias in our results, as our timings for nrgrep and grep

inputs, so that the text-scanning costs dominate the preprocessing costs. We furthermore believe that, if a special-purpose optimized compiler for regular expressions were built, its inclusion in bitstreams grep would not substantially increase the running time, particularly for large input

and $\log \sigma$ at least 7. Thus, $m \log \sigma$ will dominate $\log w$ with current and foreseeable technology--we do not expect to see $\log w$ skyrocket. So we can absorb the $\log w$ term and state the work as $O(m \log \sigma)$ per

The running time is predicted to be a small constant times the $\frac{m}{k} \geq \frac{m}{4}$ lookups. The small constant, which we will underapproximate with 4 cycles, is for the table addressing computation, combining the lookups with Boolean OR, and final state detection and handling.
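As a concrete check on the dominance claim above (illustrative numbers only: $\log \sigma \geq 7$ as stated, and $w = 4096$, the largest block size mentioned later in this changeset):

```latex
% Worked instance with assumed values: log(sigma) >= 7, w = 4096.
m \log \sigma \;\geq\; 7m \;\geq\; 14 \quad (m \geq 2),
\qquad
\log w \;=\; \log_2 4096 \;=\; 12 .
```

So for any pattern of more than one character the $m \log \sigma$ term already exceeds $\log w$, which is what justifies absorbing $\log w$ into the $O(m \log \sigma)$ bound.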
r3661

presented a new approach to string search based on bit-level parallelism. This technique takes advantage of the intrinsic parallelism of bitwise operations within a computer word. Thus, given a $w$-bit word, the number of operations that a string search algorithm performs can be reduced by a factor of $w$. Building on this observation, the Shift-Or algorithm simulates an NFA using bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}; a minimal single-word sketch of this technique follows this excerpt. A disadvantage of the Shift-Or approach is an inability to skip input characters. Simple string matching algorithms, such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters to achieve sublinear times in the average case.
% Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
% based on suffix automata are able to skip characters.
The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} combines the advantages of the Shift-Or approach with the ability to skip characters.
%character skipping property of BDM algorithms.
The nrgrep tool is based on the BNDM algorithm. Prior to the bitwise data parallel approach presented herein, nrgrep was the fastest grep tool for matching complex patterns, and achieved similar performance to the fastest existing string matching tools for simple patterns \cite{navarro2000}.

2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}. Naghmouchi et al.\ \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC) string matching algorithm \cite{aho1975} is well-suited for parallel implementation on multicore CPUs, GPGPUs and the Cell BE. On each hardware, both thread-level parallelism (cores) and data-level parallelism

Salapura et al.\ \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions in business analytics applications and leveraged the SIMD hardware available on multi-core processors to achieve a speedup of greater than 1.8 over a range of data sizes of interest.
%Cell implementation that delivered a throughput of 40 Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4 Gbps for a larger dictionary containing thousands of patterns.
Iorio and van Lunteren \cite{iorio2008} presented a string matching implementation for automata that achieved 4 Gbps on the Cell BE.

set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''. The alternation operator \verb:|: allows a pattern to be defined to have two alternative forms, thus \verb:cat|dog: matches either ``\verb:cat:'' or ``\verb:dog:''.
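The single-word Shift-Or sketch promised above (plain C++; the function name and the 64-bit pattern-length cap are assumptions of this sketch, not code from this changeset):

```cpp
#include <cstdint>
#include <string>

// Minimal single-word Shift-Or search: bit i of `state` is 0 exactly
// when pat[0..i] matches the text ending at the current position.
// Returns the start index of the first match, or -1 if none.
long shift_or_search(const std::string &text, const std::string &pat) {
    size_t m = pat.size();
    if (m == 0 || m > 64) return -1;          // one machine word only
    uint64_t mask[256];
    for (auto &b : mask) b = ~0ULL;
    for (size_t i = 0; i < m; i++)            // clear bit i for pat[i]
        mask[(unsigned char)pat[i]] &= ~(1ULL << i);
    uint64_t state = ~0ULL;
    const uint64_t accept = 1ULL << (m - 1);
    for (size_t j = 0; j < text.size(); j++) {
        // One shift and one OR per input character, independent of m.
        state = (state << 1) | mask[(unsigned char)text[j]];
        if ((state & accept) == 0)
            return (long)(j - m + 1);
    }
    return -1;
}
```

A pattern longer than $w$ bits needs $\lceil m/w \rceil$ such state words per input character, which is where the $O(\frac{nm}{w})$ bound comes from.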
Concatenation takes precedence over alternation, but parentheses may be

Repetition operators may be appended to a pattern to specify a variable number of occurrences of that pattern. The Kleene star operator \verb:*: specifies zero-or-more occurrences matching the previous pattern, while Kleene plus \verb:+: specifies one-or-more occurrences. Thus \verb:[a-z][a-z]*: and \verb:[a-z]+: both specify the same set: strings of at least one lower-case

implementation may divide an input stream into blocks and process the blocks sequentially. Within each block all elements of the input stream are processed together, relying on the availability of bitwise logic and addition scaled to the block size. On commodity Intel and AMD processors with 128-bit SIMD capabilities (SSE2),

\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields in two vectors to produce a result vector of $f$ 64-bit fields,
\item \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields of \verb:X: each with the constant value -1 (all bits 1), producing an $f$-bit mask value,
\item \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item \verb#simd<64>::spread(X)#: distributing the bits of an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
\end{itemize}

In this model, the \verb#hsimd<64>::mask(X)# and \verb#simd<64>::spread(X)# operations model the minimum communication requirements between the parallel processing units (SIMD lanes or SIMT processors). In essence, we just need

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with an incoming carry bit. This is called the {\em bubble mask}: $\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$

with the next digit. At the incoming position, the carry will increment the 64-bit digit. However, if this digit is all ones (as signaled by the corresponding bit of bubble mask {\tt b}), then the addition will generate another carry (a scalar model of this long-stream addition follows this excerpt). In fact, if there is a sequence of digits that are all ones, then the carry must bubble through

64 groups each performing 4096-bit long additions in a two-level structure. However, whether there are reasonable architectures that can support fine-grained SIMT style at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction

We arranged for 64 work groups, each having 64 threads. The work group size and the number of work groups are chosen to provide the best occupancy as calculated by the AMD App Profiler. Input files are divided in data-parallel fashion among the 64 work groups.
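The promised scalar model of the carry-bubbling addition (plain C++). It emulates the \verb#simd<64>#/\verb#hsimd<64># primitives with loops over $f = 4$ fields of a 256-bit block; the carry-out identity and the MatchStar-style propagation formula are illustrative assumptions, standard bit tricks rather than code from this changeset:

```cpp
#include <cstdint>

// Scalar model of the long-stream addition sketched above: a 256-bit
// block held as f = 4 little-endian 64-bit fields ("digits").
constexpr int F = 4;

// hsimd<64>::mask(V): gather the high bit of each field into an f-bit mask.
uint64_t mask_high_bits(const uint64_t v[F]) {
    uint64_t m = 0;
    for (int i = 0; i < F; i++) m |= (v[i] >> 63) << i;
    return m;
}

void long_add(const uint64_t x[F], const uint64_t y[F], uint64_t r[F]) {
    uint64_t sum[F], carryout[F];
    uint64_t bubble = 0;                      // f-bit bubble mask
    for (int i = 0; i < F; i++) {
        sum[i] = x[i] + y[i];                 // simd<64>::add(X, Y)
        // High bit is 1 iff field i produced a carry-out (standard
        // identity: majority of the operand high bits and ~sum high bit).
        carryout[i] = (x[i] & y[i]) | ((x[i] | y[i]) & ~sum[i]);
        if (sum[i] == ~0ULL)                  // simd<64>::eq(R, -1)
            bubble |= 1ULL << i;
    }
    uint64_t c = mask_high_bits(carryout);
    // Fields receiving an incoming carry, with carries bubbling through
    // runs of all-ones fields: MatchStar(c * 2, bubble).
    uint64_t m = c << 1;
    uint64_t inc = (((m & bubble) + bubble) ^ bubble) | m;
    for (int i = 0; i < F; i++)               // simd<64>::spread(inc)
        r[i] = sum[i] + ((inc >> i) & 1);
}
```

A single $f$-bit addition in the form `(((m & b) + b) ^ b) | m` propagates a carry across an arbitrarily long run of all-ones fields, so the bubbling never needs a sequential loop over the digits.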
Each work group carries out the regular

On the AMD Fusion systems, the input buffer is allocated in pinned memory to take advantage of the zero-copy memory regions, where data can be read directly into this region by the CPU and also accessed by the GPGPU for further processing. Therefore, the expensive data transfer time that is needed by traditional discrete GPGPUs is hidden, and we compare only the kernel execution time with our SSE2 and AVX implementations as shown in Figure
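A minimal OpenCL host-side sketch of this zero-copy arrangement (assuming an APU where `CL_MEM_ALLOC_HOST_PTR` yields pinned, zero-copy memory; the function name, chosen flags, and elided error handling are illustrative, not the project's actual host code):

```cpp
#include <CL/cl.h>
#include <cstdio>

// Allocate the input buffer in pinned, zero-copy host memory so the
// CPU can read file data directly into a region the GPGPU kernel can
// access without an explicit device copy.
cl_mem make_zero_copy_input(cl_context ctx, cl_command_queue queue,
                            FILE *in, size_t nbytes) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                nbytes, NULL, &err);
    // Map the buffer so the CPU can read the file straight into the
    // shared region: no separate host staging buffer, no copy-in.
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, nbytes, 0, NULL, NULL, &err);
    fread(p, 1, nbytes, in);
    // Unmapping hands the region over to the GPGPU kernel.
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```

On a traditional discrete GPGPU the same buffer would live in device memory and the file contents would have to cross the bus, which is exactly the transfer cost the excerpt says is hidden on the Fusion systems.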