 Timestamp:
 Feb 24, 2014, 11:11:19 AM (5 years ago)
 Location:
 docs/Working/re
 Files: 3 edited
docs/Working/re/abstract.tex
r3623 → r3645:

   of Intel AVX2 or future 512-bit extensions. Our AVX2 implementation
   showed dramatic reduction in instruction count and significant
 - improvement in speed. Our GPU implementations show substantial
 + improvement in speed. Our GPGPU implementations show substantial
   further acceleration.
docs/Working/re/remain.tex
r3642 → r3645:

   is to convert an NFA into a DFA. A DFA has a single active state at any time
   throughout the matching process and
 - hence it is possible to search a text at of length $n$ in $O(n)$ time
 - \footnote{It is well-known that the conversion of an NFA to an equivalent DFA may result
 + hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well-
 + known that the conversion of an NFA to an equivalent DFA may result
   in state explosion. That is, the number of resultant DFA states may increase exponentially.}.

 - A significant portion of the research in fast regular expression matching can be
 + A significant proportion of the research in fast regular expression matching can be
   regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
   In \cite{baeza1992new}, Baeza-Yates and Gonnet
 - presented a new approach to string search based on bit parallelism.
 + presented a new approach to string search based on bit-level parallelism.
   This technique takes advantage of the intrinsic parallelism of bitwise operations
   within a computer word.
   …
   performs can be reduced by a factor $w$.
   Using this fact, the Shift-Or algorithm simulates an NFA using
 - bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
 - A disadvantage of the bit-parallel Shift-Or pattern matching approach
 + bitwise operations and achieves $O(\frac{nm}{w})$ worst-case time \cite{navarro2000}.
 + A disadvantage of the Shift-Or approach
   is an inability to skip input characters.
   Simple string matching algorithms,
   such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical} skip input characters
   to achieve sublinear times in the average case.
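The Shift-Or simulation discussed in this hunk can be illustrated with a short sketch. This is an illustrative implementation of the classic algorithm, not code from the changeset; the function and variable names are ours:

```python
def shift_or_search(text, pattern):
    """Bit-parallel Shift-Or matcher: simulates the pattern NFA inside a
    single integer, one bitwise step per input character."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c] has a 0 at bit i iff pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    state = all_ones                 # bit i == 0 means pattern[0..i] matches here
    starts = []
    for pos, c in enumerate(text):
        state = ((state << 1) & all_ones) | B.get(c, all_ones)
        if state & (1 << (m - 1)) == 0:   # bit m-1 clear: full pattern matched
            starts.append(pos - m + 1)    # record match start position
    return starts
```

For example, `shift_or_search("abracadabra", "abra")` returns `[0, 7]`. Each input character costs a constant number of word operations regardless of pattern length (up to the word size), which is the $O(\frac{nm}{w})$ bound the text cites; note also that the loop inspects every character, illustrating the inability to skip input that the hunk describes.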
 - Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
 - based on suffix automata are able to skip characters.
 + % Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
 + % based on suffix automata are able to skip characters.
   The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
   combines the bit-parallel advantages of the Shift-Or approach
 - with the character skipping property of BDM algorithms. The nrgrep pattern matching tool,
 - is built over the BNDM algorithm. Prior to the data-parallel approach to simultaneous
 - processing of data stream elements as discussed herein, nrgrep
 + with the ability to skip characters. %character skipping property of BDM algorithms.
 + The nrgrep pattern matching tool,
 + is based on the BNDM algorithm. Prior to the bitwise
 + data parallel approach presented herein, nrgrep
   was by far the fastest grep tool
 - for matching complex patterns and achieved similar performance
 + for matching complex patterns, and achieved similar performance
   to that of the fastest existing string
   matching tools for simple patterns \cite{navarro2000}.

 - There has been considerable
 + There has been considerable recent
   interest in accelerating regular expression matching
   on parallel hardware
 - such as multi-core processors (CPUs), graphics processing units (GPUs),
 - field-programmable gate arrays (FPGAs), and specialized architectures such as
 + such as multi-core processors (CPUs),
 + general purpose graphics processing units (GPGPUs),
 + field-programmable gate arrays (FPGAs),
 + and specialized architectures such as
   the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
   %CPU
   …
   Naghmouchi et al.
   \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
   string matching algorithm \cite{aho1975} is well suited for parallel
 - implementation on multi-core CPUs, GPUs and the Cell BE.
 + implementation on multi-core CPUs, GPGPUs and the Cell BE.
   On each hardware, both thread-level parallelism (cores) and data-level parallelism
   (SIMD units) were leveraged for performance.
   …
   In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
   implementation of the AC algorithm for
 - accelerating string matching on GPUs. Lin et al., proposed
 + accelerating string matching on GPGPUs. Lin et al., proposed
   the Parallel Failureless Aho-Corasick (PFAC)
 - algorithm to accelerate pattern matching on GPU hardware and
 + algorithm to accelerate pattern matching on GPGPU hardware and
   achieved 143 Gbps throughput, 14.74 times faster
   than the AC algorithm performed on a four-core
   …
   based on adapting traditional sequential algorithms to emergent
   parallel architectures, we introduce both a new algorithmic
 - approach and its implementation on SIMD and GPU architectures.
 + approach and its implementation on SIMD and GPGPU architectures.
   This approach relies on a bitwise data parallel view of text
   streams as well as a surprising use of addition to match
   …
   in $\lceil\log_{64}{n}\rceil$ steps. We ultimately apply this technique,
   for example, to perform
 - synchronized 4096-bit addition on GPU warps of 64 threads.
 + synchronized 4096-bit addition on GPGPU warps of 64 threads.

   There is also a strong keyword match between the bit-parallel
   …
   notation and the grep problem.
   Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
 - using a model of arbitrary-length bit-parallel data streams.
 + primitive using a model of arbitrary-length bit-parallel data streams.
   Section \ref{sec:blockwise} discusses the block-by-block
   implementation of our techniques including the long stream
   …
   We also have adapted our long-stream addition technique
   to perform 4096-bit additions using 64 threads working in lockstep
 - SIMT fashion. A similar technique is known to the GPU programming
 + SIMT fashion. A similar technique is known to the GPGPU programming
   community\cite{}.
   …
   chain is to incorporate the technique of long-stream addition described below.
   Otherwise, we were able to use Pablo directly in compiling our
 - SSE2 and AVX2 implementations. Our GPU implementation required
 + SSE2 and AVX2 implementations. Our GPGPU implementation required
   some scripting to modify the output of the Pablo compiler for our
   purpose.
   …
   As described subsequently, we use a two-level long-stream addition technique
 - in both our AVX2 and GPU implementations. In principle, one can extend
 + in both our AVX2 and GPGPU implementations. In principle, one can extend
   the technique to additional levels. Using 64-bit adders throughout,
   $\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
   …
   Using the methods outlined, it is quite conceivable that instruction
   set extensions to support long-stream addition could be added for
 - future SIMD and GPU processors. Given the fundamental nature
 + future SIMD and GPGPU processors. Given the fundamental nature
   of addition as a primitive and its particular application to regular
   expression matching as shown herein, it seems reasonable to expect
   …

 - \section{GPU Implementation}\label{sec:GPU}
 + \section{GPGPU Implementation}\label{sec:GPU}

   To further assess the scalability of our regular expression matching
   …
   the 64 work groups.
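The long-stream addition these hunks refer to computes all chunk sums independently and then resolves inter-chunk carries in a single step rather than serially. Below is a plain-Python model of that idea as we read it from the text, using a MatchStar-style propagation; the helper names and the pure-Python framing are ours, not the authors' implementation:

```python
W = 64
MASK = (1 << W) - 1

def matchstar(m, c):
    # MatchStar(M, C) = (((M & C) + C) ^ C) | M: propagates marker bits in M
    # through contiguous runs of 1s in C using a single addition.
    return (((m & c) + c) ^ c) | m

def long_add(x, y):
    """Add two long numbers given as little-endian lists of 64-bit chunks.
    Chunk sums are computed independently; carries between chunks are then
    resolved at once, instead of rippling chunk by chunk."""
    sums, carry, bubble = [], 0, 0
    for j, (a, b) in enumerate(zip(x, y)):
        s = a + b
        if s > MASK:
            carry |= 1 << j       # chunk j generates a carry-out
        s &= MASK
        if s == MASK:
            bubble |= 1 << j      # an all-ones chunk propagates any carry-in
        sums.append(s)
    # Carry bits, shifted up one chunk position, travel through runs of
    # all-ones chunks in one addition; incr marks chunks to increment.
    incr = matchstar(carry << 1, bubble)
    return [(s + ((incr >> j) & 1)) & MASK for j, s in enumerate(sums)]
```

For instance, adding `[MASK, MASK, 0, 5]` and `[1, 0, 3, 0]` (i.e. $2^{128}-1+5\cdot2^{192}$ plus $1+3\cdot2^{128}$) yields `[0, 0, 4, 5]`: the carry out of chunk 0 passes through the all-ones chunk 1 into chunk 2 in a single propagation step. With 64 such chunks this models the synchronized 4096-bit addition described for warps of 64 threads, where `carry` and `bubble` correspond to the mask operation and `incr` to the spread.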
   Each work group carries out the regular
   expression matching operations 4096 bytes at a time using SIMT
 - processing. Although the GPU
 + processing. Although the GPGPU
   does not directly support the mask and spread operations required
   by our long-stream addition model,
   …
   synchronized updates with the other threads using a six-step
   parallel-prefix style process. Others have implemented
 - long-stream addition on the GPU using similar techniques,
 + long-stream addition on the GPGPU using similar techniques,
   as noted previously.
   …
   pinned memory to take advantage of the zero-copy memory regions
   where data can be read directly into this region by CPU
 - and also accessed by GPU for further processing. Therefore,
 + and also accessed by GPGPU for further processing. Therefore,
   the expensive data transferring time that needed by traditional
 - discrete GPUs is hidden and we compare only the kernel execution
 + discrete GPGPUs is hidden and we compare only the kernel execution
   time with our SSE2 and AVX implementations as shown in Figure
 - \ref{fig:SSEAVXGPU}. The GPU version gives 30\% to 55\% performance
 + \ref{fig:SSEAVXGPU}. The GPGPU version gives 30\% to 55\% performance
   improvement over SSE version and 10\% to 40\% performance
   improvement over AVX version. Although we intended to process
   …
   and not all the work groups can be scheduled at the same time.
   The second reason is that the long-stream addition implemented
 - on GPU is more expensive than the implementations on SSE or AVX.
 + on GPGPU is more expensive than the implementations on SSE or AVX.
   Another important reason is the control flow. When a possible
   match is found in one thread, the rest of the threads in the