# Changeset 3883 for docs/Working

Timestamp:
Jun 18, 2014, 1:07:02 AM
Message:

Reference cleanups; use GPU instead of GPGPU

Location:
docs/Working/re
Files:
6 edited

• ## docs/Working/re/abstract.tex

 r3645 of Intel AVX2 or future 512-bit extensions.   Our AVX2 implementation showed dramatic reduction in instruction count and significant improvement in speed.   Our GPU implementations show further acceleration.
• ## docs/Working/re/conclusion.tex

 r3880 continue to scale well for SIMD vectors up to 4096 bytes in length based on 64-bit additions.    The model also supports GPU implementation with some additional effort. A grep version with a dynamic code generator implemented with LLVM is being developed by another team working with the Parabix technology (Dale Denis, Nick Sumner and Rob Cameron). An initial version is available at {\tt http://parabix.costar.sfu.ca/icGREP}. properties, for example. Future work also includes the development of multicore versions of the underlying algorithms to further accelerate performance and to handle regular expression matching problems involving larger rule sets than are typically encountered in the grep problem.  Such implementations could have useful application in tokenization and network intrusion detection, for example.   Additional GPU implementation work could take advantage of specialized instructions available on particular platforms but not generally available through OpenCL. For both multicore and GPU implementations, data-parallel division of input streams could benefit from techniques such as the principled speculation of Zhao et al.
• ## docs/Working/re/pact051-cameron.tex

 r3882 interest in the use of parallel hardware such as multicore processors (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), or specialized architectures such as Naghmouchi et al.\ \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC) string matching algorithm \cite{aho1975} is well-suited for parallel implementation on multicore CPUs, GPUs and the Cell BE. % On each system, both thread-level parallelism (cores) and data-level parallelism % (SIMD units) were leveraged for performance. range of data sizes. %\paragraph{Cell Broadband Engine} On the Cell Broadband Engine, Scarpazza and Russell %presented a SIMD tokenizer %that delivered 1.00--1.78 Gbps on a single %chip \cite{scarpazza2008} and extended this approach for emulation on the Intel Larrabee %instruction set \cite{scarpazza2009larrabee}. %Furthermore, \cite{scarpazza2009cell} described a pattern matching implementation that delivered a throughput of 40 Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4 presented a string matching implementation for automata that achieved 4 Gbps on the Cell BE. %\paragraph{GPGPUs} On GPUs, Tumeo et al.\ \cite{tumeo2010efficient} presented a chunk-based implementation of the AC algorithm for accelerating string matching on GPUs.
Lin et al.\ proposed the Parallel Failureless Aho-Corasick (PFAC) algorithm to accelerate pattern matching on GPU hardware and achieved 143 Gbps raw data throughput, although system throughput was limited to 15 Gbps based on adapting traditional sequential algorithms to emergent parallel architectures, we introduce both a new algorithmic approach and its implementation on SIMD and GPU architectures. This approach relies on a bitwise data parallel view of text streams as well as a surprising use of addition to match in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique, for example, to perform synchronized 4096-bit addition on GPU wavefronts of 64 threads. There is also a strong keyword match between the bit-parallel implementation of our techniques including the long stream addition techniques for 256-bit addition with AVX2 and 4096-bit additions with GPU SIMT. Section \ref{sec:SSE2} describes our overall SSE2 implementation and carries out a performance study in comparison with To further investigate scalability, Section \ref{sec:GPU} addresses the implementation of our matcher using groups of 64 threads working together SIMT-style on a GPU system. Section \ref{sec:Concl} concludes the paper with a discussion of results and areas for future work. algorithms. \section{Bit-Parallel Data Streams}\label{sec:bitwise} Whereas the traditional approaches to regular expression matching long-stream addition technique to avoid the sequential chaining of 4 64-bit additions.
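The "surprising use of addition to match" can be illustrated with the MatchStar operation from the bitwise data-parallel approach. The sketch below models bit streams as Python integers (bit $i$ corresponds to text position $i$); it is an illustrative simulation, not the paper's SIMD code.

```python
def matchstar(markers: int, char_class: int) -> int:
    """Extend each marker through a run of class bits using one addition.

    markers: bit i set if a match reaches position i.
    char_class: bit i set if position i belongs to the starred class.
    Carries from the addition sweep each marker forward through its run,
    marking every position the Kleene-star closure can reach.
    """
    return (((markers & char_class) + char_class) ^ char_class) | markers

# A marker at position 1 sweeps through the class run at positions 1-2
# and also marks position 3, the first position past the run.
print(bin(matchstar(0b10, 0b01110110)))  # -> 0b1110
```

Because the addition's carry chain does the propagation, a run of any length is matched in a single step per 64-bit word, which is where the $\lceil\log_{64}{n}\rceil$ bound for longer streams comes from.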
Our GPU implementation uses scripts to modify the output of the Parabix tools, effectively dividing the input into blocks of 4K bytes. chain is to incorporate the technique of long-stream addition described below. Otherwise, we were able to use Pablo directly in compiling our SSE2 and AVX2 implementations.   Our GPU implementation required some scripting to modify the output of the Pablo compiler for our purpose. We use the following general model using SIMD methods for constant-time long-stream addition up to 4096 bits.   Related GPU solutions have been independently developed \cite{Crovella2012}; however, our model is intended to be a more broadly applicable abstraction. the parallel units.   There are a variety of ways in which these facilities may be implemented depending on the underlying architecture; details of our AVX2 and GPU implementations are presented later. As described subsequently, we use a two-level long-stream addition technique in both our AVX2 and GPU implementations.  In principle, one can extend the technique to additional levels.  Using 64-bit adders throughout, $\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition. Using the methods outlined, it is quite conceivable that instruction set extensions to support long-stream addition could be added for future SIMD and GPU processors.
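The long-stream addition model can be sketched by simulating 64-bit fields in Python: independent field sums are computed first, then carries, "bubble" fields (whose sum is all ones and therefore propagates an incoming carry), and a single MatchStar-style addition resolve all inter-field carries at once. The little-endian list layout and helper names are assumptions for illustration.

```python
W = 64
ONES = (1 << W) - 1

def matchstar(m: int, c: int) -> int:
    # Propagate set bits in m through runs of set bits in c via addition.
    return (((m & c) + c) ^ c) | m

def long_add(a: list[int], b: list[int]) -> list[int]:
    """Add two long numbers held as little-endian lists of 64-bit fields.

    Step 1: independent field sums, recording carry-outs (c) and
    bubble fields whose sum is all ones (they propagate a carry).
    Step 2: deliver all inter-field carries at once with MatchStar.
    """
    x = [(ai + bi) & ONES for ai, bi in zip(a, b)]
    c = sum(((ai + bi) >> W) << j for j, (ai, bi) in enumerate(zip(a, b)))
    bubble = sum((xi == ONES) << j for j, xi in enumerate(x))
    incr = matchstar(c << 1, bubble)      # per-field carry-in mask
    return [(xi + ((incr >> j) & 1)) & ONES for j, xi in enumerate(x)]

# 256-bit example: (2^256 - 1) + 1 wraps to 0 across all four fields.
print(long_add([ONES] * 4, [1, 0, 0, 0]))  # -> [0, 0, 0, 0]
```

Both steps are constant-time over the fields, which is what makes the technique attractive for 256-bit AVX2 and 4096-bit GPU variants.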
Given the fundamental nature of addition as a primitive and its particular application to regular expression matching as shown herein, it seems reasonable to expect \section{GPU Implementation}\label{sec:GPU} To further assess the scalability of our regular expression matching using bit-parallel data streams, we implemented a GPU version in OpenCL. We arranged for 64 work groups, each having 64 threads. the 64 work groups.  Each work group carries out the regular expression matching operations 4096 bytes at a time using SIMT processing.   Although the GPU does not directly support the mask and spread operations required by our long-stream addition model, synchronized updates with the other threads using a six-step parallel-prefix style process.  Others have implemented long-stream addition on the GPU using similar techniques, as noted previously. pinned memory to take advantage of the zero-copy memory regions where data can be read directly into this region by the CPU and also accessed by the GPU for further processing. Therefore, the expensive data transfer time that is needed by traditional discrete GPUs is hidden and we compare only the kernel execution time with our SSE2 and AVX implementations as shown in Figure \ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance improvement over the SSE version and up to 40\% performance improvement over the AVX version. Although we intended to process and not all the work groups can be scheduled at the same time.
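The six-step parallel-prefix carry resolution across 64 threads can be modelled as a Kogge-Stone-style scan over (generate, propagate) pairs. The Python simulation below of 64 lanes is an assumption about the style of scan involved, not the OpenCL kernel itself.

```python
def prefix_carry_in(gen: list[int], prop: list[int]) -> list[int]:
    """Compute the carry into each of 64 lanes in log2(64) = 6 steps.

    gen[i]: lane i generates a carry-out; prop[i]: lane i propagates an
    incoming carry (its partial sum is all ones).  Each doubling step
    combines pairs: g = g | (p & g_left), p = p & p_left.
    """
    n = len(gen)                  # 64 lanes -> 6 doubling steps
    g, p = gen[:], prop[:]
    d = 1
    while d < n:
        # Build both new lists from the old values before reassigning.
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    # The carry into lane i is the scanned carry-out of lane i-1.
    return [0] + g[:-1]

# Lane 0 generates a carry; lanes 1-3 propagate it through to lane 4.
print(prefix_carry_in([1] + [0] * 63, [0, 1, 1, 1] + [0] * 60)[:6])
# -> [0, 1, 1, 1, 1, 0]
```

Since the combine operator is associative, the scan reproduces the sequential ripple-carry result while every lane works in parallel, matching the six synchronized update rounds described for the 64-thread work groups.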
The second reason is that the long-stream addition implemented on GPU is more expensive than the implementations on SSE or AVX. Another important reason is the control flow. When a possible match is found in one thread, the rest of the threads in the file {data/gputime.dat}; \legend{SSE2,AVX2,GPU,Annot} \end{axis} \end{tikzpicture}
• ## docs/Working/re/reference.bib

 r3862 title={Accelerating business analytics applications},
author={Salapura, Valentina and Karkhanis, Tejas and Nagpurkar, Priya and Moreira, Jos{\'e}},
booktitle={18th International Symposium on High Performance Computer Architecture (HPCA)},
pages={1--10},
year={2012},
title={High-performance regular expression scanning on the Cell/BE processor},
author={Scarpazza, Daniele Paolo and Russell, Gregory F},
booktitle={23rd International Conference on Supercomputing},
pages={14--25},
year={2009},
title={A case study in {SIMD} text processing with parallel bit streams: {UTF}-8 to {UTF}-16 transcoding},
author={Cameron, Robert D},
booktitle={13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},
pages={91--98},
year={2008},
title={High performance {XML} parsing using parallel bit stream technology},
author={Cameron, Robert D and Herdy, Kenneth S and Lin, Dan},
booktitle={2008 Conference of the Center for Advanced Studies on Collaborative Research (CASCON)},
pages={17},
year={2008},
title={Parabix: Boosting the efficiency of text processing on commodity processors},
author={Lin, Dan and Medforth, Nigel and Herdy, Kenneth S and Shriraman, Arrvindh and Cameron, Rob},
booktitle={18th International Symposium on High Performance Computer Architecture (HPCA)},
pages={1--12},
year={2012},
}
@INPROCEEDINGS{Wu92agrep-,
  author = {Sun Wu and Udi Manber},
  title = {Agrep - A Fast Approximate Pattern-Matching Tool},
  booktitle = {USENIX Technical Conference},
  year = {1992},
  pages = {153--162}
}
@article{wu1992agrep,
  title={Agrep - A Fast Approximate Pattern-Matching Tool},
  author={Wu, Sun and Manber, Udi},
  journal={Usenix Winter 1992},
}
@inproceedings{scarpazza2008,
  author={D. P. Scarpazza and G. F. Russell},
  year={2009},
  title={High-performance regular expression scanning on the {Cell/BE} processor},
  booktitle={Proceedings of the 23rd International Conference on Supercomputing},
  publisher={ACM},
  pages={14--25}
}
@misc{scarpazza2008fast,
  author={D. P. Scarpazza and G. F. Russell},
  year={2009},
  title={High-performance regular expression scanning on the {Cell/BE} processor},
  booktitle={Proceedings of the 23rd International Conference on Supercomputing},
  publisher={ACM},
  month={June},
title={Efficient string matching: an aid to bibliographic search},
journal={Communications of the ACM},
volume={18},
number={6},
pages={333--340}
}
pages = {26--37}, numpages = {12}, url = {http://dl.acm.org/citation.cfm?id=1966221.1966225}, acmid = {1966225}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg},
title={Architectural support for {SWAR} text processing with parallel bit streams: the inductive doubling principle},
author={Cameron, Robert D and Lin, Dan},
booktitle={14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
pages={337--348},
year={2009},
Wolfram Schulte},
title     = {Data-parallel finite-state machines},
booktitle = {19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
year      = {2014},
pages     = {529--542}
}
Bo Wu and Xipeng Shen},
title     = {Challenging the ``embarrassingly sequential'': parallelizing finite state machine-based computations through principled speculation},
booktitle = {19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
year      = {2014},
pages     = {543--558}
}
• ## docs/Working/re/sse2.tex

 r3880 \section{SSE2 Implementation}\label{sec:SSE2} \begin{table*}[htbp]