Changeset 3883 for docs


Ignore:
Timestamp:
Jun 18, 2014, 1:07:02 AM (5 years ago)
Author:
cameron
Message:

Reference cleanups; use GPU instead of GPGPU

Location:
docs/Working/re
Files:
6 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/abstract.tex

    r3645 r3883  
    1414of Intel AVX2 or future 512-bit extensions.   Our AVX2 implementation
    1515showed dramatic reduction in instruction count and significant
    16 improvement in speed.   Our GPGPU implementations show substantial
     16improvement in speed.   Our GPU implementations show
    1717further acceleration. 
  • docs/Working/re/conclusion.tex

    r3880 r3883  
    2525continue to scale well up for SIMD vectors up to 4096 bytes
    2626in length based on 64-bit additions.    The model also
    27 supports GPGPU implementation with some additional
     27supports GPU implementation with some additional
    2828effort.
    2929
     
    7676grep version with a dynamic code generator implemented with LLVM
    7777is being developed by another team working with the Parabix
    78 technology (Dale Denis, Nigel Medforth, Nick Sumner and Rob Cameron).
     78technology (Dale Denis, Nick Sumner and Rob Cameron).
    7979An initial version is available at
    8080{\tt http://parabix.costar.sfu.ca/icGREP}.
     
    8888properties, for example.
    8989
    90 Future work also includes the dvelopment of multicore versions of the
     90Future work also includes the development of multicore versions of the
    9191underlying algorithms to further accelerate performance and to
    9292handle regular expression matching problems involving larger rule sets than are
    9393typically encountered in the grep problem.  Such implementations
    9494could have useful application in tokenization and network
    95 intrusion detection for example.   Additional GPGPU implementation
     95intrusion detection for example.   Additional GPU implementation
    9696work could take advantage of specialized instructions available
    9797on particular platforms but not generally avaiable through OpenCL.
    98 For both multicore and GPGPU implementations,
     98For both multicore and GPU implementations,
    9999data-parallel division of input streams could benefit from
    100100techniques such as the principled speculation of Zhao et al
  • docs/Working/re/pact051-cameron.tex

    r3882 r3883  
    119119interest in the use of
    120120parallel hardware such as multicore processors (CPUs),
    121 general purpose graphics processing units (GPGPUs),
     121 graphics processing units (GPUs),
    122122field-programmable gate arrays (FPGAs),
    123123or specialized architectures such as
     
    142142Naghmouchi et al \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
    143143string matching algorithm \cite{aho1975} is well-suited for parallel
    144 implementation on multicore CPUs, GPGPUs and the Cell BE.
     144implementation on multicore CPUs, GPUs and the Cell BE.
    145145% On each system, both thread-level parallelism (cores) and data-level parallelism
    146146% (SIMD units) were leveraged for performance.
     
    151151range of data sizes.
    152152%\paragraph{Cell Broadband Engine}
    153 On the Cell Broadband Engine, Scarpazza and Russell presented a SIMD tokenizer
    154 that delivered 1.00--1.78 Gbps on a single
    155 chip \cite{scarpazza2008} and extended this approach for emulation on the Intel Larrabee
    156 instruction set \cite{scarpazza2009larrabee}.
    157 Furthermore, in \cite{scarpazza2009cell} Scarpazza, described a pattern matching
     153On the Cell Broadband Engine, Scarpazza and Russell
     154%presented a SIMD tokenizer
     155%that delivered 1.00--1.78 Gbps on a single
     156%chip \cite{scarpazza2008} and extended this approach for emulation on the Intel Larrabee
     157%instruction set \cite{scarpazza2009larrabee}.
     158%Furthermore,
     159\cite{scarpazza2009cell}
     160described a pattern matching
    158161implementation that delivered a throughput of 40
    159162Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3-3.4
     
    161164presented a string matching implementation for automata that achieved
    1621654 Gbps on the Cell BE.
    163 %\paragraph{GPGPUs}
    164 On GPGPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
     166On GPUs, Tumeo et al \cite{tumeo2010efficient} presented a chunk-based
    165167implementation of the AC algorithm for
    166 accelerating string matching on GPGPUs. Lin et al., proposed
     168accelerating string matching on GPUs. Lin et al., proposed
    167169the Parallel Failureless Aho-Corasick (PFAC)
    168 algorithm to accelerate pattern matching on GPGPU hardware and
     170algorithm to accelerate pattern matching on GPU hardware and
    169171achieved 143 Gbps raw data throughput,
    170172although system throughput was limited to 15 Gbps
     
    180182based on adapting traditional sequential algorithms to emergent
    181183parallel architectures, we introduce both a new algorithmic
    182 approach and its implementation on SIMD and GPGPU architectures.
     184approach and its implementation on SIMD and GPU architectures.
    183185This approach relies on a bitwise data parallel view of text
    184186streams as well as a surprising use of addition to match
     
    196198in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
    197199for example, to perform
    198 synchronized 4096-bit addition on GPGPU wavefronts of 64 threads.
     200synchronized 4096-bit addition on GPU wavefronts of 64 threads.
    199201
    200202There is also a strong keyword match between the bit-parallel
     
    222224implementation of our techniques including the long stream
    223225addition techniques for 256-bit addition with AVX2 and
    224 4096-bit additions with GPGPU SIMT.
     2264096-bit additions with GPU SIMT.
    225227Section \ref{sec:SSE2} describes our overall SSE2 implementation
    226228and carries out a performance study in comparison with
     
    236238To further investigate scalability, Section \ref{sec:GPU}
    237239addresses the implementation of our matcher using
    238 groups of 64 threads working together SIMT-style on a GPGPU system.
     240groups of 64 threads working together SIMT-style on a GPU system.
    239241Section \ref{sec:Concl} concludes the paper with a discussion of results and
    240242areas for future work.
     
    304306algorithms.
    305307
    306 \section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
     308\section{Bit-Parallel Data Streams}\label{sec:bitwise}
    307309
    308310Whereas the traditional approaches to regular expression matching
     
    331333long-stream addition technique to avoid the sequential chaining
    332334of 4 64-bit additions.
    333 Our GPGPU implementation uses scripts to modify the output
     335Our GPU implementation uses scripts to modify the output
    334336of the Parabix tools, effectively dividing the input into blocks
    335337of 4K bytes.   
     
    490492chain is to incorporate the technique of long-stream addition described below.
    491493Otherwise, we were able to use Pablo directly in compiling our
    492 SSE2 and AVX2 implementations.   Our GPGPU implementation required
     494SSE2 and AVX2 implementations.   Our GPU implementation required
    493495some scripting to modify the output of the Pablo compiler for our
    494496purpose.
     
    503505
    504506We use the following general model using SIMD methods for constant-time
    505 long-stream addition up to 4096 bits.   Related GPGPU solutions have been
     507long-stream addition up to 4096 bits.   Related GPU solutions have been
    506508independently developed\cite{Crovella2012},
    507509however our model is intended to be a more broadly applicable abstraction.
     
    533535the parallel units.   There are a variety of ways in which
    534536these facilities may be implemented depending on the
    535 underlying architecture; details of our AVX2 and GPGPU implementations
     537underlying architecture; details of our AVX2 and GPU implementations
    536538are presented later.   
    537539
     
    611613
    612614As described subsequently, we use a two-level long-stream addition technique
    613 in both our AVX2 and GPGPU implementations.  In principle, one can extend
     615in both our AVX2 and GPU implementations.  In principle, one can extend
    614616the technique to additional levels.  Using 64-bit adders throughout,
    615617$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
     
    621623Using the methods outlined, it is quite conceivable that instruction
    622624set extensions to support long-stream addition could be added for
    623 future SIMD and GPGPU processors.   Given the fundamental nature
     625future SIMD and GPU processors.   Given the fundamental nature
    624626of addition as a primitive and its particular application to regular
    625627expression matching as shown herein, it seems reasonable to expect
     
    637639
    638640
    639 \section{GPGPU Implementation}\label{sec:GPU}
     641\section{GPU Implementation}\label{sec:GPU}
    640642
    641643To further assess the scalability of our regular expression matching
    642 using bit-parallel data streams, we implemented a GPGPU version
     644using bit-parallel data streams, we implemented a GPU version
    643645in OpenCL.   
    644646We arranged for 64 work groups each having 64 threads.
     
    648650the 64 work groups.  Each work group carries out the regular
    649651expression matching operations 4096 bytes at a time using SIMT
    650 processing.   Although the GPGPU
     652processing.   Although the GPU
    651653does not directly support the mask and spread operations required
    652654by our long-stream addition model,
     
    656658synchronized updates with the other threads using a six-step
    657659parallel-prefix style process.  Others have implemented
    658 long-stream addition on the GPGPU using similar techniques,
     660long-stream addition on the GPU using similar techniques,
    659661as noted previously.
    660662
     
    663665pinned memory to take advantage of the zero-copy memory regions
    664666where data can be read directly into this region by the CPU
    665 and also accessed by the GPGPU for further processing. Therefore,
     667and also accessed by the GPU for further processing. Therefore,
    666668the expensive data transferring time that is needed by traditional
    667 discrete GPGPUs is hidden and we compare only the kernel execution
     669discrete GPUs is hidden and we compare only the kernel execution
    668670time with our SSE2 and AVX implementations as shown in Figure
    669 \ref{fig:SSE-AVX-GPU}. The GPGPU version gives up to 55\% performance
     671\ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
    670672improvement over SSE version and up to 40\% performance
    671673improvement over AVX version. Although we intended to process
     
    676678and not all the work groups can be scheduled at the same time.
    677679The second reason is that the long-stream addition implemented
    678 on GPGPU is more expensive than the implementations on SSE or AVX.
     680on GPU is more expensive than the implementations on SSE or AVX.
    679681Another important reason is the control flow. When a possible
    680682match is found in one thread, the rest of the threads in the
     
    711713file {data/gputime.dat};
    712714
    713 \legend{SSE2,AVX2,GPGPU,Annot}
     715\legend{SSE2,AVX2,GPU,Annot}
    714716\end{axis}
    715717\end{tikzpicture}
  • docs/Working/re/reference.bib

    r3862 r3883  
    1010  title={Accelerating business analytics applications},
    1111  author={Salapura, Valentina and Karkhanis, Tejas and Nagpurkar, Priya and Moreira, Jos{\'e}},
    12   booktitle={High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on},
     12  booktitle={18th International Symposium on High Performance Computer Architecture (HPCA)},
    1313  pages={1--10},
    1414  year={2012},
     
    2828  title={High-performance regular expression scanning on the Cell/BE processor},
    2929  author={Scarpazza, Daniele Paolo and Russell, Gregory F},
    30   booktitle={Proceedings of the 23rd International Conference on Supercomputing},
     30  booktitle={23rd International Conference on Supercomputing},
    3131  pages={14--25},
    3232  year={2009},
     
    123123  title={A case study in {SIMD} text processing with parallel bit streams: {UTF}-8 to {UTF}-16 transcoding},
    124124  author={Cameron, Robert D},
    125   booktitle={Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
     125  booktitle={13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},
    126126  pages={91--98},
    127127  year={2008},
     
    131131  title={High performance {XML} parsing using parallel bit stream technology},
    132132  author={Cameron, Robert D and Herdy, Kenneth S and Lin, Dan},
    133   booktitle={Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds},
     133  booktitle={2008 Conference of the Center for Advanced Studies on Collaborative Research (CASCON)},
    134134  pages={17},
    135135  year={2008},
     
    157157  title={Parabix: Boosting the efficiency of text processing on commodity processors},
    158158  author={Lin, Dan and Medforth, Nigel and Herdy, Kenneth S and Shriraman, Arrvindh and Cameron, Rob},
    159   booktitle={High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on},
     159  booktitle={18th International Symposium on High Performance Computer Architecture (HPCA)},
    160160  pages={1--12},
    161161  year={2012},
     
    174174}
    175175
    176 @INPROCEEDINGS{Wu92agrep-,
    177     author = {Sun Wu and Udi Manber},
    178     title = {Agrep - A Fast Approximate Pattern-Matching Tool},
    179     booktitle = {In Proc. of USENIX Technical Conference},
    180     year = {1992},
    181     pages = {153--162}
    182 }
    183 
    184176@article{wu1992agrep,
    185   title={‘Agrep—A Fast Approximate Pattern-Matching Tool},
     177  title={Agrep - A Fast Approximate Pattern-Matching Tool},
    186178  author={Wu, Sun and Manber, Udi},
    187179  journal={Usenix Winter 1992},
     
    258250}
    259251
    260 @inproceedings{scarpazza2008,
    261         author={D. P. Scarpazza and G. F. Russell},
    262         year={2009},
    263         title={High-performance regular expression scanning on the Cell/BE processor},
    264         booktitle={Proceedings of the 23rd International Conference on Supercomputing},
    265         publisher={ACM},
    266         pages={14-25}
    267 }
    268 
    269252@misc{scarpazza2008fast,
    270253        author={Daniele Paolo Scarpazza and Oreste Villa and Fabrizio Petrinni},
     
    279262        author={D. P. Scarpazza and G. F. Russell},
    280263        year={2009},
    281         title={High-performance regular expression scanning on the Cell/BE processor},
     264        title={High-performance regular expression scanning on the {Cell/BE} processor},
    282265        booktitle={Proceedings of the 23rd International Conference on Supercomputing},
    283266        publisher={ACM},
     
    290273        month={June},
    291274        title={Efficient string matching: an aid to bibliographic search},
    292         journal={Commun.ACM},
     275        journal={Communications of the ACM},
    293276        volume={18},
    294277        number={6},
    295         pages={333-340},
    296         keywords={bibliographic search; computational complexity; finite state machines; information retrieval; keywords and phrases; string pattern matching; text-editing},
    297         isbn={0001-0782},
    298         url={http://doi.acm.org/10.1145/360825.360855}
     278        pages={333-340}
    299279}
    300280
     
    318298 pages = {26--37},
    319299 numpages = {12},
    320  url = {http://dl.acm.org/citation.cfm?id=1966221.1966225},
    321  acmid = {1966225},
    322300 publisher = {Springer-Verlag},
    323301 address = {Berlin, Heidelberg},
     
    360338  title={Architectural support for {SWAR} text processing with parallel bit streams: the inductive doubling principle},
    361339  author={Cameron, Robert D and Lin, Dan},
    362   booktitle={Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
     340  booktitle={14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    363341  pages={337--348},
    364342  year={2009},
     
    371349               Wolfram Schulte},
    372350  title     = {Data-parallel finite-state machines},
    373   booktitle = {ASPLOS},
     351  booktitle={19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    374352  year      = {2014},
    375   pages     = {529-542},
    376   ee        = {http://doi.acm.org/10.1145/2541940.2541988},
    377   crossref  = {DBLP:conf/asplos/2014},
    378   bibsource = {DBLP, http://dblp.uni-trier.de}
     353  pages     = {529-542}
    379354}
    380355
     
    383358               Bo Wu and
    384359               Xipeng Shen},
    385   title     = {Challenging the "embarrassingly sequential": parallelizing
     360  title     = {Challenging the ``embarrassingly sequential'': parallelizing
    386361               finite state machine-based computations through principled
    387362               speculation},
    388   booktitle = {ASPLOS},
     363  booktitle={19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    389364  year      = {2014},
    390   pages     = {543-558},
    391   ee        = {http://doi.acm.org/10.1145/2541940.2541989},
    392   crossref  = {DBLP:conf/asplos/2014},
    393   bibsource = {DBLP, http://dblp.uni-trier.de}
    394 }
    395 
    396 @proceedings{DBLP:conf/asplos/2014,
    397   editor    = {Rajeev Balasubramonian and
    398                Al Davis and
    399                Sarita V. Adve},
    400   title     = {Architectural Support for Programming Languages and Operating
    401                Systems, ASPLOS '14, Salt Lake City, UT, USA, March 1-5,
    402                2014},
    403   booktitle = {ASPLOS},
    404   publisher = {ACM},
    405   year      = {2014},
    406   isbn      = {978-1-4503-2305-5},
    407   ee        = {http://dl.acm.org/citation.cfm?id=2541940},
    408   bibsource = {DBLP, http://dblp.uni-trier.de}
    409 }
    410 
     365  pages     = {543-558}
     366}
     367
  • docs/Working/re/sse2.tex

    r3880 r3883  
    1 \section{SSE2 Implementation and Evaluation}\label{sec:SSE2}
     1\section{SSE2 Implementation}\label{sec:SSE2}
    22
    33\begin{table*}[htbp]
Note: See TracChangeset for help on using the changeset viewer.