Timestamp:
Feb 22, 2014, 2:23:52 PM (5 years ago)
Author:
lindanl
Message:

GPU section

File:
1 edited

  • docs/Working/re/re-main.tex

    r3633 r3634  
    594 594 \ref{fig:SSE-AVX-GPU}. The GPU version gives 30\% to 55\% performance
    595 595 improvement over SSE version and 10\% to 40\% performance
    596     improvement over AVX version.
        596 improvement over AVX version. Although we intended to process
        597 64 work groups with 4096 bytes each at a time rather than 128 bytes
        598 at a time on SSE or 256 bytes at a time on AVX, the performance
        599 improvement is less than 55\%. The first reason is hardware
        600 limitations. Our kernel occupancy is limited by register usage
        601 and not all the work groups can be scheduled at the same time.
        602 The second reason is that the long-stream addition implemented
        603 on GPU is more expensive than the implementations on SSE or AVX.
        604 Another important reason is the control flow. When a possible
        605 match is found in one thread, the rest of the threads in the
        606 same work group have to execute the same instructions for
        607 further processing rather than jump to the next block with a
        608 simple IF test. Therefore, the performance of different
        609 regular expressions depends on the number of function calls
        610 to the long-stream addition and the total number of matches
        611 in a given input.
    597 612
    598 613
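
The added paragraph attributes part of the GPU cost to long-stream addition, that is, adding two bit streams that are wider than a single machine word. The following is a minimal sequential sketch of one way such an addition can be organized, using a carry mask and a "bubble" mask (lanes that are all ones) so that carries are resolved with one small addition instead of a lane-by-lane ripple. It is an illustration under assumptions, not the kernel from this changeset: the 64-bit lane width and the names match_star and long_stream_add are hypothetical and chosen only for this example.

```python
LANE_BITS = 64                      # assumed lane width, for illustration only
LANE_MASK = (1 << LANE_BITS) - 1


def match_star(m, c):
    """Extend each marker bit in m upward through the run of c bits it touches."""
    return (((m & c) + c) ^ c) | m


def long_stream_add(x_lanes, y_lanes, carry_in=0):
    """Add two streams stored as little-endian lists of LANE_BITS-bit lanes.

    Returns (sum_lanes, carry_out).
    """
    n = len(x_lanes)
    # Step 1: independent per-lane sums (one lane per thread on a GPU).
    r = [(x + y) & LANE_MASK for x, y in zip(x_lanes, y_lanes)]
    # Step 2: one bit per lane, set if the lane produced a carry or is all ones.
    carry = 0
    bubble = 0
    for k in range(n):
        if x_lanes[k] + y_lanes[k] > LANE_MASK:
            carry |= 1 << k
        if r[k] == LANE_MASK:
            bubble |= 1 << k
    # Step 3: a lane is incremented if a carry reaches it, either directly
    # or by passing through a run of all-ones lanes below it.
    inc = match_star((carry << 1) | carry_in, bubble)
    # Step 4: apply the increments; bit n of inc is the overall carry out.
    z = [(r[k] + ((inc >> k) & 1)) & LANE_MASK for k in range(n)]
    return z, (inc >> n) & 1


def to_lanes(value, n):
    """Split a non-negative integer into n little-endian lanes."""
    return [(value >> (LANE_BITS * k)) & LANE_MASK for k in range(n)]


def from_lanes(lanes):
    """Reassemble a little-endian list of lanes into one integer."""
    return sum(lane << (LANE_BITS * k) for k, lane in enumerate(lanes))
```

In this sketch, steps 2 and 3 need information from every lane before any lane can finish, which on a GPU would imply communication and synchronization across the threads of a work group. That is one plausible reading of why the paragraph describes the operation as more expensive than its SSE or AVX counterparts.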