Timestamp:
Jun 25, 2014, 6:06:33 PM (5 years ago)
Author:
cameron
Message:

Little clean-ups

File:
1 edited

  • docs/Working/re/pact051-cameron.tex

    r3896 r3897  
    643 643
    644 644
    645 
    646 \section{GPU Implementation}\label{sec:GPU}
    647 
    648 To further assess the scalability of our regular expression matching
    649 using bit-parallel data streams, we implemented a GPU version
    650 in OpenCL.   
    651 We arranged for 64 work groups each having 64 threads.
    652 The work group size and the number of work groups were chosen
    653 to provide the best occupancy as calculated by the AMD App Profiler.
    654 Input files are divided in data parallel fashion among
    655 the 64 work groups.  Each work group carries out the regular
    656 expression matching operations 4096 bytes at a time using SIMT
    657 processing.   Although the GPU
    658 does not directly support the mask and spread operations required
    659 by our long-stream addition model,
    660 we are able to simulate them using shared memory.
    661 Each thread maintains
    662 its own carry and bubble values in shared memory and performs
    663 synchronized updates with the other threads using a six-step
    664 parallel-prefix style process.  Others have implemented
    665 long-stream addition on the GPU using similar techniques,
    666 as noted previously.
    667 
    668 We performed our test on an AMD Radeon HD A10-6800K APU machine.
    669 On AMD Fusion systems, the input buffer is allocated in
    670 pinned memory to take advantage of zero-copy memory regions:
    671 the CPU reads data directly into such a region
    672 and the GPU then accesses it for further processing. Therefore,
    673 the expensive data transfer time incurred by traditional
    674 discrete GPUs is hidden and we compare only the kernel execution
    675 time with our SSE2 and AVX implementations as shown in Figure
    676 \ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
    677 improvement over the SSE version and up to 40\% performance
    678 improvement over the AVX version.   However, the StarHeight
    679 expression has been omitted from the GPU results because of the
    680 implementation complexity of its triply-nested while loop.
    681 
    682 Although we intended to process
    683 64 work groups with 4096 bytes each at a time rather than 128 bytes
    684 at a time on SSE or 256 bytes at a time on AVX, the performance
    685 improvement is less than 60\%. The first reason is hardware
    686 limitations. Our kernel occupancy is limited by register usage
    687 and not all the work groups can be scheduled at the same time.
    688 The second reason is that long-stream addition as implemented
    689 on the GPU is more expensive than the implementations on SSE or AVX.
    690 Another important reason is the control flow. When a possible
    691 match is found in one thread, the rest of the threads in the
    692 same work group have to execute the same instructions for
    693 further processing rather than jump to the next block with a
    694 simple IF test. Therefore, the performance of different
    695 regular expressions is dependent on the number of
    696 long-stream addition operations and the total number of matches
    697 in a given input.   Perhaps surprisingly, the overhead of the Parabix
    698 transformation was not a dominant factor, coming in at 0.08 ms/MB.
    699 
    700 
    701 \begin{figure}
    702 \begin{center}
    703 \begin{tikzpicture}
    704 \begin{axis}[
    705 xtick=data,
    706 ylabel=Running Time (ms per megabyte),
    707 xticklabels={@,Date,Email,URI,Hex,StarHeight},
    708 tick label style={font=\tiny},
    709 enlarge x limits=0.15,
    710 %enlarge y limits={0.15, upper},
    711 ymin=0,
    712 legend style={at={(0.5,-0.15)},
    713 anchor=north,legend columns=-1},
    714 ybar,
    715 bar width=7pt,
    716 cycle list = {black,black!70,black!40,black!10}
    717 ]
    718 \addplot+[]
    719 file {data/ssetime.dat};
    720 \addplot+[fill,text=black]
    721 file {data/avxtime.dat};
    722 \addplot+[fill,text=black]
    723 file {data/gputime.dat};
    724 
    725 \legend{SSE2,AVX2,GPU,Annot}
    726 \end{axis}
    727 \end{tikzpicture}
    728 \end{center}
    729 \caption{Running Time}\label{fig:SSE-AVX-GPU}
    730 \end{figure}
    731 
    732 
    733 
    734 
    735 
    736 
    737 
    738 
    739 645 \input{conclusion}
    740 646
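
The GPU Implementation section in this changeset describes each thread keeping carry and bubble values in shared memory and resolving them with a six-step parallel-prefix style process. The section gives no code for this, so the following Python sketch is only an illustration under stated assumptions: it models the 64 threads as list positions, assumes "carry" means a 64-bit field overflows on its own and "bubble" means a field sums to all ones, and uses a Kogge-Stone style scan, which may differ from the authors' OpenCL kernel. The names `long_stream_add`, `FIELD_BITS`, and `THREADS` are hypothetical.

```python
FIELD_BITS = 64                   # assumed width of the field each thread owns
THREADS = 64                      # threads per work group, as stated in the text
MASK = (1 << FIELD_BITS) - 1


def long_stream_add(x, y, carry_in=0):
    """Add two 4096-bit integers, modelling one 64-thread work group."""
    # Thread t owns the t-th 64-bit field of each operand.
    xs = [(x >> (FIELD_BITS * t)) & MASK for t in range(THREADS)]
    ys = [(y >> (FIELD_BITS * t)) & MASK for t in range(THREADS)]

    # Independent 64-bit sums; carries between fields are ignored for now.
    sums = [(a + b) & MASK for a, b in zip(xs, ys)]

    # Per-thread flags, held in plain lists standing in for shared memory.
    # carry[t]:  field t overflows by itself.
    # bubble[t]: field t is all ones, so an incoming carry passes through it.
    carry = [int(a + b > MASK) for a, b in zip(xs, ys)]
    bubble = [int(s == MASK) for s in sums]

    # log2(64) = 6 rounds of prefix combination. The per-round snapshots
    # play the role of a barrier between rounds on a real GPU.
    dist = 1
    while dist < THREADS:
        prev_c, prev_b = carry[:], bubble[:]
        for t in range(dist, THREADS):
            carry[t] = prev_c[t] | (prev_b[t] & prev_c[t - dist])
            bubble[t] = prev_b[t] & prev_b[t - dist]
        dist *= 2
    # Now carry[t] is the carry out of fields 0..t (ignoring carry_in) and
    # bubble[t] says whether carry_in would ripple through all of fields 0..t.

    result = 0
    for t in range(THREADS):
        if t == 0:
            incoming = carry_in
        else:
            incoming = carry[t - 1] | (bubble[t - 1] & carry_in)
        result |= ((sums[t] + incoming) & MASK) << (FIELD_BITS * t)

    carry_out = carry[-1] | (bubble[-1] & carry_in)
    return result, carry_out
```

The sketch returns the 4096-bit sum together with the carry out of the block, which a caller would pass as `carry_in` when processing the next block. In a real kernel the inner `for t` loop would be the threads themselves running in lockstep, with a work-group barrier replacing each snapshot.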