Jun 25, 2014, 6:06:33 PM (5 years ago)

Little clean-ups

1 edited


  • docs/Working/re/avx2.tex

    r3893 r3897  
    61 \caption{Instruction Reduction}\label{fig:AVXInstrReduction}
     61\caption{AVX2/SSE2 Instruction Reduction}\label{fig:AVXInstrReduction}
    114 \caption{AVX Speedup}\label{fig:AVXSpeedup}
     114\caption{AVX2/SSE2 Speedup}\label{fig:AVXSpeedup}
    133133demonstrating very good scalability of the bitwise data-parallel approach.
    134134Significantly, the @ regular expression is matched at 0.63 cycles/byte
    135 using our AVX2 implementation indicating a significant reduction
     135using our AVX2 implementation indicating a considerable reduction
    136136in the overhead cost of the Parabix transform.
    138 Table \ref{Xfactors} shows the final performance results
    139 showing the speedup factors achieved by the Parabix/AVX2 implementation
    140 vs nrgrep and gre2p.  We have also added comparison with GNU grep
    141 (version 2.16),
    142 as it is well known and sometimes used as a basis for comparisons.
    145140\begin{tabular}{|c|c|c|c|} \hline
    146 \multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream grep Speedup} \\ \cline{2-4}
     141\multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream/AVX2 grep Speedup} \\ \cline{2-4}
    147142& vs. nrgrep & vs. gre2p & vs. GNU grep -e\\ \hline \hline
    148143At & 3.5X & 34X & 1.6X\\ \hline
    149144Date & 0.76X & 13X & 48X\\ \hline
    150145Email & 9.5X & 28X & 12X\\ \hline
    151 URIorEmail & 6.6X & 27X & 518X\\ \hline
     146URI & 6.6X & 27X & 518X\\ \hline
    152147Hex & 8.1X & 105X & 267X\\ \hline
    153148StarHeight & 1.9X & 7.6X & 97X\\ \hline
    156 \caption{Speedups Obtained}\label{Xfactors}
     151\caption{Bitsream Speedup vs. Comparators}\label{Xfactors}
     159ylabel=Running Time (ms per megabyte),
     161tick label style={font=\tiny},
     162enlarge x limits=0.15,
     163%enlarge y limits={0.15, upper},
     165legend style={at={(0.5,-0.15)},
     166anchor=north,legend columns=-1},
     168bar width=7pt,
     169cycle list = {black,black!70,black!40,black!10}
     172file {data/ssetime.dat};
     174file {data/avxtime.dat};
     176file {data/gputime.dat};
     182\caption{Running Time}\label{fig:SSE-AVX-GPU}
     187Table \ref{Xfactors} shows the final performance results
     188showing the speedup factors achieved by the bitstreams/AVX2 implementation
     189vs nrgrep and gre2p.  We have also added comparison with GNU grep
     190(version 2.16),
     191as it is well known and sometimes used as a basis for comparisons.
     194\section{GPU Implementation}\label{sec:GPU}
     196To further assess the scalability of our regular expression matching
     197using bit-parallel data streams, we implemented a GPU version
     198in OpenCL.   
     199We arranged for 64 work groups each having 64 threads.
     200The size of work group and number of work groups is chosen
     201to provide the best occupancy as calculated by the AMD App Profiler.
     202Input files are divided in data parallel fashion among
     203the 64 work groups.  Each work group carries out the regular
     204expression matching operations 4096 bytes at a time using SIMT
     205processing.   Although the GPU
     206does not directly support the mask and spread operations required
     207by our long-stream addition model,
     208we are able to simulate them using shared memory.
     209Each thread maintains
     210its own carry and bubble values in shared memory and performs
     211synchronized updates with the other threads using a six-step
     212parallel-prefix style process.  Others have implemented
     213long-stream addition on the GPU using similar techniques,
     214as noted previously.
     216We performed our test on an AMD Radeon HD A10-6800K APU machine.
     217On the AMD Fusion systems, the input buffer is allocated in
     218pinned memory to take advantage of the zero-copy memory regions
     219where data can be read directly into this region by the CPU
     220and also accessed by the GPU for further processing. Therefore,
     221the expensive data transferring time that is needed by traditional
     222discrete GPUs is hidden and we compare only the kernel execution
     223time with our SSE2 and AVX implementations as shown in Figure
     224\ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
     225improvement over SSE version and up to 40\% performance
     226improvement over AVX version.   However, because of
     227implementation complexities of the triply-nested while loop for
     228the StarHeight expression, it has been omitted.
     230Although we intended to process
     23164 work groups with 4096 bytes each at a time rather than 128 bytes
     232at a time on SSE or 256 bytes at a time on AVX, the performance
     233improvement is less than 60\%. The first reason is hardware
     234limitations. Our kernel occupancy is limited by register usage
     235and not all the work groups can be scheduled at the same time.
     236The second reason is that the long-stream addition implemented
     237on GPU is more expensive than the implementations on SSE or AVX.
     238Another important reason is the control flow. When a possible
     239match is found in one thread, the rest of the threads in the
     240same work group have to execute the same instructions for
     241further processing rather than jump to the next block with a
     242simple IF test. Therefore, the performance of different
     243regular expressions is dependent on the number of
     244long-stream addition operations and the total number of matches
     245of a given input.   Perhaps surprisingly, the overhead of the Parabix
     246transformation was not a dominant factor, coming in at 0.08 ms/MB.
Note: See TracChangeset for help on using the changeset viewer.