Changeset 3897


Timestamp:
Jun 25, 2014, 6:06:33 PM (5 years ago)
Author:
cameron
Message:

Little clean-ups

Location:
docs/Working/re
Files:
4 edited

  • docs/Working/re/analysis.tex

    r3889 r3897  
    225225that for processors with a 256-bit word, we expect a 16x improvement, and for
    226226processors with a 512-bit word, we expect a 32x improvement.
    227 In practice, there is some reduction in these improvement factors due to
    228 the instruction sets of larger-width processors not yet being as versatile as
    229 those of 128-bit processors.  (For example, full-width addition is not yet
    230 supported.)
    231 
    232 
    233 
    234 
    235 
     227In practice, we expect some reduction in these improvement ratios
     228for various reasons, including the need for while-loops to
     229implement nested Kleene closures and processor limitations
     230such as memory bandwidth.
     231
     232
     233
     234
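The expected-improvement figures in the hunk above scale linearly with SIMD word width; as a hedged restatement (an inference from the two quoted data points, not a formula taken from the paper):

% Assumption: linear scaling in word width w (bits), fitted to the quoted 16x and 32x figures.
\begin{equation*}
  \text{expected improvement}(w) \;\approx\; \frac{w}{16},
  \qquad \frac{256}{16} = 16\times, \qquad \frac{512}{16} = 32\times .
\end{equation*}

The added text then notes why measured ratios fall short of this ideal (while-loops for nested Kleene closures, memory bandwidth limits).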
  • docs/Working/re/avx2.tex

    r3893 r3897  
    5959\end{tikzpicture}
    6060\end{center}
    61 \caption{Instruction Reduction}\label{fig:AVXInstrReduction}
     61\caption{AVX2/SSE2 Instruction Reduction}\label{fig:AVXInstrReduction}
    6262\end{figure}
    6363
     
    112112\end{tikzpicture}
    113113\end{center}
    114 \caption{AVX Speedup}\label{fig:AVXSpeedup}
     114\caption{AVX2/SSE2 Speedup}\label{fig:AVXSpeedup}
    115115\end{figure}
    116116
     
    133133demonstrating very good scalability of the bitwise data-parallel approach.
    134134Significantly, the @ regular expression is matched at 0.63 cycles/byte
    135 using our AVX2 implementation indicating a significant reduction
     135using our AVX2 implementation, indicating a considerable reduction
    136136in the overhead cost of the Parabix transform.
    137137
    138 Table \ref{Xfactors} shows the final performance results
    139 showing the speedup factors achieved by the Parabix/AVX2 implementation
    140 vs nrgrep and gre2p.  We have also added comparison with GNU grep
    141 (version 2.16),
    142 as it is well known and sometimes used as a basis for comparisons.
    143138\begin{table}
    144139\begin{center}
    145140\begin{tabular}{|c|c|c|c|} \hline
    146 \multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream grep Speedup} \\ \cline{2-4}
     141\multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream/AVX2 grep Speedup} \\ \cline{2-4}
    147142& vs. nrgrep & vs. gre2p & vs. GNU grep -e\\ \hline \hline
    148143At & 3.5X & 34X & 1.6X\\ \hline
    149144Date & 0.76X & 13X & 48X\\ \hline
    150145Email & 9.5X & 28X & 12X\\ \hline
    151 URIorEmail & 6.6X & 27X & 518X\\ \hline
     146URI & 6.6X & 27X & 518X\\ \hline
    152147Hex & 8.1X & 105X & 267X\\ \hline
    153148StarHeight & 1.9X & 7.6X & 97X\\ \hline
    154149\end{tabular}
    155150\end{center}
    156 \caption{Speedups Obtained}\label{Xfactors}
      151\caption{Bitstream Speedup vs. Comparators}\label{Xfactors}
    157152\end{table}
     153
     154\begin{figure}
     155\begin{center}
     156\begin{tikzpicture}
     157\begin{axis}[
     158xtick=data,
     159ylabel=Running Time (ms per megabyte),
     160xticklabels={@,Date,Email,URI,Hex,StarHeight},
     161tick label style={font=\tiny},
     162enlarge x limits=0.15,
     163%enlarge y limits={0.15, upper},
     164ymin=0,
     165legend style={at={(0.5,-0.15)},
     166anchor=north,legend columns=-1},
     167ybar,
     168bar width=7pt,
     169cycle list = {black,black!70,black!40,black!10}
     170]
     171\addplot+[]
     172file {data/ssetime.dat};
     173\addplot+[fill,text=black]
     174file {data/avxtime.dat};
      175\addplot+[fill,text=black]
     176file {data/gputime.dat};
     177
      178\legend{SSE2,AVX2,GPU}
     179\end{axis}
     180\end{tikzpicture}
     181\end{center}
     182\caption{Running Time}\label{fig:SSE-AVX-GPU}
     183\end{figure}
     184
     185
     186
      187Table \ref{Xfactors} shows the final performance results,
      188giving the speedup factors achieved by the bitstreams/AVX2 implementation
      189vs.\ nrgrep and gre2p.  We have also added a comparison with GNU grep
      190(version 2.16),
      191as it is well known and sometimes used as a basis for comparison.
     192
     193
     194\section{GPU Implementation}\label{sec:GPU}
     195
     196To further assess the scalability of our regular expression matching
     197using bit-parallel data streams, we implemented a GPU version
     198in OpenCL.   
      199We arranged for 64 work groups, each having 64 threads.
      200The work group size and the number of work groups are chosen
      201to provide the best occupancy as calculated by the AMD APP Profiler.
      202Input files are divided in a data-parallel fashion among
     203the 64 work groups.  Each work group carries out the regular
     204expression matching operations 4096 bytes at a time using SIMT
     205processing.   Although the GPU
     206does not directly support the mask and spread operations required
     207by our long-stream addition model,
     208we are able to simulate them using shared memory.
     209Each thread maintains
     210its own carry and bubble values in shared memory and performs
     211synchronized updates with the other threads using a six-step
     212parallel-prefix style process.  Others have implemented
     213long-stream addition on the GPU using similar techniques,
     214as noted previously.
     215
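The "six-step parallel-prefix style process" mentioned above corresponds to log2(64) = 6 rounds across the 64 threads of a work group. Below is a minimal OpenCL sketch of that round structure in local (shared) memory; the kernel name, its argument, and the plain sum standing in for the actual carry/bubble combining rule are illustrative assumptions, not the paper's code.

// Hypothetical sketch: a Hillis-Steele style parallel prefix over the 64
// work-items of a work group, showing the six-step (log2 64) pattern described
// above.  The real implementation combines carry and bubble values; a plain
// sum stands in for that combining rule here.
__kernel void prefix_sketch(__global uint *data) {
    __local uint buf[64];
    uint lid = get_local_id(0);
    buf[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint step = 1; step < 64; step <<= 1) {       // 1,2,4,8,16,32: six steps
        uint incoming = (lid >= step) ? buf[lid - step] : 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        buf[lid] += incoming;    // replace with the carry/bubble combine in practice
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    data[get_global_id(0)] = buf[lid];
}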
     216We performed our test on an AMD Radeon HD A10-6800K APU machine.
      217On the AMD Fusion systems, the input buffer is allocated in
      218pinned memory to take advantage of the zero-copy memory region,
      219into which data can be read directly by the CPU
      220and which can also be accessed by the GPU for further processing. Therefore,
      221the expensive data transfer time needed by traditional
      222discrete GPUs is avoided and we compare only the kernel execution
      223time with our SSE2 and AVX2 implementations as shown in Figure
      224\ref{fig:SSE-AVX-GPU}. The GPU version gives up to a 55\% performance
      225improvement over the SSE2 version and up to a 40\% performance
      226improvement over the AVX2 version.   However, because of
      227the implementation complexity of the triply-nested while loop for
      228the StarHeight expression, that expression has been omitted from the GPU results.
     229
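A host-side sketch of the zero-copy allocation described above, using the standard OpenCL C API; the function name, its parameters, and the omission of error handling are assumptions, since the paper's host code is not shown.

#include <CL/cl.h>

/* Hypothetical sketch: allocate the input buffer in pinned, zero-copy memory
   so the CPU can read file data directly into a region the GPU also accesses.
   Error handling is omitted; context and queue are assumed to exist already. */
static void *alloc_zero_copy_input(cl_context context, cl_command_queue queue,
                                   size_t buffer_size, cl_mem *inbuf_out)
{
    cl_int err;
    cl_mem inbuf = clCreateBuffer(context,
                                  CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                  buffer_size, NULL, &err);
    /* Mapping yields a CPU pointer into the pinned region; no separate
       host-to-device transfer is issued before kernel launch. */
    void *host_ptr = clEnqueueMapBuffer(queue, inbuf, CL_TRUE, CL_MAP_WRITE,
                                        0, buffer_size, 0, NULL, NULL, &err);
    *inbuf_out = inbuf;
    return host_ptr;  /* caller fills this region, then unmaps it with
                         clEnqueueUnmapMemObject before running the kernel */
}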
      230Although we arranged to process
      23164 work groups of 4096 bytes each at a time, rather than 128 bytes
      232at a time with SSE2 or 256 bytes at a time with AVX2, the performance
      233improvement is less than 60\%. The first reason is hardware
     234limitations. Our kernel occupancy is limited by register usage
     235and not all the work groups can be scheduled at the same time.
      236The second reason is that the long-stream addition implemented
      237on the GPU is more expensive than the implementations on SSE2 or AVX2.
     238Another important reason is the control flow. When a possible
     239match is found in one thread, the rest of the threads in the
     240same work group have to execute the same instructions for
     241further processing rather than jump to the next block with a
     242simple IF test. Therefore, the performance of different
     243regular expressions is dependent on the number of
     244long-stream addition operations and the total number of matches
     245of a given input.   Perhaps surprisingly, the overhead of the Parabix
     246transformation was not a dominant factor, coming in at 0.08 ms/MB.
     247
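The control-flow point in the preceding paragraph is the usual SIMT divergence effect; a small illustrative OpenCL kernel follows (the buffer names and the use of popcount are assumptions, not the paper's code): when only some lanes of a work group take the match-processing path, the whole group still steps through those instructions.

// Hypothetical illustration of SIMT divergence: lanes whose match_bits word is
// zero are masked off, but the work group as a whole still executes the
// match-processing path whenever any lane takes it.
__kernel void divergence_sketch(__global const uint *match_bits,
                                __global uint *match_counts) {
    uint gid = get_global_id(0);
    uint m = match_bits[gid];
    uint count = 0;
    if (m != 0) {
        count = popcount(m);   // stands in for the per-match processing work
    }
    match_counts[gid] = count;
}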
  • docs/Working/re/pact051-cameron.tex

    r3896 r3897  
    643643
    644644
    645 
    646 \section{GPU Implementation}\label{sec:GPU}
    647 
    648 To further assess the scalability of our regular expression matching
    649 using bit-parallel data streams, we implemented a GPU version
    650 in OpenCL.   
    651 We arranged for 64 work groups each having 64 threads.
    652 The size of work group and number of work groups is chosen
    653 to provide the best occupancy as calculated by the AMD App Profiler.
    654 Input files are divided in data parallel fashion among
    655 the 64 work groups.  Each work group carries out the regular
    656 expression matching operations 4096 bytes at a time using SIMT
    657 processing.   Although the GPU
    658 does not directly support the mask and spread operations required
    659 by our long-stream addition model,
    660 we are able to simulate them using shared memory.
    661 Each thread maintains
    662 its own carry and bubble values in shared memory and performs
    663 synchronized updates with the other threads using a six-step
    664 parallel-prefix style process.  Others have implemented
    665 long-stream addition on the GPU using similar techniques,
    666 as noted previously.
    667 
    668 We performed our test on an AMD Radeon HD A10-6800K APU machine.
    669 On the AMD Fusion systems, the input buffer is allocated in
    670 pinned memory to take advantage of the zero-copy memory regions
    671 where data can be read directly into this region by the CPU
    672 and also accessed by the GPU for further processing. Therefore,
    673 the expensive data transferring time that is needed by traditional
    674 discrete GPUs is hidden and we compare only the kernel execution
    675 time with our SSE2 and AVX implementations as shown in Figure
    676 \ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
    677 improvement over SSE version and up to 40\% performance
    678 improvement over AVX version.   However, because of
    679 implementation complexities of the triply-nested while loop for
    680 the StarHeight expression, it has been omitted.
    681 
    682 Although we intended to process
    683 64 work groups with 4096 bytes each at a time rather than 128 bytes
    684 at a time on SSE or 256 bytes at a time on AVX, the performance
    685 improvement is less than 60\%. The first reason is hardware
    686 limitations. Our kernel occupancy is limited by register usage
    687 and not all the work groups can be scheduled at the same time.
    688 The second reason is that the long-stream addition implemented
    689 on GPU is more expensive than the implementations on SSE or AVX.
    690 Another important reason is the control flow. When a possible
    691 match is found in one thread, the rest of the threads in the
    692 same work group have to execute the same instructions for
    693 further processing rather than jump to the next block with a
    694 simple IF test. Therefore, the performance of different
    695 regular expressions is dependent on the number of
    696 long-stream addition operations and the total number of matches
    697 of a given input.   Perhaps surprisingly, the overhead of the Parabix
    698 transformation was not a dominant factor, coming in at 0.08 ms/MB.
    699 
    700 
    701 \begin{figure}
    702 \begin{center}
    703 \begin{tikzpicture}
    704 \begin{axis}[
    705 xtick=data,
    706 ylabel=Running Time (ms per megabyte),
    707 xticklabels={@,Date,Email,URI,Hex,StarHeight},
    708 tick label style={font=\tiny},
    709 enlarge x limits=0.15,
    710 %enlarge y limits={0.15, upper},
    711 ymin=0,
    712 legend style={at={(0.5,-0.15)},
    713 anchor=north,legend columns=-1},
    714 ybar,
    715 bar width=7pt,
    716 cycle list = {black,black!70,black!40,black!10}
    717 ]
    718 \addplot+[]
    719 file {data/ssetime.dat};
    720 \addplot+[fill,text=black]
    721 file {data/avxtime.dat};
    722 \addplot+[fill,,text=black]
    723 file {data/gputime.dat};
    724 
    725 \legend{SSE2,AVX2,GPU,Annot}
    726 \end{axis}
    727 \end{tikzpicture}
    728 \end{center}
    729 \caption{Running Time}\label{fig:SSE-AVX-GPU}
    730 \end{figure}
    731 
    732 
    733 
    734 
    735 
    736 
    737 
    738 
    739645\input{conclusion}
    740646