Changeset 3493 for docs


Ignore:
Timestamp:
Sep 15, 2013, 12:23:19 PM (6 years ago)
Author:
cameron
Message:

Basic GPU description.

Location:
docs/Working/re
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/ppopp-re.tex

    r3491 r3493  
    405405
    406406
    407 Figure \ref{fig:matchstar} illustrates the MatchStar method.  The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data. 
    408 
    409 It is important to note that our bitstreams are shown in natural left-to-right order reflecting the
     407Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
     408it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
    410409conventional presentation of our character data input.   However, this reverses the normal
    411410order of presentation when considering bitstreams as numeric values.  The key point here is
    412411that when we perform bitstream addition, we will show bit movement from left-to-right.
    413 That $\verb:111.: + \verb:1...: = \verb:...1:$
     412For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
     413
     414The first row of the figure is the input data,
     415the second and third rows are the input bitstreams: the initial marker position bitstream and the
     416character class bitstream for digits derived from input data. 
    414417
    415418In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
     
    424427of characters in class $C$ from each position in $M$. 
    425428
    426 
     429\paragraph*{Compilation Algorithm.}
    427430
    428431
     
    812815\begin{center}
    813816\begin{verbatim}
    814 void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {
    815   bitblock256_t all_ones = simd256<1>::constant<1>();
    816   bitblock256_t gen = simd_and(x, y);
    817   bitblock256_t prop = simd_xor(x, y);
    818   bitblock256_t partial_sum = simd256<64>::add(x, y);
    819   bitblock256_t carry = simd_or(gen, simd_andc(prop, partial_sum));
    820   bitblock256_t bubble = simd256<64>::eq(partial_sum, all_ones);
     817void add_ci_co(bitblock_t x, bitblock_t y, carry_t carry_in, carry_t & carry_out, bitblock_t & sum) {
     818  bitblock_t all_ones = simd256<1>::constant<1>();
     819  bitblock_t gen = simd_and(x, y);
     820  bitblock_t prop = simd_xor(x, y);
     821  bitblock_t partial_sum = simd256<64>::add(x, y);
     822  bitblock_t carry = simd_or(gen, simd_andc(prop, partial_sum));
     823  bitblock_t bubble = simd256<64>::eq(partial_sum, all_ones);
    821824  uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in);
    822825  uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
     
    835838\end{figure*}
    836839
     840
    837841\section{GPU Implementation}\label{sec:GPU}
     842
     843To further assess the scalability of our regular expression matching
     844using bit-parallel data streams, we implemented a GPGPU version
     845in OpenCL.   We arranged for 64 work groups each having 64
     846threads.  Input files are divided in data parallel fashion among
     847the 64 work groups.  Each work group carries out the regular
     848expression matching operations 4096 bytes at a time using SIMT
     849processing.  Figure \ref{fig:GPUadd} shows our implementation
     850of long-stream addition on the GPU.  Each thread maintains
     851its own carry and bubble values and performs synchronized
     852updates with the other threads using a six-step parallel-prefix
     853style process.
     854
     855
     856
     857\begin{figure*}[tbh]
     858\begin{center}
     859\begin{verbatim}
     860inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry, _
     861                    _local BitBlock *bubble, BitBlock *group_carry, const int carryno){
     862        BitBlock carry_mask;
     863        BitBlock bubble_mask;
     864
     865        BitBlock partial_sum = a+b;
     866        BitBlock gen = a&b;
     867        BitBlock prop = a^b;
     868        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
     869        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);
     870       
     871        barrier(CLK_LOCAL_MEM_FENCE);
     872        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
     873                carry[idx] = carry[idx]|carry[idx^offset];
     874                bubble[idx] = bubble[idx]|bubble[idx^offset];
     875                barrier(CLK_LOCAL_MEM_FENCE);
     876        }
     877       
     878        carry_mask = (carry[0]<<1)|group_carry[carryno];
     879        bubble_mask = bubble[0];
     880       
     881        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
     882        BitBlock inc = s | (s-carry_mask);
     883        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
     884        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
     885        return rslt;
     886}
     887\end{verbatim}
     888
     889Figure \ref{fig:SSE-AVX-GPU} compares the performance of
     890our SSE2, AVX and GPU implementations.   
     891
     892\end{center}
     893\caption{OpenCL 4096-bit Addition}
     894\label{fig:GPUadd}
     895\end{figure*}
     896
     897
     898
    838899\begin{figure}
    839900\begin{center}
     
    861922\end{tikzpicture}
    862923\end{center}
    863 \caption{Running Time}
     924\caption{Running Time}\label{fig:SSE-AVX-GPU}
    864925\end{figure}
    865926
Note: See TracChangeset for help on using the changeset viewer.