# Changeset 3493 for docs

Timestamp:
Sep 15, 2013, 12:23:19 PM
Message:

Basic GPU description.

Location:
docs/Working/re
Files:
2 edited

Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure, it is important to note that our bitstreams are shown in natural left-to-right order, reflecting the conventional presentation of our character data input. However, this reverses the normal order of presentation when considering bitstreams as numeric values. The key point here is that when we perform bitstream addition, we will show bit movement from left to right. For example, $\verb:111.: + \verb:1...: = \verb:...1:$. The first row of the figure is the input data; the second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream for digits derived from the input data. In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic. Next, the temporary marker bitstream is added to the character class bitstream, so that carries propagate through runs of characters in class $C$ from each position in $M$.
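As an illustration of the operation described above, MatchStar can be sketched on plain 64-bit words. This is a hypothetical standalone sketch, not the library code; the name `match_star` and the 64-bit word size are our own choices.

```c
#include <stdint.h>

/* Hypothetical 64-bit sketch of MatchStar as described above:
 * markers in M advance through runs of class-C characters via
 * carry propagation: MatchStar(M, C) = (((M & C) + C) ^ C) | M.
 * Bit i corresponds to input position i. */
static uint64_t match_star(uint64_t M, uint64_t C) {
    uint64_t T0 = M & C;   /* T0: drop markers not on a class-C position  */
    uint64_t T1 = T0 + C;  /* addition: carries run through runs of C     */
    uint64_t T2 = T1 ^ C;  /* mark every position each carry moved across */
    return T2 | M;         /* restore the markers that could not advance  */
}
```

For instance, with digits at positions 0 through 2 (`C = 0x7`) and a marker at position 0 (`M = 0x1`), the result `0xF` marks positions 0 through 3: the marker matches zero or more digits and may stop anywhere in or just past the run.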
\paragraph*{Compilation Algorithm.}

\begin{center}
\begin{verbatim}
void add_ci_co(bitblock_t x, bitblock_t y, carry_t carry_in,
               carry_t & carry_out, bitblock_t & sum) {
  bitblock_t all_ones = simd256<1>::constant<1>();
  bitblock_t gen = simd_and(x, y);
  bitblock_t prop = simd_xor(x, y);
  bitblock_t partial_sum = simd256<64>::add(x, y);
  bitblock_t carry = simd_or(gen, simd_andc(prop, partial_sum));
  bitblock_t bubble = simd256<64>::eq(partial_sum, all_ones);
  uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2
                      + convert(carry_in);
  uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
\end{verbatim}
\end{center}
\end{figure*}

\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching using bit-parallel data streams, we implemented a GPGPU version in OpenCL. We arranged for 64 work groups, each having 64 threads. Input files are divided in a data-parallel fashion among the 64 work groups. Each work group carries out the regular expression matching operations 4096 bytes at a time using SIMT processing. Figure \ref{fig:GPUadd} shows our implementation of long-stream addition on the GPU. Each thread maintains its own carry and bubble values and performs synchronized updates with the other threads using a six-step parallel-prefix style process.
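To make the carry-mask and bubble-mask arithmetic above concrete, the long-stream addition technique can be sketched in scalar C using four 64-bit limbs in place of the 256-bit SIMD registers. This is an illustrative sketch under our own naming (`long_add_256`, the mask layout with bit $i{+}1$ holding the carry out of limb $i$), not the library's implementation.

```c
#include <stdint.h>

/* Hypothetical scalar sketch of long-stream addition: a 256-bit add
 * built from four independent 64-bit limb additions, with carries
 * resolved afterwards by mask arithmetic. Limb 0 is least significant. */
static void long_add_256(const uint64_t x[4], const uint64_t y[4],
                         unsigned carry_in, uint64_t sum[4],
                         unsigned *carry_out) {
    uint64_t partial[4];
    unsigned carry_mask = carry_in;  /* bit i+1 = carry out of limb i      */
    unsigned bubble_mask = 0;        /* bit i set if limb i sums to all 1s */
    for (int i = 0; i < 4; i++) {
        partial[i] = x[i] + y[i];
        uint64_t gen  = x[i] & y[i];            /* generate a carry        */
        uint64_t prop = x[i] ^ y[i];            /* propagate a carry       */
        uint64_t carry = gen | (prop & ~partial[i]);
        carry_mask  |= (unsigned)(carry >> 63) << (i + 1);
        bubble_mask |= (unsigned)(partial[i] == UINT64_MAX) << i;
    }
    /* Scan carries through runs of all-ones "bubble" limbs in one step. */
    unsigned s   = (carry_mask + bubble_mask) & ~bubble_mask;
    unsigned inc = s | (s - carry_mask);        /* limbs to increment      */
    for (int i = 0; i < 4; i++)
        sum[i] = partial[i] + ((inc >> i) & 1);
    /* Overall carry: a direct carry out of limb 3, or limb 3 was an
     * incremented bubble (all ones + 1 overflows). */
    *carry_out = ((carry_mask >> 4) | ((bubble_mask & inc) >> 3)) & 1;
}
```

The bubble mask is what lets a single mask addition stand in for carry propagation across any run of all-ones limbs, so the latency of the 256-bit add stays independent of how far a carry must travel.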
\begin{figure*}[tbh]
\begin{center}
\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b,
                    __local BitBlock *carry, __local BitBlock *bubble,
                    BitBlock *group_carry, const int carryno){
  BitBlock carry_mask;
  BitBlock bubble_mask;

  BitBlock partial_sum = a+b;
  BitBlock gen = a&b;
  BitBlock prop = a^b;
  carry[idx] = ((gen | (prop & ~partial_sum)) & CARRY_BIT_MASK)
               >> (WORK_GROUP_SIZE-1-idx);
  bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);

  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
    carry[idx] = carry[idx]|carry[idx^offset];
    bubble[idx] = bubble[idx]|bubble[idx^offset];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  carry_mask = (carry[0]<<1)|group_carry[carryno];
  bubble_mask = bubble[0];
  BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
  BitBlock inc = s | (s-carry_mask);
  BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
  group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
  return rslt;
}
\end{verbatim}
\end{center}
\caption{OpenCL 4096-bit Addition}
\label{fig:GPUadd}
\end{figure*}

Figure \ref{fig:SSE-AVX-GPU} compares the performance of our SSE2, AVX and GPU implementations.

\begin{figure}
\begin{center}
% (tikzpicture plot omitted in this changeset)
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}
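The six-step exchange in the kernel above is a standard butterfly (parallel-prefix style) OR reduction across the 64 threads of a work group. The following host-side C simulation, under our own naming (`or_reduce`), illustrates what the threads collectively compute between barriers.

```c
#include <stdint.h>
#include <string.h>

#define WORK_GROUP_SIZE 64

/* Hypothetical host-side simulation of the kernel's butterfly OR
 * reduction: after log2(64) = 6 rounds, every element holds the OR
 * of all 64 inputs, just as every thread's carry[idx] does. */
static uint64_t or_reduce(uint64_t v[WORK_GROUP_SIZE]) {
    for (int offset = WORK_GROUP_SIZE / 2; offset > 0; offset >>= 1) {
        uint64_t next[WORK_GROUP_SIZE];
        for (int idx = 0; idx < WORK_GROUP_SIZE; idx++)
            next[idx] = v[idx] | v[idx ^ offset]; /* all "threads" in lockstep */
        memcpy(v, next, sizeof next);             /* plays the barrier's role  */
    }
    return v[0];
}
```

Because every element, not just element 0, ends up holding the full OR, each GPU thread can read the reduced carry and bubble values locally without a broadcast step.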