 Timestamp:
 Sep 15, 2013, 12:23:19 PM (6 years ago)
 Location:
 docs/Working/re
 Files:

 2 edited
Legend:
 Unmodified
 Added
 Removed

docs/Working/re/ppoppre.tex
r3491 r3493 405 405 406 406 407 Figure \ref{fig:matchstar} illustrates the MatchStar method. The second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream derived from the source data. 408 409 It is important to note that our bitstreams are shown in natural lefttoright order reflecting the 407 Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure, 408 it is important to note that our bitstreams are shown in natural lefttoright order reflecting the 410 409 conventional presentation of our character data input. However, this reverses the normal 411 410 order of presentation when considering bitstreams as numeric values. The key point here is 412 411 that when we perform bitstream addition, we will show bit movement from lefttoright. 413 That $\verb:111.: + \verb:1...: = \verb:...1:$ 412 For example, $\verb:111.: + \verb:1...: = \verb:...1:$. 413 414 The first row of the figure is the input data, 415 the second and third rows are the input bitstreams: the initial marker position bitstream and the 416 character class bitstream for digits derived from input data. 414 417 415 418 In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that aren't character class positions using bitwise logic. Next, the temporary marker bitstream is added to the character class bitstream. … … 424 427 of characters in class $C$ from each position in $M$. 425 428 426 429 \paragraph*{Compilation Algorithm.} 427 430 428 431 … … 812 815 \begin{center} 813 816 \begin{verbatim} 814 void add_ci_co(bitblock 256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {815 bitblock 256_t all_ones = simd256<1>::constant<1>();816 bitblock 256_t gen = simd_and(x, y);817 bitblock 256_t prop = simd_xor(x, y);818 bitblock 256_t partial_sum = simd256<64>::add(x, y);819 bitblock 256_t carry = simd_or(gen, simd_andc(prop, partial_sum));820 bitblock 256_t bubble = simd256<64>::eq(partial_sum, all_ones);817 void add_ci_co(bitblock_t x, bitblock_t y, carry_t carry_in, carry_t & carry_out, bitblock_t & sum) { 818 bitblock_t all_ones = simd256<1>::constant<1>(); 819 bitblock_t gen = simd_and(x, y); 820 bitblock_t prop = simd_xor(x, y); 821 bitblock_t partial_sum = simd256<64>::add(x, y); 822 bitblock_t carry = simd_or(gen, simd_andc(prop, partial_sum)); 823 bitblock_t bubble = simd256<64>::eq(partial_sum, all_ones); 821 824 uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in); 822 825 uint64_t bubble_mask = hsimd256<64>::signmask(bubble); … … 835 838 \end{figure*} 836 839 840 837 841 \section{GPU Implementation}\label{sec:GPU} 842 843 To further assess the scalability of our regular expression matching 844 using bitparallel data streams, we implemented a GPGPU version 845 in OpenCL. We arranged for 64 work groups each having 64 846 threads. Input files are divided in data parallel fashion among 847 the 64 work groups. Each work group carries out the regular 848 expression matching operations 4096 bytes at a time using SIMT 849 processing. Figure \ref{fig:GPUadd} shows our implementation 850 of longstream addition on the GPU. Each thread maintains 851 its own carry and bubble values and performs synchronized 852 updates with the other threads using a sixstep parallelprefix 853 style process. 854 855 856 857 \begin{figure*}[tbh] 858 \begin{center} 859 \begin{verbatim} 860 inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry, _ 861 _local BitBlock *bubble, BitBlock *group_carry, const int carryno){ 862 BitBlock carry_mask; 863 BitBlock bubble_mask; 864 865 BitBlock partial_sum = a+b; 866 BitBlock gen = a&b; 867 BitBlock prop = a^b; 868 carry[idx] = ((gen  (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE1idx); 869 bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx); 870 871 barrier(CLK_LOCAL_MEM_FENCE); 872 for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){ 873 carry[idx] = carry[idx]carry[idx^offset]; 874 bubble[idx] = bubble[idx]bubble[idx^offset]; 875 barrier(CLK_LOCAL_MEM_FENCE); 876 } 877 878 carry_mask = (carry[0]<<1)group_carry[carryno]; 879 bubble_mask = bubble[0]; 880 881 BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask; 882 BitBlock inc = s  (scarry_mask); 883 BitBlock rslt = partial_sum + ((inc>>idx)&0x1); 884 group_carry[carryno] = (carry[0](bubble_mask & inc))>>63; 885 return rslt; 886 } 887 \end{verbatim} 888 889 Figure \ref{fig:SSEAVXGPU} compares the performance of 890 our SSE2, AVX and GPU implementations. 891 892 \end{center} 893 \caption{OpenCL 4096bit Addition} 894 \label{fig:GPUadd} 895 \end{figure*} 896 897 898 838 899 \begin{figure} 839 900 \begin{center} … … 861 922 \end{tikzpicture} 862 923 \end{center} 863 \caption{Running Time} 924 \caption{Running Time}\label{fig:SSEAVXGPU} 864 925 \end{figure} 865 926
Note: See TracChangeset
for help on using the changeset viewer.