Jan 4, 2009, 3:54:20 PM (11 years ago)

References, cleanups

1 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r254 r255  
    3232\authorinfo{Robert D. Cameron \and Dan Lin}
    3333           {School of Computing Science, Simon Fraser University}
    34            {\{cameron, lindanl\}@cs.sfu.ca}
     34           {\tt \{cameron, lindanl\}@cs.sfu.ca}
    159159design.  First we introduce a simple model and notation for
    160160SWAR operations in general and then present the four key
    161 features of an idealized architecture in support of parallel
    162 bit stream programming.
     161features of IDISA.
    190189the simultaneous calculation of individual field values in
    191190accord with the following equation.
    192 \[ r_i = F_n(a_i, b_i) \]
     192 r_i &=& F_n(a_i, b_i)
    194195For example, addition(\verb:add:), subtraction (\verb:sub:) and shift left
    200201\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
    201202\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
    202 \end{eqnarray}Throughout this paper, we focus on SWAR implementations using a {\em
    203 three-register model} involving two input registers
    204 and one output register, each of size $N=2^K$ bits.
    205 %\doublespace
    207 The SSE and Altivec instruction sets support
    208 each of the addition and subtraction operations in SWAR form
    209 for 8, 16 and 32-bit fields, while the SSE set also includes
    210 the 64-bit forms.  For example, \verb#simd<8>::add# in our
    211 nomenclature is provided by the operation \verb:paddb:
    212 on SSE and the operation \verb:vaddubm: on Altivec.
    213 The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
    214 \verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are
    215 more constrained, requiring that all field shifts by the same amount.
     205The Altivec instruction set includes each of these operations
     206for 8, 16 and 32-bit fields directly following the three-register
     207model.   The SSE set uses a two-register model with the result
     208being copied back to one of the input operands.   However, the
     209C language intrinsics commonly used to access these instructions
     210reflect the three-register model.  The SSE set extends these
     211operations to include operations on 64-bit fields, but
     212constrains the shift instructions, requiring that all field shifts by the same amount.
    217214Given these definitions and notation, we now present
    231228For the purpose of direct and efficient support for inductive
    232 doubling algorithms on bit streams, the provision of
     229doubling algorithms, the provision of
    233230a core set of operations at field widths of 2 and 4 as
    234231well as the more traditional field widths of 8, 16 and 32
    235 is critical for elegant and efficient expression of the
    236 algorithms.  In essence, inductive doubling algorithms
     232is key.  In essence, inductive doubling algorithms
    237233work by establishing some base property at either single
    238234or 2-bit fields.  Each iteration of the algorithm then
    305301operations.  Thus packing with conversion by masking off all
    306302but the low $n/2$ bits of each field may be
    307 be performed using the operation \verb#simd<n>::pack<l,l>#
     303be performed using the operation \verb#simd<n>::pack<l,l>#.
    310305The final facility of the inductive doubling architecture is
    311 a set of merging operations \verb#simd<n>::mergeh# and
    312 \verb#simd<n>::mergel# that produce $2n$-bit fields
     306a set of merging operations that produce $2n$-bit fields
    313307by concatenating corresponding $n$-bit fields from the
    314 operand registers.   The respective
     308operand registers.   The
    315309operations \verb#r=simd<n>::mergeh(a,b)# and
    364358a constant load operation \verb#simd::constant<n>(c)#
    365359loads the constant value $c$ into each $n$ bit field.
     360The pack and merge facilities of SSE will also be assumed.
    367362Reference architecture B (RefB) consists of a register-rich
    372367logical instruction \verb#simd<n>::rotl(a,b)#  rotates each field
    373368of $a$ by the rotation count in the corresponding field of $b$.
    374 A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
    375 operator implements the logic $a \wedge b \vee \neg a \wedge c$, patterned
     369A three-input \verb#simd<1>::if(a,b,c)# bitwise logical
     370operator implements the logic $r=a \wedge b \vee \neg a \wedge c$, patterned
    376371after the Altivec \verb:vec_sel: operation.  Finally,
    377372a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
    396391access.  This assumption makes for
    397392straightforward performance evaluation based on instruction
    398 count for straight-line computational kernels.  Furthermore,
    399 the assumption also eliminates artifacts due to possibly different
    400 latencies in reference and IDISA architectures.  Because
    401 the same assumption is made for reference and IDISA
     393count for straight-line computational kernels. 
     394Furthermore, the assumption also eliminates artifacts due to
     395possibly different latencies in reference and IDISA architectures. 
     396Because the same assumption is made for reference and IDISA
    402397architectures, determination of the additional circuit
    403398complexity due to IDISA features is unaffected by the
    408401In the remainder of this paper, then, IDISA-A and IDISA-B
    426419\section{Population Count}
    428 \begin{figure}[h]
    512505An inductive doubling algorithm of $n$ steps typically applies
    513506mask or shift operations at each step for each of the
    514 two operands being combined in the step.  If there is
    515 one such operation for each operand, and the combination
    516 can be implemented in a single SWAR operation, then a total
     507two operands being combined in the step.  In general,
     508the mask constants shown in Figure \ref{HD-pop} recur; these
     509are termed ``magic masks'' by Knuth \cite{v4pf1a}.
     510If the algorithm employs a single operation at each step, then a total
    517511of $3n$ operations are the required in a RefB implementation,
    518512and possibly $4n$ for a RefA implementation including the
    544538that this is optimal for any three-register instruction set model.
    546 \begin{figure}[tbh]
    548 \includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
    549 \caption{Input/Output Model for Serial to Parallel Transposition}
     542\includegraphics[width=87mm, trim= 40 50 0 50]{S2P_IO.pdf}
     543\caption{Serial to Parallel Transposition}
    635 \begin{figure}[tbh]
    638 even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
    639 odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
     633odd ={1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
    640634mask = simd<8>::constant(0xAA);
    641635t0 = simd<8>::permute(s0, s1, even);
    687681dividing the bitpair streams into bit streams in the third stage.
    689 \begin{figure}[tbh]
    690 \small
    691 \begin{verbatim}
    692 p0 = simd<8>::pack<h,h>(s0, s1);
    693 p1 = simd<8>::pack<l,l>(s0, s1);
    694 \end{verbatim}
    695 \caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
    696 \label{halvingstep}
    697 \end{figure}
    699683Figure \ref{halvingstep} shows one step in stage 1 of the inductive
    700684halving algorithm, comprising just two IDISA-A operations.
    709693of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
    710694low pair of each nybble.  Four applications of this step complete stage 2.
     699p0 = simd<8>::pack<h,h>(s0, s1);
     700p1 = simd<8>::pack<l,l>(s0, s1);
     702\caption{Step in Inductive Halving Algorithm Stage 1}
    712707Stage 3 similarly uses four applications of a step that uses a
    866 \begin{figure*}[t]
    955950left instruction.  Right justification by shifting an $n$ bit field
    956951$i$ positions to the right is equivalent to a left rotate of $n-i$
    957 positions.  These rotation amounts are computed by the operation
    958 \verb#rj=simd<8>::sub<x,l>(simd<8>::constant(8), cts_4)# as shown in row 5,
     952positions.  Given a register value \verb:c8: preloaded with
     953the value 8 in each 8-bit field, the right rotation
     954amounts are computed by the operation
     955\verb#rj=simd<8>::sub<x,l>(c8, cts_4)# producing values shown in row 5,
    959956except that don't care fields (which won't be subsequently used)
    960957are marked \verb:XX:. 
    962959The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
    963 as shown in row 6, and are combined with the right shift amounts
    964 by the selection operation  \verb#rot_8=simd_if(simd<16>::constant(0xFF00), rj, lj)#
     960producing the values shown in row 6, and are then combined with the right shift amounts
     961by the selection operation  \verb#rot_8=simd_if(mask0xFF00, rj, lj)#
    965962as shown in row 7.  Using these computed values, the inductive step
    966963is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
    10081005% \end{figure}
    1010 \begin{figure}[h]
    10151012y = simd<8>::xor<h,l>(y, y);
    10161013y = simd<16>::xor<h,l>(y, y);
     1014y = simd<32>::xor<h,l>(y, y);
    10241022codes such as the various Hamming codes for detecting
    10251023and correcting numbers of bit errors dependent on the
    1026 number of parity bits added.  Whereas
     1024number of parity bits added. 
    10281025Figure \ref{ID-parity} shows an IDISA-A parity implementation
    10291026with only 5 operations required for 32-bit fields,
    10301027slightly more than a 2X improvement over the 11 operations
    1031 required in Warren's implementation using shifts  \cite{HackersDelight}.
    1033 This represents slightly
    1034 more than a 2X improvement.  The improvement is less than
     1028required in a RefB implementation following Warren
     1029 \cite{HackersDelight}. The improvement is less than
    103510303X seen in other cases because one of the operands need
    10361031not be modified before applying the exclusive-or operation.
    10651060\subsection{String/Decimal/Integer Conversion}
    1066 \begin{figure}[h]
    1067 \begin{center}\small
    1068 \begin{verbatim}
    1069 b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
    1070 b=(d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF);
    1071 b=(d & 0x0000FFFF) + 10000 * (d >> 16);
    1072 \end{verbatim}
    1073 \end{center}
    1074 \caption{BCD to Integer Reference Algorithm}
    1075 \label{BCD2int}
    1076 \end{figure}
    1078 \begin{figure}[h]
    1079 \begin{center}\small
    1080 \begin{verbatim}
    1081 t1=simd<8>:const(10);
    1082 t2=simd<16>:const(100);
    1083 t3=simd<32>:const(10000);
    1084 b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d);
    1085 b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b);
    1086 b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b);
    1087 \end{verbatim}
    1088 \end{center}
    1089 \caption{IDISA BCD to Integer}
    1090 \label{ID-BCD2int}
    1091 \end{figure}
    10931062Just as DNA sequences represent an important use case for
    10981067conversion and as a direct representation for
    10991068packed binary coded decimal.
     1073b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
     1074b=(d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF)
     1075b=(d & 0x0000FFFF) + 10000 * (d >> 16)
     1078\caption{BCD to Integer Reference Algorithm}
     1088b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d)
     1089b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b)
     1090b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b)
     1093\caption{IDISA BCD to Integer}
    11011097Figure \ref{BCD2int} shows a three-step inductive
    11321128\subsection{Further Applications}
    1134 Further applications of inductive doubling can be
    1135 found in integer contraction and dilation for quadtrees and
     1131Further applications of IDISA can often be found
     1132by searching for algorithms employing the magic masks
     1133s \verb:0x55555555:,  \verb:0x33333333:, and so on.
     1134Examples include the bitslice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
     1135and integer contraction and dilation for quadtrees and
    11361136octtrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
    11371137Pixel packing from 32 bit fields into a 5:5:5 representation
    12121212offsetting the circuit complexity of half-operand
    12131213modifiers with potential elimination of dedicated
    1214 logic for some {/ad hoc} horizontal SWAR operations.
     1214logic for some {\em ad hoc} horizontal SWAR operations.
    12151215Even if legacy support for these operations is required,
    12161216it may be possible to provide that support through
    1271 \includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
     1271\includegraphics[width=85mm, trim= 40 350 0 50]{IDISA.pdf}
    12721272\caption{IDISA Block Diagram}
Note: See TracChangeset for help on using the changeset viewer.