# Changeset 255

Timestamp:
Jan 4, 2009, 3:54:20 PM
Message:

References, cleanups

Location:
docs/ASPLOS09
Files:
3 edited

\authorinfo{Robert D. Cameron \and Dan Lin}
{School of Computing Science, Simon Fraser University}
{\tt \{cameron, lindanl\}@cs.sfu.ca}

design.  First we introduce a simple model and notation for SWAR operations in general and then present the four key features of IDISA.

the simultaneous calculation of individual field values in accord with the following equation.
\begin{eqnarray}
r_i &=& F_n(a_i, b_i)
\end{eqnarray}
For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left (\verb:sll:) are defined as follows.
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
Throughout this paper, we focus on SWAR implementations using a {\em three-register model} involving two input registers and one output register, each of size $N=2^K$ bits.
%\doublespace
The SSE and Altivec instruction sets support each of the addition and subtraction operations in SWAR form for 8, 16 and 32-bit fields, while the SSE set also includes the 64-bit forms.  For example, \verb#simd<8>::add# in our nomenclature is provided by the operation \verb:paddb: on SSE and the operation \verb:vaddubm: on Altivec.  The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and \verb#simd<32>::sll# are available on Altivec, directly following the three-register model.  The SSE set uses a two-register model, with the result being copied back to one of the input operands; however, the C language intrinsics commonly used to access these instructions reflect the three-register model.
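As a quick sanity check, the three field equations above can be rendered directly in plain C for a single $n$-bit field (a sketch only; the function names are ours, not part of the paper's nomenclature, and $n$ is assumed to be between 1 and 31):

```c
#include <stdint.h>

/* Direct C renderings of the per-field definitions, for one n-bit field.
   Assumes 1 <= n <= 31 so that (1u << n) does not overflow. */
uint32_t add_n(unsigned n, uint32_t a, uint32_t b) {
    return (a + b) & ((1u << n) - 1);               /* (a+b) mod 2^n */
}
uint32_t sub_n(unsigned n, uint32_t a, uint32_t b) {
    return (a - b + (1u << n)) & ((1u << n) - 1);   /* (a-b+2^n) mod 2^n */
}
uint32_t sll_n(unsigned n, uint32_t a, uint32_t b) {
    return (a << (b % n)) & ((1u << n) - 1);        /* a * 2^(b mod n) mod 2^n */
}
```

Note that the shift count is reduced modulo the field width, so shifting an 8-bit field by 8 leaves it unchanged.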
The SSE set extends these operations to include operations on 64-bit fields, but constrains the shift instructions, requiring that all fields shift by the same amount.  Given these definitions and notation, we now present the four key features of IDISA.  For the purpose of direct and efficient support for inductive doubling algorithms on bit streams, the provision of a core set of operations at field widths of 2 and 4, as well as the more traditional field widths of 8, 16 and 32, is key.  In essence, inductive doubling algorithms work by establishing some base property at either single or 2-bit fields.  Each iteration of the algorithm then extends the property to fields of twice the width.  Thus packing with conversion by masking off all but the low $n/2$ bits of each field may be performed using the operation \verb#simd<n>::pack#.  The final facility of the inductive doubling architecture is a set of merging operations \verb#simd<n>::mergeh# and \verb#simd<n>::mergel# that produce $2n$-bit fields by concatenating corresponding $n$-bit fields from the respective operand registers.  The operations \verb#r=simd<n>::mergeh(a,b)# and \verb#s=simd<n>::mergel(a,b)# merge the fields of the high and low halves, respectively, of the two operands.  A constant load operation \verb#simd<n>::constant(c)# loads the constant value $c$ into each $n$-bit field.  The pack and merge facilities of SSE will also be assumed.  Reference architecture B (RefB) consists of a register-rich instruction set architecture.  The rotate left logical instruction \verb#simd<n>::rotl(a,b)# rotates each field of $a$ by the rotation count in the corresponding field of $b$.
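The merge and rotate semantics described here can be modeled in plain C on a 32-bit "register" of four 8-bit fields (a sketch under our own naming, with field 0 taken as the most significant byte; not the paper's implementation):

```c
#include <stdint.h>

/* mergeh: interleave the 8-bit fields of the high halves of a and b. */
uint32_t mergeh8(uint32_t a, uint32_t b) {
    return  (a & 0xFF000000u)            /* a field 0 */
         | ((b >> 8)  & 0x00FF0000u)     /* b field 0 */
         | ((a >> 8)  & 0x0000FF00u)     /* a field 1 */
         | ((b >> 16) & 0x000000FFu);    /* b field 1 */
}

/* mergel: interleave the 8-bit fields of the low halves of a and b. */
uint32_t mergel8(uint32_t a, uint32_t b) {
    return ((a << 16) & 0xFF000000u)     /* a field 2 */
         | ((b << 8)  & 0x00FF0000u)     /* b field 2 */
         | ((a << 8)  & 0x0000FF00u)     /* a field 3 */
         |  (b        & 0x000000FFu);    /* b field 3 */
}

/* rotl: rotate each 8-bit field of a left by the count held in the
   corresponding field of b (counts taken mod 8). */
uint32_t simd8_rotl(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 8) {
        uint32_t f = (a >> i) & 0xFFu;
        unsigned c = (unsigned)((b >> i) & 0x7u);
        r |= (((f << c) | (f >> (8 - c))) & 0xFFu) << i;
    }
    return r;
}
```

For example, merging `0xA0A1A2A3` and `0xB0B1B2B3` produces `0xA0B0A1B1` (high) and `0xA2B2A3B3` (low).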
A three-input \verb#simd<1>::if(a,b,c)# bitwise logical operator implements the logic $r = a \wedge b \vee \neg a \wedge c$, patterned after the Altivec \verb:vec_sel: operation.  Finally, a \verb#simd<8>::permute(a,b,c)# operation selects an arbitrary permutation of the bytes of $a$ and $b$ under control of the index values in $c$.  In each case, we assume a single-cycle execution model with unconstrained register access.  This assumption makes for straightforward performance evaluation based on instruction count for straight-line computational kernels.  Furthermore, the assumption also eliminates artifacts due to possibly different latencies in reference and IDISA architectures.  Because the same assumption is made for reference and IDISA architectures, determination of the additional circuit complexity due to IDISA features is unaffected by the assumption.  In the remainder of this paper, then, IDISA-A and IDISA-B refer to the IDISA extensions of RefA and RefB, respectively.

\section{Population Count}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + (c >> 16);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

An inductive doubling algorithm of $n$ steps typically applies mask or shift operations at each step for each of the two operands being combined in the step.  In general, the mask constants shown in Figure \ref{HD-pop} recur; these are termed ``magic masks'' by Knuth \cite{v4pf1a}.  If there is one such operation for each operand, and the combination can be implemented in a single SWAR operation, then a total of $3n$ operations are required in a RefB implementation, and possibly $4n$ for a RefA implementation.  This is optimal for any three-register instruction set model.
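Stopping the magic-mask steps after the 8-bit stage yields a SWAR population count computed independently within each byte, which is the per-field flavor of the same inductive doubling idea; a C sketch (function name ours, a scalar model rather than the paper's SIMD code):

```c
#include <stdint.h>

/* SWAR population count within each 8-bit field of a 32-bit word:
   the first three inductive-doubling steps with the magic masks. */
uint32_t popcount_per_byte(uint32_t x) {
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);  /* 2-bit counts */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit counts */
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);  /* 8-bit counts */
    return x;  /* each byte now holds the bit count of that byte */
}
```

Continuing with the 16- and 32-bit steps would sum the per-byte counts into a single total, as in the reference algorithm.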
\begin{figure}
\begin{center}
\includegraphics[width=87mm, trim= 40 50 0 50]{S2P_IO.pdf}
\caption{Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
even={0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd ={1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::constant(0xAA);
t0 = simd<8>::permute(s0, s1, even);
\end{verbatim}
\end{center}
\end{figure}

dividing the bitpair streams into bit streams in the third stage.

\begin{figure}
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Step in Inductive Halving Algorithm Stage 1}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive halving algorithm, comprising just two IDISA-A operations.  Stage 2 uses a similar step, with one \verb#simd<4>::pack# operation to extract the high pair of each nybble and one \verb#simd<4>::pack# operation to extract the low pair of each nybble.  Four applications of this step complete stage 2.  Stage 3 similarly uses four applications of a step following the same model.

\begin{figure*}
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\end{tabular}
\end{center}
\end{figure*}

left instruction.  Right justification by shifting an $n$-bit field $i$ positions to the right is equivalent to a left rotate of $n-i$ positions.
Given a register value \verb:c8: preloaded with the value 8 in each 8-bit field, the right rotation amounts are computed by the operation \verb#rj=simd<8>::sub(c8, cts_4)# producing values shown in row 5, except that don't-care fields (which won't be subsequently used) are marked \verb:XX:.  The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)# producing the values shown in row 6, and are then combined with the right shift amounts by the selection operation \verb#rot_8=simd_if(mask0xFF00, rj, lj)# as shown in row 7.  Using these computed values, the inductive step is completed by application of the operation \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#.

\begin{figure}[tb]
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

codes such as the various Hamming codes for detecting and correcting numbers of bit errors dependent on the number of parity bits added.  Figure \ref{ID-parity} shows an IDISA-A parity implementation with only 5 operations required for 32-bit fields, slightly more than a 2X improvement over the 11 operations required in a RefB implementation following Warren \cite{HackersDelight}.  The improvement is less than the 3X seen in other cases because one of the operands need not be modified before applying the exclusive-or operation.
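For comparison, the shift-based scalar parity computation in the style of Warren can be sketched in C as follows (function name ours); counting the five shifts, five exclusive-ors, and the final mask gives the eleven operations cited above:

```c
#include <stdint.h>

/* Scalar parity of a 32-bit word: fold the halves together with xor,
   five times, leaving the overall parity in bit 0. */
uint32_t parity32(uint32_t x) {
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}
```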
\subsection{String/Decimal/Integer Conversion}
Just as DNA sequences represent an important use case for conversion, decimal numbers are important both for string conversion and as a direct representation in packed binary coded decimal form.

\begin{figure}
\begin{center}\small
\begin{verbatim}
b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
b=(d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF)
b=(d & 0x0000FFFF) + 10000 * (d >> 16)
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
t1=simd<8>::constant(10)
t2=simd<16>::constant(100)
t3=simd<32>::constant(10000)
b=simd<8>::add<x,l>(simd<8>::mult<h,l>(d,t1), d)
b=simd<16>::add<x,l>(simd<16>::mult<h,l>(b,t2), b)
b=simd<32>::add<x,l>(simd<32>::mult<h,l>(b,t3), b)
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer}
\label{ID-BCD2int}
\end{figure}

Figure \ref{BCD2int} shows a three-step inductive doubling computation for converting eight packed BCD digits to a binary integer.

\subsection{Further Applications}
Further applications of IDISA can often be found by searching for algorithms employing the magic masks \verb:0x55555555:, \verb:0x33333333:, and so on.
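The three-step reference algorithm (mask, multiply by a power of ten, add, with the span doubling each step) runs directly as C; a sketch with our own function name, assuming `d` holds eight packed BCD digits with the most significant digit in the high nybble:

```c
#include <stdint.h>

/* Three-step BCD-to-integer conversion of a 32-bit word of eight
   packed BCD digits.  Each step combines adjacent decimal values:
   nybbles into bytes, bytes into halfwords, halfwords into the word. */
uint32_t bcd_to_int(uint32_t d) {
    uint32_t b;
    b = (d & 0x0F0F0F0Fu) + 10u    * ((d >> 4) & 0x0F0F0F0Fu);
    b = (b & 0x00FF00FFu) + 100u   * ((b >> 8) & 0x00FF00FFu);
    b = (b & 0x0000FFFFu) + 10000u * (b >> 16);
    return b;
}
```

For instance, the packed BCD word `0x12345678` converts to the integer 12345678.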
Examples include the bitslice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06} and integer contraction and dilation for quadtrees and octtrees \cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.  Pixel packing from 32 bit fields into a 5:5:5 representation is another example.

The circuit complexity of half-operand modifiers may be offset by the potential elimination of dedicated logic for some {\em ad hoc} horizontal SWAR operations.  Even if legacy support for these operations is required, it may be possible to provide that support through

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=85mm, trim= 40 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}
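As one concrete instance of the magic-mask pattern in these applications, bit dilation for Morton-order (quadtree) indexing can be sketched in C (function name ours; a standard formulation, not taken from the cited works):

```c
#include <stdint.h>

/* Dilate the 16 bits of x into the even bit positions of a 32-bit
   result, using the magic masks in reverse order.  Interleaving two
   dilated coordinates yields a Morton-order index. */
uint32_t dilate16(uint32_t x) {
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}
```

A Morton index for coordinates `(x, y)` is then `dilate16(x) | (dilate16(y) << 1)`; for example, `dilate16(3)` gives binary `101`, i.e. 5.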