Changeset 241 for docs

Dec 18, 2008, 8:41:12 AM (11 years ago)

S2P evaluation with RefA/RefB/IDISA-A/IDISA-B

2 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r240 r241  
    424424could potentially be added to any SWAR processor.  The goal
    425425in this paper is to evaluate these features independent
    426 of any artifacts that may be due to any particular realization,
     426of artifacts that may be due to any particular realization,
    427427while still considering realistic models based on existing
    428428commodity instruction set architectures.  For the purpose
    573573directly adapted to SWAR processing using 128-bit registers.
    574574Each binary operator is replaced by a corresponding binary
    575 SIMD operation.   Without the IDISA features, a
     575SWAR operation.   Without the IDISA features, a
    576576straightforward RefA implementation of population count for
    57757732-bit fields thus employs 10 operations to load or generate
    607607In this section, we consider the first major
    608 application of the inductive doubling architecture:
    609 transposition of byte stream data to parallel bit stream
     608application of IDISA: transposition of byte stream data to parallel bit stream
    610609form.   Of course, this operation is critical to the
    611610method of parallel bit streams and all applications
    612611of the method can benefit from a highly efficient
    613612transposition process.  Before considering how
    614 the inductive doubling architecture supports this
     613the IDISA supports this
    615614transposition process, however, we first consider
    616615algorithms on existing architectures.  Two algorithms
    617616are presented; the best of these requires 72
    618 SIMD operations in the three-register model to perform
     617SWAR operations under the RefB model to perform
    619618transposition of eight serial registers of byte stream data into
    620619eight parallel registers of bit stream data.
    622621We then go on to show how the transposition problem
    623 can be solved using the inductive doubling architecture
    624 with a mere 24 three-register SIMD operations.  We also prove
     622can be solved using IDISA-A or IDISA-B
     623with a mere 24 three-register SIMD operations.  We also show
    625624that this is optimal for any three-register instruction set model.
    636634Figure \ref{s2p-spec} illustrates the input-output requirements of
    637635the transposition problem.  We assume that inputs and
    638 outputs are each SIMD registers of size $N=2^K$ bits.
     636outputs are each SWAR registers of size $N=2^K$ bits.
    639637The input consists of $N$ bytes of serial byte data,
    640638stored consecutively in eight SIMD registers each holding
    657655One straightforward algorithm for implementing the transposition process
    658 takes advantage of SIMD bit gathering operations that exist
     656takes advantage of SWAR bit gathering operations that exist
    659657on some architectures.  This operation gathers one bit per byte
    660658from a particular position within each byte of a SIMD register.
    688686A much more efficient transposition algorithm on commodity
    689 SIMD architectures (SSE and Altivec) involves three
     687SWAR architectures involves three
    690688stages of binary division transformation.  This is similar
    691689to the three stage bit matrix inversion described by
    692 Warren  \cite{HackersDelight}, although modified to use SIMD operations.
     690Warren  \cite{HackersDelight}, although modified to use SWAR operations.
    693691In each stage, input streams are divided into two half-length output streams.
    694692The first stage separates the bits at even numbered positions from those
    699697byte in the original stream and a second stream consisting of bits
    700698from positions 2 and 6 of each original byte.  The stream of bits from
    701 odd positions is similarly divided into streams for bits from Each of the
     699odd positions is similarly divided into streams for bits from each of the
    702700positions 1 and 5 and bits from positions 2 and 6.
    703701Finally, each of the four streams resulting from the second stage are
    704702divided into the desired individual bit streams in the third stage.
     704% \begin{figure}[tbh]
     705% \begin{center}\small
     706% \begin{verbatim}
     707% s0h = simd<16>::srli<8>(s0);
     708% s0l = simd_and(s0, simd<16>::const(0x00FF));
     709% s1h = simd<16>::srli<8>(s1);
     710% s1l = simd_and(s1, simd<16>::const(0x00FF));
     711% t0 = simd<16>::pack(s0h, s1h);
     712% t1 = simd<16>::pack(s0l, s1l);
     713% t0_l1 = simd<16>::slli<1>(t0);
     714% t0_r1 = simd<16>::srli<1>(t1);
     715% mask = simd<8>::const(0xAA);
     716% p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask));
     717% p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask));
     718% \end{verbatim}
     719% \end{center}
     720% \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
     721% \label{s2pstep}
     722% \end{figure}
    709 t0 = simd<16>::pack<h,h>(s0, s1);
    710 t1 = simd<16>::pack<l,l>(s0, s1);
    711 p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1));
    712 p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1);
     728even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
     729odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
     730mask = simd<8>::const(0xAA);
     731t0 = simd<8>::permute(s0, s1, even);
     732t1 = simd<8>::permute(s0, s1, odd);
     733p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
     734p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
    715 \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
     737\caption{RefB Transposition Step in BytePack Stage 1}
    720742using byte packing, shifting and masking.  In each stage, a
    721743transposition step combines each pair of serial input registers to
    722 produce a pair of parallel output registers.  In essence,
    723 doublebytes from the input registers are packed into bytes
    724 in the output registers, with the bits from even positions stored
    725 in the bytes of one output stream and the bits from odd positions
    726 stored in the bytes of the second output stream.
    727 Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
    728 producing two parallel registers \verb:p0: and \verb:p1: from
    729 two serial registers \verb:s0: and \verb:s1:.  This step is applied
     744produce a pair of parallel output registers. 
     745Figure \ref{s2pstep} shows a stage 1 transposition step in a
     746Ref B implementation.  Using the permute facility, the even
     747and odd bytes, respectively, from two serial input registers
     748\verb:s0: and \verb:s1: are packed into temporary registers
     749\verb:t0: and \verb:t1:.  The even and odd bits are then
     750separated into two parallel output registers \verb:p0: and \verb:p1:
     751by selecting alternating bits using a mask.  This step is applied
    730752four times in stage 1; stages 2 and 3 also consist of four applications
    731 of a similar step with different shift and masking constants.
    733 Although we have used the idealized SIMD notation here, each of the
    734 operations maps to a single operation in the Altivec set and a small number
    735 of operations in the SSE set.  Using the Altivec set, there are
    736 6 operations per step for a total of 24 operations per stage. 
    737 The three stages combined required 72 operations to transpose 128 bytes
    738 to parallel bit stream form.  This is the best algorithm known to
    739 us for existing SIMD architectures.
     753of a similar step with different shift and mask constants.
     754Overall, 6 operations per step are required, yielding a total
     755of 72 operations to transpose 128 bytes to parallel bit stream
     756form in the RefB implementation.
     758In a RefA implementation, byte packing may also be achieved
     759by the \verb#simd<16>::pack# with 4 additional operations to
     760prepare operands.  Essentially, the RefB implementation
     761uses single permuite instructions to implement the equivalent of
     762\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
     763The RefA implementation also requires 3 logic operations to implement
     764each \verb#simd_if#.
     765Assuming that mask loads are only need once per 128 bytes,
     766a total of 148 operations are required in the RefB implementation.
    741768\subsection{Inductive Halving Algorithm}
    743 Using the inductive doubling architecture, it is possible to design
     770Using IDISA, it is possible to design
    744771a transposition algorithm that is both easier to understand and requires
    745772many fewer operations than the the bytepack algorithm described above.
    750777dividing the bitpair streams into bit streams in the third stage.
    757783p1 = simd<8>::pack<l,l>(s0, s1);
    759 \caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
     785\caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
    763789Figure \ref{halvingstep} shows one step in stage 1 of the inductive
    764 halving algorithm, comprising just two SIMD operations.
     790halving algorithm, comprising just two IDISA-A operations.
    765791The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
    766792from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
    780806byte registers s0 through s7 into the eight parallel bit stream
    781807registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
     808Under either IDISA-A or IDISA-B models, a mere 24 operations per 128
     809input bytes is required.
    786 hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
    787 lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
    788 hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
    789 lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
    790 hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
    791 lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
    792 hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
    793 lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
    794 hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
    795 hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
    796 lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
    797 ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
    798 hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
    799 hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
    800 lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
    801 ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
     814hnybble0 = simd<8>::pack<h,h>(s0, s1);
     815lnybble0 = simd<8>::pack<l,l>(s0, s1);
     816hnybble1 = simd<8>::pack<h,h>(s2, s3);
     817lnybble1 = simd<8>::pack<l,l>(s2, s3);
     818hnybble2 = simd<8>::pack<h,h>(s4, s5);
     819lnybble2 = simd<8>::pack<l,l>(s4, s5);
     820hnybble3 = simd<8>::pack<h,h>(s6, s7);
     821lnybble3 = simd<8>::pack<l,l>(s6, s7);
     822hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
     823hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
     824lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
     825ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
     826hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
     827hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
     828lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
     829ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
    802830bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
    803831bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
Note: See TracChangeset for help on using the changeset viewer.