# Changeset 241 for docs/ASPLOS09

Ignore:
Timestamp:
Dec 18, 2008, 8:41:12 AM (10 years ago)
Message:

S2P evaluation with RefA/RefB/IDISA-A/IDISA-B

Location:
docs/ASPLOS09
Files:
2 edited

### Legend:

Unmodified
 r240 could potentially be added to any SWAR processor.  The goal in this paper is to evaluate these features independent of any artifacts that may be due to any particular realization, of artifacts that may be due to any particular realization, while still considering realistic models based on existing commodity instruction set architectures.  For the purpose directly adapted to SWAR processing using 128-bit registers. Each binary operator is replaced by a corresponding binary SIMD operation.   Without the IDISA features, a SWAR operation.   Without the IDISA features, a straightforward RefA implementation of population count for 32-bit fields thus employs 10 operations to load or generate In this section, we consider the first major application of the inductive doubling architecture: transposition of byte stream data to parallel bit stream application of IDISA: transposition of byte stream data to parallel bit stream form.   Of course, this operation is critical to the method of parallel bit streams and all applications of the method can benefit from a highly efficient transposition process.  Before considering how the inductive doubling architecture supports this the IDISA supports this transposition process, however, we first consider algorithms on existing architectures.  Two algorithms are presented; the best of these requires 72 SIMD operations in the three-register model to perform SWAR operations under the RefB model to perform transposition of eight serial registers of byte stream data into eight parallel registers of bit stream data. We then go on to show how the transposition problem can be solved using the inductive doubling architecture with a mere 24 three-register SIMD operations.  We also prove can be solved using IDISA-A or IDISA-B with a mere 24 three-register SIMD operations.  We also show that this is optimal for any three-register instruction set model. \begin{figure}[tbh] Figure \ref{s2p-spec} illustrates the input-output requirements of the transposition problem.  We assume that inputs and outputs are each SIMD registers of size $N=2^K$ bits. outputs are each SWAR registers of size $N=2^K$ bits. The input consists of $N$ bytes of serial byte data, stored consecutively in eight SIMD registers each holding \end{figure} One straightforward algorithm for implementing the transposition process takes advantage of SIMD bit gathering operations that exist takes advantage of SWAR bit gathering operations that exist on some architectures.  This operation gathers one bit per byte from a particular position within each byte of a SIMD register. A much more efficient transposition algorithm on commodity SIMD architectures (SSE and Altivec) involves three SWAR architectures involves three stages of binary division transformation.  This is similar to the three stage bit matrix inversion described by Warren  \cite{HackersDelight}, although modified to use SIMD operations. Warren  \cite{HackersDelight}, although modified to use SWAR operations. In each stage, input streams are divided into two half-length output streams. The first stage separates the bits at even numbered positions from those byte in the original stream and a second stream consisting of bits from positions 2 and 6 of each original byte.  The stream of bits from odd positions is similarly divided into streams for bits from Each of the odd positions is similarly divided into streams for bits from each of the positions 1 and 5 and bits from positions 2 and 6. Finally, each of the four streams resulting from the second stage are divided into the desired individual bit streams in the third stage. % \begin{figure}[tbh] % \begin{center}\small % \begin{verbatim} % s0h = simd<16>::srli<8>(s0); % s0l = simd_and(s0, simd<16>::const(0x00FF)); % s1h = simd<16>::srli<8>(s1); % s1l = simd_and(s1, simd<16>::const(0x00FF)); % t0 = simd<16>::pack(s0h, s1h); % t1 = simd<16>::pack(s0l, s1l); % t0_l1 = simd<16>::slli<1>(t0); % t0_r1 = simd<16>::srli<1>(t1); % mask = simd<8>::const(0xAA); % p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask)); % p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask)); % \end{verbatim} % \end{center} % \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm} % \label{s2pstep} % \end{figure} % \begin{figure}[tbh] \begin{center}\small \begin{verbatim} t0 = simd<16>::pack(s0, s1); t1 = simd<16>::pack(s0, s1); p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1)); p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1); even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}; odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}; mask = simd<8>::const(0xAA); t0 = simd<8>::permute(s0, s1, even); t1 = simd<8>::permute(s0, s1, odd); p0 = simd_if(mask, t0, simd<16>::srli<1>(t1)); p1 = simd_if(mask, simd<16>::slli<1>(t0), t1); \end{verbatim} \end{center} \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm} \caption{RefB Transposition Step in BytePack Stage 1} \label{s2pstep} \end{figure} using byte packing, shifting and masking.  In each stage, a transposition step combines each pair of serial input registers to produce a pair of parallel output registers.  In essence, doublebytes from the input registers are packed into bytes in the output registers, with the bits from even positions stored in the bytes of one output stream and the bits from odd positions stored in the bytes of the second output stream. Figure \ref{s2pstep} shows a step in stage 1 of the algorithm producing two parallel registers \verb:p0: and \verb:p1: from two serial registers \verb:s0: and \verb:s1:.  This step is applied produce a pair of parallel output registers. Figure \ref{s2pstep} shows a stage 1 transposition step in a Ref B implementation.  Using the permute facility, the even and odd bytes, respectively, from two serial input registers \verb:s0: and \verb:s1: are packed into temporary registers \verb:t0: and \verb:t1:.  The even and odd bits are then separated into two parallel output registers \verb:p0: and \verb:p1: by selecting alternating bits using a mask.  This step is applied four times in stage 1; stages 2 and 3 also consist of four applications of a similar step with different shift and masking constants. Although we have used the idealized SIMD notation here, each of the operations maps to a single operation in the Altivec set and a small number of operations in the SSE set.  Using the Altivec set, there are 6 operations per step for a total of 24 operations per stage. The three stages combined required 72 operations to transpose 128 bytes to parallel bit stream form.  This is the best algorithm known to us for existing SIMD architectures. of a similar step with different shift and mask constants. Overall, 6 operations per step are required, yielding a total of 72 operations to transpose 128 bytes to parallel bit stream form in the RefB implementation. In a RefA implementation, byte packing may also be achieved by the \verb#simd<16>::pack# with 4 additional operations to prepare operands.  Essentially, the RefB implementation uses single permuite instructions to implement the equivalent of \verb#simd<16>::pack(s0, s1)# and \verb#simd<16>::pack(s0, s1)#. The RefA implementation also requires 3 logic operations to implement each \verb#simd_if#. Assuming that mask loads are only need once per 128 bytes, a total of 148 operations are required in the RefB implementation. \subsection{Inductive Halving Algorithm} Using the inductive doubling architecture, it is possible to design Using IDISA, it is possible to design a transposition algorithm that is both easier to understand and requires many fewer operations than the the bytepack algorithm described above. dividing the bitpair streams into bit streams in the third stage. \begin{figure}[tbh] \small p1 = simd<8>::pack(s0, s1); \end{verbatim} \caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm} \caption{Stage 1 Transposition Step in the Inductive Halving Algorithm} \label{halvingstep} \end{figure} Figure \ref{halvingstep} shows one step in stage 1 of the inductive halving algorithm, comprising just two SIMD operations. halving algorithm, comprising just two IDISA-A operations. The \verb#simd<8>::pack# operation extracts the high nybble of each byte from the input registers, while the \verb#simd<8>::pack# operation extracts byte registers s0 through s7 into the eight parallel bit stream registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}. Under either IDISA-A or IDISA-B models, a mere 24 operations per 128 input bytes is required. \begin{figure}[tbh] \small \begin{verbatim} hi_nybble0 = simd<8>::pack(s0, s1); lo_nybble0 = simd<8>::pack(s0, s1); hi_nybble1 = simd<8>::pack(s2, s3); lo_nybble1 = simd<8>::pack(s2, s3); hi_nybble2 = simd<8>::pack(s4, s5); lo_nybble2 = simd<8>::pack(s4, s5); hi_nybble3 = simd<8>::pack(s6, s7); lo_nybble3 = simd<8>::pack(s6, s7); hh_pair0 = simd<4>::pack(hi_nybble0, hi_nybble1); hl_pair0 = simd<4>::pack(hi_nybble0, hi_nybble1); lh_pair0 = simd<4>::pack(lo_nybble0, lo_nybble1); ll_pair0 = simd<4>::pack(lo_nybble0, lo_nybble1); hh_pair1 = simd<4>::pack(hi_nybble2, hi_nybble3); hl_pair1 = simd<4>::pack(hi_nybble2, hi_nybble3); lh_pair1 = simd<4>::pack(lo_nybble2, lo_nybble3); ll_pair1 = simd<4>::pack(lo_nybble2, lo_nybble3); hnybble0 = simd<8>::pack(s0, s1); lnybble0 = simd<8>::pack(s0, s1); hnybble1 = simd<8>::pack(s2, s3); lnybble1 = simd<8>::pack(s2, s3); hnybble2 = simd<8>::pack(s4, s5); lnybble2 = simd<8>::pack(s4, s5); hnybble3 = simd<8>::pack(s6, s7); lnybble3 = simd<8>::pack(s6, s7); hh_pair0 = simd<4>::pack(hnybble0, hnybble1); hl_pair0 = simd<4>::pack(hnybble0, hnybble1); lh_pair0 = simd<4>::pack(lnybble0, lnybble1); ll_pair0 = simd<4>::pack(lnybble0, lnybble1); hh_pair1 = simd<4>::pack(hnybble2, hnybble3); hl_pair1 = simd<4>::pack(hnybble2, hnybble3); lh_pair1 = simd<4>::pack(lnybble2, lnybble3); ll_pair1 = simd<4>::pack(lnybble2, lnybble3); bit0 = simd<2>::pack(hh_pair0, hh_pair1); bit1 = simd<2>::pack(hh_pair0, hh_pair1);