Changeset 254 for docs


Timestamp: Jan 2, 2009, 3:57:30 PM
Author: cameron
Message: Trim to 12 pages
Location: docs/ASPLOS09
Files: 2 edited

  • docs/ASPLOS09/asplos094-cameron.tex

--- r253
+++ r254

 \end{abstract}
 
-\category{CR-number}{subcategory}{third-level}
-
-\terms
-term1, term2
-
-\keywords
-keyword1, keyword2
+\category{C.1.2}{PROCESSOR ARCHITECTURES}{Multiple Data Stream Architectures (Multiprocessors)}
+
+\terms Design, Performance
+
+\keywords inductive doubling, parallel bit streams, SWAR
 
 \sloppy
     
 of the proposed features is to support such algorithms
 with specific facilities to transition between successive power-of-2 field
-widths.  These transitions are quite frequent in several critical
-algorithms for parallel bit streams.  These transitions also
-occur in other applications as well.  In related work,
-efficient automatic interleaving based on power-of-2 strides has been
-reported quite useful for a number of SWAR kernels \cite{Nuzman}.
+widths.  These transitions are quite frequent in
+parallel bit stream programming as well as other applications.
 The specific features presented herein will be referred to
 as IDISA: inductive doubling instruction set architecture.
     
 \section{Inductive Doubling Architecture}
 
-This section presents an idealized model for a SWAR instruction set
+This section presents IDISA as an idealized model for a SWAR instruction set
 architecture designed specifically to support inductive doubling
-algorithms in the domain of parallel bit stream programming.
-The architecture is idealized
+algorithms.  The architecture is idealized
 in the sense that we concentrate on only the necessary features
 for our purpose, without enumerating the additional
     
 
 
-The idealized architecture supports typical SWAR integer
-operations common to existing commodity architectures such as SSE
-and Altivec.  The architecture is defined to support SWAR
-operations on registers of size $N=2^K$ bits, for some integer $K$.
+IDISA supports typical SWAR integer operations using a {\em
+three-register model} involving two input registers
+and one output register.  Each register is of size $N=2^K$ bits,
+for some integer $K$.
 Typical values of $K$ for commodity processors include $K=6$ for
 the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
 the 128-bit registers of SSE and Altivec technology and $K=8$ for
-the upcoming Intel AVX technology.  The registers may be
+the upcoming Intel AVX technology.
+The registers may be
 partitioned into $N/n$ fields of width $n=2^k$ bits for some values
 of $k \leq K$.  Typical values of $k$ used on commodity processors
     
 register $r$, using big-endian numbering.
 
-Throughout this paper, we focus on SWAR implementations using a {\em
-three-register model} involving two input registers
-and one output register, each of size $N=2^K$ bits.
 Let \verb:simd<n>: represent the class of SWAR operations defined
 on fields of size $n$ using C++ template syntax.  Given a
 binary function $F_n$ on $n$-bit fields, we denote the SWAR
-version of this operation as \verb#simd<n>::F#.  Given two
+version of this operation as \verb#simd<n>::F#.  Given two input
 registers \verb:a: and \verb:b: holding values $a$ and $b$,
 respectively, the operation \verb#r=simd<n>::F(a,b)# stores
-the value $r$ in the register \verb:r: as determined by
+the value $r$ in the output register \verb:r: as determined by
 the simultaneous calculation of individual field values in
 accord with the following equation.
     
 \mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
 \mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
 \end{eqnarray}
 %\doublespace
 
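The fieldwise semantics defined above can be sketched in plain Python. This is an illustrative model only; the names `fields`, `from_fields`, and `simd_F` are ours and not part of IDISA or any real ISA.

```python
# Toy model of the three-register SWAR semantics: r_i = F_n(a_i, b_i)
# computed simultaneously on every n-bit field of an N-bit register.

N = 128  # register width in bits (K = 7, as for SSE/Altivec)

def fields(x, n):
    """Split an N-bit register value into N//n fields, big-endian."""
    return [(x >> (N - (i + 1) * n)) & ((1 << n) - 1) for i in range(N // n)]

def from_fields(fs, n):
    """Reassemble n-bit field values into one N-bit register value."""
    r = 0
    for f in fs:
        r = (r << n) | (f & ((1 << n) - 1))
    return r

def simd_F(F, n, a, b):
    """The SWAR version simd<n>::F of a binary function F on n-bit fields."""
    return from_fields([F(x, y) for x, y in zip(fields(a, n), fields(b, n))], n)

# The three defining operations, exactly as in the equations above.
def add_n(n): return lambda x, y: (x + y) % (1 << n)
def sub_n(n): return lambda x, y: (x - y + (1 << n)) % (1 << n)
def sll_n(n): return lambda x, y: (x * 2 ** (y % n)) % (1 << n)
```

For example, `simd_F(add_n(8), 8, a, b)` adds corresponding bytes of `a` and `b` with per-field wraparound, matching the $\mbox{add}_n$ equation.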
     
 pipelined processors with a throughput of one SWAR instruction
 each processor cycle for straight-line code free of memory
-access.  Such processors are certainly feasible with a
-suitable investment in circuitry, although practical designs
-may make sacrifices to achieve close to optimal throughput
-with less circuitry.  However, the assumption makes for
+access.  This assumption makes for
 straightforward performance evaluation based on instruction
 count for straight-line computational kernels.  Furthermore,
     
 complexity due to IDISA features is unaffected by the
 assumption.
 
 In the remainder of this paper, then, IDISA-A and IDISA-B
     
 \verb#simd<2>::pack<h,h># operation to extract the high bit of
 each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
-each pair.  The complete algorithm to transform eight serial
+each pair.  Under either the IDISA-A or IDISA-B model,
+the complete algorithm to transform eight serial
 byte registers s0 through s7 into the eight parallel bit stream
-registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
-Under either IDISA-A or IDISA-B models, a mere 24 operations per 128
-input bytes is required.
-
-\begin{figure}[tbh]
-\small
-\begin{verbatim}
-hnybble0 = simd<8>::pack<h,h>(s0, s1);
-lnybble0 = simd<8>::pack<l,l>(s0, s1);
-hnybble1 = simd<8>::pack<h,h>(s2, s3);
-lnybble1 = simd<8>::pack<l,l>(s2, s3);
-hnybble2 = simd<8>::pack<h,h>(s4, s5);
-lnybble2 = simd<8>::pack<l,l>(s4, s5);
-hnybble3 = simd<8>::pack<h,h>(s6, s7);
-lnybble3 = simd<8>::pack<l,l>(s6, s7);
-hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
-hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
-lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
-ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
-hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
-hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
-lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
-ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
-bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
-bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
-bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
-bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
-bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
-bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
-bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
-bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
-\end{verbatim}
-\caption{Complete Inductive Halving Algorithm}
-\label{halvingalgorithm}
-\end{figure}
+registers bit0 through bit7 requires a mere 24 instructions per 128
+input bytes.
+
+% \begin{figure}[tbh]
+% \small
+% \begin{verbatim}
+% hnybble0 = simd<8>::pack<h,h>(s0, s1);
+% lnybble0 = simd<8>::pack<l,l>(s0, s1);
+% hnybble1 = simd<8>::pack<h,h>(s2, s3);
+% lnybble1 = simd<8>::pack<l,l>(s2, s3);
+% hnybble2 = simd<8>::pack<h,h>(s4, s5);
+% lnybble2 = simd<8>::pack<l,l>(s4, s5);
+% hnybble3 = simd<8>::pack<h,h>(s6, s7);
+% lnybble3 = simd<8>::pack<l,l>(s6, s7);
+% hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
+% hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
+% lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
+% ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
+% hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
+% hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
+% lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
+% ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
+% bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
+% bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
+% bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
+% bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
+% bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
+% bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
+% bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
+% bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
+% \end{verbatim}
+% \caption{Complete Inductive Halving Algorithm}
+% \label{halvingalgorithm}
+% \end{figure}
 
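The 24-instruction halving transposition can be checked with a small Python model. This is our own sketch, not paper code: `pack` is an assumed model of `simd<n>::pack<h,h>`/`pack<l,l>`, and the register width is shrunk to a toy N = 8 (one byte per register) so the result is easy to verify.

```python
# Toy check of the inductive halving transposition: 24 pack operations
# turn 8 serial byte registers into 8 parallel bit streams.

N = 8  # toy register width: each register holds a single byte

def fields(x, n):
    return [(x >> (N - (i + 1) * n)) & ((1 << n) - 1) for i in range(N // n)]

def from_fields(fs, n):
    r = 0
    for f in fs:
        r = (r << n) | f
    return r

def pack(n, a, b, part):
    """Model of simd<n>::pack<h,h> ('h') / pack<l,l> ('l'): halve every
    n-bit field of a then b to its high or low n/2 bits."""
    half = n // 2
    lo = (1 << half) - 1
    fs = [(f >> half) if part == 'h' else (f & lo)
          for src in (a, b) for f in fields(src, n)]
    return from_fields(fs, half)

def inductive_halving(s):
    """The 24-operation algorithm: s[0..7] -> bit streams bit0..bit7."""
    hn = [pack(8, s[2*i], s[2*i + 1], 'h') for i in range(4)]    # 4 ops
    ln = [pack(8, s[2*i], s[2*i + 1], 'l') for i in range(4)]    # 4 ops
    hh = [pack(4, hn[2*j], hn[2*j + 1], 'h') for j in range(2)]  # 2 ops
    hl = [pack(4, hn[2*j], hn[2*j + 1], 'l') for j in range(2)]  # 2 ops
    lh = [pack(4, ln[2*j], ln[2*j + 1], 'h') for j in range(2)]  # 2 ops
    ll = [pack(4, ln[2*j], ln[2*j + 1], 'l') for j in range(2)]  # 2 ops
    out = []
    for pair in (hh, hl, lh, ll):                                # 8 ops
        out.append(pack(2, pair[0], pair[1], 'h'))
        out.append(pack(2, pair[0], pair[1], 'l'))
    return out  # bit0 .. bit7
```

A quick sanity check: bit j of byte k must land at big-endian position k of stream j, for all j and k.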
 \subsection{Optimality of the Inductive Halving Algorithm}
     
 implementation reduces this to 72.
 
-\begin{figure}[tbh]
-\begin{center}\small
-\begin{verbatim}
-bit01_r0 = simd<1>::mergeh(bit0, bit1);
-bit01_r1 = simd<1>::mergel(bit0, bit1);
-bit23_r0 = simd<1>::mergeh(bit2, bit3);
-bit23_r1 = simd<1>::mergel(bit2, bit3);
-bit45_r0 = simd<1>::mergeh(bit4, bit5);
-bit45_r1 = simd<1>::mergel(bit4, bit5);
-bit67_r0 = simd<1>::mergeh(bit6, bit7);
-bit67_r1 = simd<1>::mergel(bit6, bit7);
-bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
-bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
-bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
-bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
-bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
-bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
-bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
-bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
-s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
-s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
-s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
-s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
-s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
-s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
-s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
-s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
-\end{verbatim}
-\end{center}
-\label{p2s-inductive}
-\caption{Parallel to Serial Transposition by Inductive Doubling}
-\end{figure}
+% \begin{figure}[tbh]
+% \begin{center}\small
+% \begin{verbatim}
+% bit01_r0 = simd<1>::mergeh(bit0, bit1);
+% bit01_r1 = simd<1>::mergel(bit0, bit1);
+% bit23_r0 = simd<1>::mergeh(bit2, bit3);
+% bit23_r1 = simd<1>::mergel(bit2, bit3);
+% bit45_r0 = simd<1>::mergeh(bit4, bit5);
+% bit45_r1 = simd<1>::mergel(bit4, bit5);
+% bit67_r0 = simd<1>::mergeh(bit6, bit7);
+% bit67_r1 = simd<1>::mergel(bit6, bit7);
+% bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
+% bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
+% bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
+% bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
+% bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
+% bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
+% bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
+% bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
+% s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
+% s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
+% s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
+% s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
+% s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
+% s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
+% s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
+% s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
+% \end{verbatim}
+% \end{center}
+% \label{p2s-inductive}
+% \caption{Parallel to Serial Transposition by Inductive Doubling}
+% \end{figure}
+
 
 An algorithm employing only 24 operations using IDISA-A/B is relatively
     
 the nybble streams are merged in stage 3 to produce the
 desired byte stream data.  The full inductive doubling
-algorithm for parallel to serial transposition is shown in Figure
-\ref{p2s-inductive}.
-
-This algorithm is again optimal, requiring the fewest operations
+algorithm for parallel to serial transposition thus
+requires three stages of 8 instructions each.  The algorithm is again
+optimal, requiring the fewest operations
 of any possible algorithm using any 3-register instruction set
-model.  The proof directly follows that for the serial to parallel
-transposition.
+model; the proof directly follows that for the serial to parallel
+transposition.
 
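The three-stage merge algorithm can likewise be checked with a toy Python model. This is our own sketch: `merge` is an assumed model of `simd<n>::mergeh`/`mergel` (interleaving the n-bit fields of the high or low halves of two registers), again on a toy N = 8 register.

```python
# Toy check of parallel-to-serial transposition by inductive doubling:
# three stages of 8 merges each turn 8 bit streams back into 8 bytes.

N = 8  # toy register width: one byte per register

def fields(x, n):
    return [(x >> (N - (i + 1) * n)) & ((1 << n) - 1) for i in range(N // n)]

def from_fields(fs, n):
    r = 0
    for f in fs:
        r = (r << n) | f
    return r

def merge(n, a, b, part):
    """Model of simd<n>::mergeh ('h') / mergel ('l'): interleave the
    n-bit fields of the high or low halves of a and b: a0 b0 a1 b1 ..."""
    half = N // (2 * n)
    fa = fields(a, n)[:half] if part == 'h' else fields(a, n)[half:]
    fb = fields(b, n)[:half] if part == 'h' else fields(b, n)[half:]
    out = []
    for x, y in zip(fa, fb):
        out += [x, y]
    return from_fields(out, n)

def p2s(bit):
    """24 merges, following the three-stage structure described above."""
    b01 = [merge(1, bit[0], bit[1], p) for p in 'hl']
    b23 = [merge(1, bit[2], bit[3], p) for p in 'hl']
    b45 = [merge(1, bit[4], bit[5], p) for p in 'hl']
    b67 = [merge(1, bit[6], bit[7], p) for p in 'hl']
    b0123 = [merge(2, b01[i], b23[i], p) for i in range(2) for p in 'hl']
    b4567 = [merge(2, b45[i], b67[i], p) for i in range(2) for p in 'hl']
    return [merge(4, b0123[i], b4567[i], p) for i in range(4) for p in 'hl']
```

Round-tripping the transposition definition (bit j of byte k sits at position k of stream j) recovers the original bytes.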
 \begin{figure*}[t]
     
 \subsection{Parity}
 
-\begin{figure}[h]
-\begin{center}\small
-\begin{verbatim}
-y = x ^ (x >> 1);
-y = y ^ (y >> 2);
-y = y ^ (y >> 4);
-y = y ^ (y >> 8);
-y = y ^ (y >>16);
-y = y & 1;
-\end{verbatim}
-\end{center}
-\caption{Parity Reference Algorithm}
-\label{HD-parity}
-\end{figure}
+% \begin{figure}[h]
+% \begin{center}\small
+% \begin{verbatim}
+% y = x ^ (x >> 1);
+% y = y ^ (y >> 2);
+% y = y ^ (y >> 4);
+% y = y ^ (y >> 8);
+% y = y ^ (y >>16);
+% y = y & 1;
+% \end{verbatim}
+% \end{center}
+% \caption{Parity Reference Algorithm}
+% \label{HD-parity}
+% \end{figure}
 
 \begin{figure}[h]
 
 y = simd<8>::xor<h,l>(y, y);
 y = simd<16>::xor<h,l>(y, y);
-y = simd<32>::xor<h,l>(y, y);
 \end{verbatim}
 \end{center}
 
 \end{figure}
    10311022
-Figures \ref{HD-parity} shows Warren's algorithm for parity
-employing 11 operations, giving rise to a straightforward
-RefA implementation also using 11 operations.
 Parity has important applications for error-correcting
 codes such as the various Hamming codes for detecting
 and correcting numbers of bit errors dependent on the
 number of parity bits added.
 
-Figure \ref{ID-parity} shows the IDISA parity implementation
-with only 5 operations required.  This represents slightly
-more than a 2X improvement.  The improvement is less than
-3X seen in other cases because one of the operands need
-not be modified before applying the exclusive-or operation.
+Figure \ref{ID-parity} shows an IDISA-A parity implementation
+with only 5 operations required for 32-bit fields.  This represents
+slightly more than a 2X improvement over the 11 operations
+required in Warren's implementation using shifts \cite{HackersDelight}.
+The improvement is less than the 3X seen in other cases because
+one of the operands need not be modified before applying the
+exclusive-or operation.
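The two approaches can be compared in a small Python model. This is our own sketch for a single 32-bit field: `parity_reference` follows Warren's shift-based scheme, and `xor_h_l` is an assumed model of the half-operand `simd<n>::xor<h,l>` operation.

```python
# Warren-style shift parity (5 shifts, 5 xors, 1 and = 11 operations)
# versus inductive doubling with one simd<n>::xor<h,l> per level (5 ops).

def parity_reference(x):
    y = x ^ (x >> 1)
    y = y ^ (y >> 2)
    y = y ^ (y >> 4)
    y = y ^ (y >> 8)
    y = y ^ (y >> 16)
    return y & 1

def xor_h_l(n, y, width=32):
    """Model of simd<n>::xor<h,l>(y, y): within every n-bit field,
    XOR the high n/2 bits with the low n/2 bits."""
    half, r = n // 2, 0
    for i in range(width // n):
        f = (y >> (width - (i + 1) * n)) & ((1 << n) - 1)
        r = (r << n) | ((f >> half) ^ (f & ((1 << half) - 1)))
    return r

def parity_idisa(x):
    """5 operations: one xor<h,l> per doubling level; after the final
    level the single 32-bit field already holds 0 or 1."""
    for n in (2, 4, 8, 16, 32):
        x = xor_h_l(n, x)
    return x
```

Note that no final masking step is needed in the inductive-doubling version, since each halving XOR leaves only the parity value in every field.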
    10451037
+\subsection{Bit Reverse}
+
+Bit reverse is an important operation needed in a number
+of low-level codecs.  Following Warren's inductive
+doubling implementation using masks and shifts \cite{HackersDelight},
+a RefA implementation on 32-bit fields requires 28
+operations, while a straightforward IDISA-A implementation
+using \verb#simd<n>::rotli# at each inductive doubling
+level requires only 5 operations.
+
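The two operation counts can be illustrated in Python. This is our own sketch for one 32-bit field: the reference follows the Hacker's Delight mask-and-shift style, and `rotli` is an assumed model of the fieldwise `simd<n>::rotli` operation.

```python
# Bit reversal of a 32-bit value: mask-and-shift reference versus one
# rotate per inductive doubling level.  Rotating every n-bit field by
# n/2 swaps its halves; doing this for n = 2,4,8,16,32 reverses all bits.

def reverse_reference(x):
    """Mask-and-shift bit reversal in the style of Hacker's Delight."""
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555)
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333)
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F)
    x = ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)
    return ((x << 16) | (x >> 16)) & 0xFFFFFFFF

def rotli(n, y, imm, width=32):
    """Model of simd<n>::rotli<imm>: rotate every n-bit field left by imm."""
    mask, r = (1 << n) - 1, 0
    for i in range(width // n):
        f = (y >> (width - (i + 1) * n)) & mask
        r = (r << n) | (((f << imm) | (f >> (n - imm))) & mask)
    return r

def reverse_idisa(x):
    """5 operations: swap the halves of every 2-, 4-, 8-, 16-, 32-bit field."""
    for n in (2, 4, 8, 16, 32):
        x = rotli(n, x, n // 2)
    return x
```

The correctness argument is inductive: after the level-$n$ step every $n$-bit block is internally reversed, and rotating each $2n$-bit field by $n$ then swaps the two reversed halves into place.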
 \subsection{Packed DNA Representation}
    10471048
     
    10611062merging operations supported by existing SWAR facilities
    10621063on commodity processors.
    1063 
    1064 \subsection{Bit Reverse}
    1065 
    1066 Bit reverse is an important operation needed in a number
    1067 of low level codecs.  Following Warren's inductive
    1068 doubling implementation using masks and shifts \cite{HackersDelight},
    1069 a RefA implementation on 32-bit fields requires 28
    1070 operations, while a straightforward IDISA-A implementation
    1071 using \verb#simd<n>::rotli# at each inductive doubling
    1072 level requires only 5 operations.
    10731064
    10741065\subsection{String/Decimal/Integer Conversion}
     
 %\bibliographystyle{plainnat}
 \bibliographystyle{plain}
+
 \bibliography{asplos094-cameron}
 %\begin{thebibliography}{}