Changeset 248 for docs/ASPLOS09

Dec 26, 2008, 6:17:28 AM (10 years ago)

General clean-up, trimming, bit reverse, pixel deletion.

2 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r247 r248  
    186186register $r$, using big-endian numbering.
     188Throughout this paper, we focus on
     189SWAR implementations using the following {\em
     190three-register model}, although it is also possible to define
     191IDISA extensions for a two-register model of the SSE instruction set.
    188192Let \verb:simd<n>: represent the class of SIMD operations defined
    189193on fields of size $n$ using C++ template syntax.  Given a
    234238doubling algorithms on bit streams, the provision of
    235239a core set of operations at field widths of 2 and 4 as
    236 well as the more traditional field witdhs of 8, 16 and 32
     240well as the more traditional field widths of 8, 16 and 32
    237241is critical for elegant and efficient expression of the
    238242algorithms.  In essence, inductive doubling algorithms
    569573\subsection{Bit Gathering Algorithm}
    571 \begin{figure}[tbh]
    572 \begin{center}
    573 \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
    574 \caption{Serial to Parallel Transposition Using Bit-Gathering}
    575 \label{gather}
    576 \end{center}
    577 \end{figure}
     575% \begin{figure}[tbh]
     576% \begin{center}
     577% \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
     578% \caption{Serial to Parallel Transposition Using Bit-Gathering}
     579% \label{gather}
     580% \end{center}
     581% \end{figure}
    578582One straightforward algorithm for implementing the transposition process
    579583takes advantage of SWAR bit gathering operations that exist
    580584on some architectures.  This operation gathers one bit per byte
    581585from a particular position within each byte of a SIMD register.
    582 For example, the {\tt pmovmskb} operation of the Intel MMX and
    583 SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
     586For example, the {\tt pmovmskb} operation of the Intel
     587SSE instruction set forms a 16-bit mask
    584588consisting of the high bit of each byte.  Similarly, the
    585589{\tt \verb:si_gbb:} operation of the synergistic processing units of the
    586590Cell Broadband Engine gathers together the low bit of each byte
    587 from the SIMD register.  Figure \ref{gather} illustrates the
    588 bit gathering process.
    590 For each bit stream of output, the bit gather algorithm requires
    591 one gather operation for each of the 8 input registers,
    592 giving 64 bit gather operations in all.  In addition, for seven
    593 of the eight bit positions, it is necessary to shift the bits
    594 of each input register into the conventional position for
    595 gathering.  A total of 56 shift operations are required for this
    596 task.  Finally, the result of each bit gather operation must
    597 be properly inserted into the output stream.  If the first
    598 gather operation for a stream can be used to also initialize
    599 the output register, there will then need to be 7 insert
    600 operations for the results of the remaining gather operations
    601 for this stream, making 56 insert operations in all.
    602 The insert step may be more complex than a single operation
    603 in some cases, but we use one operation per insert as a lower bound.
    604 Thus, the bit gather algorithm requires
    605 at least 176 operations to perform the transposition task.
     591from the SIMD register. 
     592% Figure \ref{gather} illustrates the
     593% bit gathering process.
     595Using bit gathering, each bit stream of output is assembled 16 positions
     596at a time.  Bits from the input register must be shifted into
     597position, the gather operation performed and the result inserted
     598into position in the output register.  For the 8 streams, this
     599requires at least 22 operations for 16 positions, or 176 operations
     600to complete the transposition task.
    607602\subsection{BytePack Algorithm}
    609 A much more efficient transposition algorithm on commodity
     604A more efficient transposition algorithm on commodity
    610605SWAR architectures involves three
    611606stages of binary division transformation.  This is similar
    998993\section{Beyond Parallel Bit Streams}
    1000 It can be argued that text processing with parallel bit
    1001 streams by itself is not sufficiently important to justify
    1002 the circuit complexity of IDISA.  In this section, then
    1003 we move on to consider further applications that may
    1004 benefit from IDISA features.  These include additional
    1005 basic inductive doubling
    1006 algorithms such as parity, and least power-of-2 ceiling.
    1007 Further applications for narrow field widths are discussed
    1008 as well, such as packed DNA representations using 2-bit
    1009 field widths and packed decimal representations using 4-bit
    1010 field widths. Most significantly, however, we show that IDISA offers
    1011 a fully general solution to the problem of supporting
     995IDISA has a variety of applications in domains beyond
     996text processing with parallel bit streams.  We present
     997a number of examples in this section, including,
     998most significantly, a full general solution to the problem of supporting
    1012999{\em horizontal} SWAR operations.
    1014 \subsection{Inductive Doubling with Logic Operations}
    1034 x = x - 1;
    1035 x = x | (x >> 1);
    1036 x = x | (x >> 2);
    1037 x = x | (x >> 4);
    1038 x = x | (x >> 8);
    1039 x = x | (x >>16);
    1040 x = x + 1;
    1041 \end{verbatim}
    1042 \end{center}
    1043 \caption{Power-of-2 Ceiling Reference Algorithm}
    1044 \label{HD-clp2}
    1045 \end{figure}
    1047 \begin{figure}
    1048 \begin{center}\small
    1049 \begin{verbatim}
    10501021y = simd<2>::xor<h,l>(x, x);
    10511022y = simd<4>::xor<h,l>(y, y);
    1061 \begin{figure}
    1062 \begin{center}\small
    1063 \begin{verbatim}
    1064 x = simd<32>::sub(x, simd<32>::const(1));
    1065 x = simd<2>::or<x, h>(x, x);
    1066 x = simd<4>::or<x, h>(x, x);
    1067 x = simd<8>::or<x, h>(x, x);
    1068 x = simd<16>::or<x, h>(x, x);
    1069 x = simd<32>::or<x, h>(x, x);
    1070 x = simd<32>::add(x, simd<32>::const(1));
    1071 \end{verbatim}
    1072 \end{center}
    1073 \caption{IDISA Power-of-2 Ceiling Implementation}
    1074 \label{ID-clp2}
    1075 \end{figure}
    1077 Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
    1078 code for two further inductive doubling examples using
    1079 logical operators as the combining feature.  In the
    1080 first of these, the ``exclusive or'' operation is applied
    1081 to all bits in a 32-bit value to determine parity.
     1032Figures \ref{HD-parity} shows Warren's for parity
     1033employing 11 operations, giving rise to a straightforward
     1034RefA implementation also using 11 operations. 
    10821035Parity has important applications for error-correcting
    10831036codes such as the various Hamming codes for detecting
    10841037and correcting numbers of bit errors dependent on the
    1085 number of parity bits added.  Warren's version uses
    1086 11 operations for parity of a single 32-bit value;
    1087 a straightforward SWAR adaptation also uses 11 operations
    1088 for the parity of a set of 32-bit fields.
     1038number of parity bits added. 
    10901040Figure \ref{ID-parity} shows the IDISA parity implementation
    109310433X seen in other cases because one of the operands need
    10941044not be modified before applying the exclusive-or operation.
    1096 Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
    1097 code for the least power-of-2 ceiling of a 32-bit value.  The
    1098 technique is to employ inductive doubling to fill in one bits at
    1099 all bit positions after the first one bit in the original value to
    1100 first produce a value of the form $2^n-1$.  This is then incremented
    1101 to determine the power of 2 ceiling.  This function and
    1102 its straightforward adaptation for SWAR application on a
    1103 set of 32-bit fields requires 12 operations.   The
    1104 corresponding IDISA implementation of Figure \ref{ID-clp2}
    1105 requires 7 operations, just under a 2X improvement.
    11071046\subsection{Packed DNA Representation}
    11251064\subsection{Bit Reverse}
    1127 Include or omit?
     1066Bit reverse is an important operation needed in a number
     1067of low level codecs.  Following Warren's inductive
     1068doubling implementation using masks and shifts \cite{HackersDelight},
     1069a RefA implementation on 32-bit fields requires 28
     1070operations, while a straightforward IDISA-A implementation
     1071using \verb#simd<n>::rotli# at each inductive doubling
     1072level requires only 5 operations.
    11291074\subsection{String/Decimal/Integer Conversion}
     1140\subsection{Pixel Packing}
     1142Pixel packing from 32 bit fields into a 5:5:5 representation
     1143is a further application of parallel bit deletion.
    11971145\subsection{Systematic Support for Horizontal SIMD Operations}
    15371485In the case of transposition algorithms to and from parallel bit stream
    15381486form, the architecture has been shown to make possible
    1539 straightforward inductive doubling algorithms with the
    1540 lowest total number of operations that can be achieved by
    1541 any possible 3-register SWAR architecture.
     1487straightforward inductive doubling algorithms with a 3X
     1488speedup over the best known versions on permute-capable
     1489reference architectures, achieving the lowest total number
     1490of operations of any possible 3-register SWAR architecture.
    15431492Applications of IDISA in other areas have also been
    15651514support in further ways.   For example, it may be possible
    15661515to develop a pipelined architecture supporting two or three
    1567 steps of inductive doubling in a single operation.
     1516steps of inductive doubling in a single operation. 
Note: See TracChangeset for help on using the changeset viewer.