# Changeset 248 for docs

Ignore:
Timestamp:
Dec 26, 2008, 6:17:28 AM (10 years ago)
Message:

General clean-up, trimming, bit reverse, pixel deletion.

Location:
docs/ASPLOS09
Files:
2 edited

### Legend:

Unmodified
 r247 register $r$, using big-endian numbering. Throughout this paper, we focus on SWAR implementations using the following {\em three-register model}, although it is also possible to define IDISA extensions for a two-register model of the SSE instruction set. Let \verb:simd: represent the class of SIMD operations defined on fields of size $n$ using C++ template syntax.  Given a doubling algorithms on bit streams, the provision of a core set of operations at field widths of 2 and 4 as well as the more traditional field witdhs of 8, 16 and 32 well as the more traditional field widths of 8, 16 and 32 is critical for elegant and efficient expression of the algorithms.  In essence, inductive doubling algorithms \subsection{Bit Gathering Algorithm} \begin{figure}[tbh] \begin{center} \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf} \caption{Serial to Parallel Transposition Using Bit-Gathering} \label{gather} \end{center} \end{figure} % \begin{figure}[tbh] % \begin{center} % \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf} % \caption{Serial to Parallel Transposition Using Bit-Gathering} % \label{gather} % \end{center} % \end{figure} One straightforward algorithm for implementing the transposition process takes advantage of SWAR bit gathering operations that exist on some architectures.  This operation gathers one bit per byte from a particular position within each byte of a SIMD register. For example, the {\tt pmovmskb} operation of the Intel MMX and SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask For example, the {\tt pmovmskb} operation of the Intel SSE instruction set forms a 16-bit mask consisting of the high bit of each byte.  Similarly, the {\tt \verb:si_gbb:} operation of the synergistic processing units of the Cell Broadband Engine gathers together the low bit of each byte from the SIMD register.  Figure \ref{gather} illustrates the bit gathering process. For each bit stream of output, the bit gather algorithm requires one gather operation for each of the 8 input registers, giving 64 bit gather operations in all.  In addition, for seven of the eight bit positions, it is necessary to shift the bits of each input register into the conventional position for gathering.  A total of 56 shift operations are required for this task.  Finally, the result of each bit gather operation must be properly inserted into the output stream.  If the first gather operation for a stream can be used to also initialize the output register, there will then need to be 7 insert operations for the results of the remaining gather operations for this stream, making 56 insert operations in all. The insert step may be more complex than a single operation in some cases, but we use one operation per insert as a lower bound. Thus, the bit gather algorithm requires at least 176 operations to perform the transposition task. from the SIMD register. % Figure \ref{gather} illustrates the % bit gathering process. Using bit gathering, each bit stream of output is assembled 16 positions at a time.  Bits from the input register must be shifted into position, the gather operation performed and the result inserted into position in the output register.  For the 8 streams, this requires at least 22 operations for 16 positions, or 176 operations to complete the transposition task. \subsection{BytePack Algorithm} A much more efficient transposition algorithm on commodity A more efficient transposition algorithm on commodity SWAR architectures involves three stages of binary division transformation.  This is similar \section{Beyond Parallel Bit Streams} It can be argued that text processing with parallel bit streams by itself is not sufficiently important to justify the circuit complexity of IDISA.  In this section, then we move on to consider further applications that may benefit from IDISA features.  These include additional basic inductive doubling algorithms such as parity, and least power-of-2 ceiling. Further applications for narrow field widths are discussed as well, such as packed DNA representations using 2-bit field widths and packed decimal representations using 4-bit field widths. Most significantly, however, we show that IDISA offers a fully general solution to the problem of supporting IDISA has a variety of applications in domains beyond text processing with parallel bit streams.  We present a number of examples in this section, including, most significantly, a full general solution to the problem of supporting {\em horizontal} SWAR operations. \subsection{Inductive Doubling with Logic Operations} \subsection{Parity} \begin{figure} \begin{center}\small \begin{verbatim} x = x - 1; x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >>16); x = x + 1; \end{verbatim} \end{center} \caption{Power-of-2 Ceiling Reference Algorithm} \label{HD-clp2} \end{figure} \begin{figure} \begin{center}\small \begin{verbatim} y = simd<2>::xor(x, x); y = simd<4>::xor(y, y); \end{figure} \begin{figure} \begin{center}\small \begin{verbatim} x = simd<32>::sub(x, simd<32>::const(1)); x = simd<2>::or(x, x); x = simd<4>::or(x, x); x = simd<8>::or(x, x); x = simd<16>::or(x, x); x = simd<32>::or(x, x); x = simd<32>::add(x, simd<32>::const(1)); \end{verbatim} \end{center} \caption{IDISA Power-of-2 Ceiling Implementation} \label{ID-clp2} \end{figure} Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's code for two further inductive doubling examples using logical operators as the combining feature.  In the first of these, the exclusive or'' operation is applied to all bits in a 32-bit value to determine parity. Figures \ref{HD-parity} shows Warren's for parity employing 11 operations, giving rise to a straightforward RefA implementation also using 11 operations. Parity has important applications for error-correcting codes such as the various Hamming codes for detecting and correcting numbers of bit errors dependent on the number of parity bits added.  Warren's version uses 11 operations for parity of a single 32-bit value; a straightforward SWAR adaptation also uses 11 operations for the parity of a set of 32-bit fields. number of parity bits added. Figure \ref{ID-parity} shows the IDISA parity implementation 3X seen in other cases because one of the operands need not be modified before applying the exclusive-or operation. Using the or'' logical operation, Figure \ref{HD-clp2} shows Warren's code for the least power-of-2 ceiling of a 32-bit value.  The technique is to employ inductive doubling to fill in one bits at all bit positions after the first one bit in the original value to first produce a value of the form $2^n-1$.  This is then incremented to determine the power of 2 ceiling.  This function and its straightforward adaptation for SWAR application on a set of 32-bit fields requires 12 operations.   The corresponding IDISA implementation of Figure \ref{ID-clp2} requires 7 operations, just under a 2X improvement. \subsection{Packed DNA Representation} \subsection{Bit Reverse} Include or omit? Bit reverse is an important operation needed in a number of low level codecs.  Following Warren's inductive doubling implementation using masks and shifts \cite{HackersDelight}, a RefA implementation on 32-bit fields requires 28 operations, while a straightforward IDISA-A implementation using \verb#simd::rotli# at each inductive doubling level requires only 5 operations. \subsection{String/Decimal/Integer Conversion} \subsection{Pixel Packing} Pixel packing from 32 bit fields into a 5:5:5 representation is a further application of parallel bit deletion. \subsection{Systematic Support for Horizontal SIMD Operations} In the case of transposition algorithms to and from parallel bit stream form, the architecture has been shown to make possible straightforward inductive doubling algorithms with the lowest total number of operations that can be achieved by any possible 3-register SWAR architecture. straightforward inductive doubling algorithms with a 3X speedup over the best known versions on permute-capable reference architectures, achieving the lowest total number of operations of any possible 3-register SWAR architecture. Applications of IDISA in other areas have also been support in further ways.   For example, it may be possible to develop a pipelined architecture supporting two or three steps of inductive doubling in a single operation. steps of inductive doubling in a single operation. %\appendix