# Changeset 234 for docs/ASPLOS09/asplos094-cameron.tex

Ignore:
Timestamp:
Dec 15, 2008, 6:34:17 AM (11 years ago)
Message:

 r233 left or a right justified 8-element result by an appropriate shift operation. \begin{figure} \begin{figure*} \begin{center} \begin{tabular}{|c||c|c|c|c|c|c|c|c|} \label{centraldelstep} \caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction} \end{figure} \end{figure*} The observation about how two $n$-bit central deletion results can \section{Systematic Support for Horizontal SIMD Operations} \section{Beyond Parallel Bit Streams} It can be argued that text processing with parallel bit streams by itself is not sufficiently important to justify the circuit complexity of IDISA.  In this section, then we move on to consider further applications that may benefit from IDISA features.  These include additional basic inductive doubling algorithms such as parity, and least power-of-2 ceiling. Further applications for narrow field widths are discussed as well, such as packed DNA representations using 2-bit field widths and packed decimal representations using 4-bit field widths. Most significantly, however, we show that IDISA offers a fully general solution to the problem of supporting {\em horizontal} SWAR operations. \subsection{Inductive Doubling with Logic Operations} \begin{figure} \begin{center}\small \begin{verbatim} y = x ^ (x >> 1); y = y ^ (y >> 2); y = y ^ (y >> 4); y = y ^ (y >> 8); y = y ^ (y >>16); y = y & 1; \end{verbatim} \end{center} \caption{Parity Reference Algorithm} \label{HD-parity} \end{figure} \begin{figure} \begin{center}\small \begin{verbatim} x = x - 1; x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >>16); x = x + 1; \end{verbatim} \end{center} \caption{Power-of-2 Ceiling Reference Algorithm} \label{HD-clp2} \end{figure} \begin{figure} \begin{center}\small \begin{verbatim} y = simd<2>::xor(x, x); y = simd<4>::xor(y, y); y = simd<8>::xor(y, y); y = simd<16>::xor(y, y); y = simd<32>::xor(y, y); \end{verbatim} \end{center} \caption{IDISA Parity Implementation} \label{ID-parity} \end{figure} \begin{figure} \begin{center}\small \begin{verbatim} x = simd<32>::sub(x, simd<32>::const(1)); x = simd<2>::or(x, x); x = simd<4>::or(x, x); x = simd<8>::or(x, x); x = simd<16>::or(x, x); x = simd<32>::or(x, x); x = simd<32>::add(x, simd<32>::const(1)); \end{verbatim} \end{center} \caption{IDISA Power-of-2 Ceiling Implementation} \label{ID-clp2} \end{figure} Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's code for two further inductive doubling examples using logical operators as the combining feature.  In the first of these, the exclusive or'' operation is applied to all bits in a 32-bit value to determine parity. Parity has important applications for error-correcting codes such as the various Hamming codes for detecting and correcting numbers of bit errors dependent on the number of parity bits added.  Warren's version uses 11 operations for parity of a single 32-bit value; a straightforward SWAR adaptation also uses 11 operations for the parity of a set of 32-bit fields. Figure \ref{ID-parity} shows the IDISA parity implementation with only 5 operations required.  This represents slightly more than a 2X improvement.  The improvement is less than 3X seen in other cases because one of the operands need not be modified before applying the exclusive-or operation. Using the or'' logical operation, Figure \ref{HD-clp2} shows Warren's code for the least power-of-2 ceiling of a 32-bit value.  The technique is to employ inductive doubling to fill in one bits at all bit positions after the first one bit in the original value to first produce a value of the form $2^n-1$.  This is then incremented to determine the power of 2 ceiling.  This function and its straightforward adaptation for SWAR application on a set of 32-bit fields requires 12 operations.   The corresponding IDISA implementation of Figure \ref{ID-clp2} requires 7 operations, just under a 2X improvement. \subsection{Packed DNA Representation} DNA sequences are often represented as strings consisting of the four nucleotide codes A, C, G and T.  Internally, these sequences are frequently represented in internal form as packed sequences of 2-bit values.  The IDISA \verb#simd<8>:pack# and \verb#simd<4>:pack# operations allow these packed representations to be quickly computed from byte-oriented string values by two steps of inductive halving.   Similarly, conversion back to string form can use two steps of inductive merging.  Without direct support for these pack and merge operations, the SWAR implementations of these conversions require the cost of explicit masking and shifting in combination with the 16-bit to 8-bit packing and 8-bit to 16-bit merging operations supported by existing SWAR facilities on commodity processors. \subsection{Bit Reverse} Include or omit? \subsection{String/Decimal/Integer Conversion} \begin{figure} \begin{center}\small \begin{verbatim} b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F); b = (d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF); b = (d & 0x0000FFFF) + 10000 * (d >> 16); \end{verbatim} \end{center} \caption{BCD to Integer Reference Algorithm} \label{BCD2int} \end{figure} \begin{figure} \begin{center}\small \begin{verbatim} c10 = simd<16>:const(10); c100 = simd<16>:const(100); c10000 = simd<32>:const(10000); b = simd<8>::add(simd<8>::mult(d, c10), d); b = simd<16>::add(simd<16>::mult(b, c100), b); b = simd<32>::add(simd<32>::mult(b, c10000), b); \end{verbatim} \end{center} \caption{IDISA Parity Implementation} \label{ID-BCD2int} \end{figure} Just as DNA sequences represent an important use case for SWAR operations on 2-bit fields, packed sequences of decimal or hexadecimal digits represent a common use case for 4-bit fields.  These representations can be used both as an intermediate form in numeric string to integer conversion and as a direct representation for packed binary coded decimal. Figure \ref{BCD2int} shows a three-step inductive doubling implementation for conversion of 32-bit packed BCD values to integer form.  The 32-bit value consists of 8 4-bit decimal digits.  Pairs of digits are first combined by multiplying the higher digit of the pair by 10 and adding.  Pairs of these two-digit results are then further combined by multiplying the value of the higher of the two-digit results by 100 and adding.  The final step is to combine four-digit results by multiplying the higher one by 10000 and adding.  Overall, 20 operations are required for this implementation as well as the corresponding SWAR implementation for sets of 32-bit fields.  Preloading of 6 constants into registers for repeated use can reduce the number of operations to 14 at the cost of significant register pressure. The IDISA implementation of this algorithm is shown in Figure \ref{ID-BCD2int}.  This implementation shows an interesting variation in the use of half-operand modifiers, with only one operand of each of the addition and multiplication operations modified at each level.  Overall, this implementation requires 9 operations, or 6 operations with 3 preloaded constants.  This represents more than a 2X reduction in instruction count as well as a 2X reduction in register pressure. \subsection{Systematic Support for Horizontal SIMD Operations} In SIMD parlance, {\em horizontal} operations are in the context of particular architectures is a potential area for further work. \begin{figure*} \begin{center} \begin{tabular}{|c||c|c|c|} \hline Kernel & Altivec ops & IDISA ops & Speed-up  \\ \hline pop\_count<32> &  & 5n & 3X  \\ \hline bit\_reverse<32> & & & \\ \hline Gray2binary<32> & 10n & 5n &  2X\\ \hline \end{tabular} \end{center} \label{perftable} \caption{Performance Results} \end{figure*} \section{Implementation} \section{Conclusions}