Index: docs/ASPLOS09/asplos094-cameron.tex
===================================================================
--- docs/ASPLOS09/asplos094-cameron.tex (revision 233)
+++ docs/ASPLOS09/asplos094-cameron.tex (revision 234)
@@ -899,5 +899,5 @@
left or a right justified 8-element result by an appropriate shift operation.
-\begin{figure}
+\begin{figure*}
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
@@ -915,5 +915,5 @@
\label{centraldelstep}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
-\end{figure}
+\end{figure*}
The observation about how two $n$-bit central deletion results can
@@ -965,5 +965,211 @@
-\section{Systematic Support for Horizontal SIMD Operations}
+
+\section{Beyond Parallel Bit Streams}
+
+It can be argued that text processing with parallel bit
+streams by itself is not sufficiently important to justify
+the circuit complexity of IDISA. In this section, then
+we move on to consider further applications that may
+benefit from IDISA features. These include additional
+basic inductive doubling
+algorithms such as parity, and least power-of-2 ceiling.
+Further applications for narrow field widths are discussed
+as well, such as packed DNA representations using 2-bit
+field widths and packed decimal representations using 4-bit
+field widths. Most significantly, however, we show that IDISA offers
+a fully general solution to the problem of supporting
+{\em horizontal} SWAR operations.
+
+\subsection{Inductive Doubling with Logic Operations}
+
+\begin{figure}
+\begin{center}\small
+\begin{verbatim}
+y = x ^ (x >> 1);
+y = y ^ (y >> 2);
+y = y ^ (y >> 4);
+y = y ^ (y >> 8);
+y = y ^ (y >>16);
+y = y & 1;
+\end{verbatim}
+\end{center}
+\caption{Parity Reference Algorithm}
+\label{HD-parity}
+\end{figure}
+
+\begin{figure}
+\begin{center}\small
+\begin{verbatim}
+x = x - 1;
+
+x = x | (x >> 1);
+
+x = x | (x >> 2);
+
+x = x | (x >> 4);
+
+x = x | (x >> 8);
+
+x = x | (x >>16);
+
+x = x + 1;
+\end{verbatim}
+\end{center}
+\caption{Power-of-2 Ceiling Reference Algorithm}
+\label{HD-clp2}
+\end{figure}
+
+\begin{figure}
+\begin{center}\small
+\begin{verbatim}
+y = simd<2>::xor(x, x);
+y = simd<4>::xor(y, y);
+y = simd<8>::xor(y, y);
+y = simd<16>::xor(y, y);
+y = simd<32>::xor(y, y);
+\end{verbatim}
+\end{center}
+\caption{IDISA Parity Implementation}
+\label{ID-parity}
+\end{figure}
+
+\begin{figure}
+\begin{center}\small
+\begin{verbatim}
+x = simd<32>::sub(x, simd<32>::const(1));
+x = simd<2>::or(x, x);
+x = simd<4>::or(x, x);
+x = simd<8>::or(x, x);
+x = simd<16>::or(x, x);
+x = simd<32>::or(x, x);
+x = simd<32>::add(x, simd<32>::const(1));
+\end{verbatim}
+\end{center}
+\caption{IDISA Power-of-2 Ceiling Implementation}
+\label{ID-clp2}
+\end{figure}
+
+Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
+code for two further inductive doubling examples using
+logical operators as the combining feature. In the
+first of these, the ``exclusive or'' operation is applied
+to all bits in a 32-bit value to determine parity.
+Parity has important applications for error-correcting
+codes such as the various Hamming codes for detecting
+and correcting numbers of bit errors dependent on the
+number of parity bits added. Warren's version uses
+11 operations for parity of a single 32-bit value;
+a straightforward SWAR adaptation also uses 11 operations
+for the parity of a set of 32-bit fields.
+
+Figure \ref{ID-parity} shows the IDISA parity implementation
+with only 5 operations required. This represents slightly
+more than a 2X improvement. The improvement is less than
+3X seen in other cases because one of the operands need
+not be modified before applying the exclusive-or operation.
+
+Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
+code for the least power-of-2 ceiling of a 32-bit value. The
+technique is to employ inductive doubling to fill in one bits at
+all bit positions after the first one bit in the original value to
+first produce a value of the form $2^n-1$. This is then incremented
+to determine the power of 2 ceiling. This function and
+its straightforward adaptation for SWAR application on a
+set of 32-bit fields requires 12 operations. The
+corresponding IDISA implementation of Figure \ref{ID-clp2}
+requires 7 operations, just under a 2X improvement.
+
+\subsection{Packed DNA Representation}
+
+DNA sequences are often represented as strings consisting
+of the four nucleotide codes A, C, G and T. Internally,
+these sequences are frequently represented in internal
+form as packed sequences of 2-bit values. The IDISA
+\verb#simd<8>:pack# and \verb#simd<4>:pack# operations
+allow these packed representations to be quickly computed
+from byte-oriented string values by two steps of inductive
+halving. Similarly, conversion back to string form
+can use two steps of inductive merging. Without direct
+support for these pack and merge operations, the SWAR
+implementations of these conversions require the cost
+of explicit masking and shifting in combination with
+the 16-bit to 8-bit packing and 8-bit to 16-bit
+merging operations supported by existing SWAR facilities
+on commodity processors.
+
+\subsection{Bit Reverse}
+
+Include or omit?
+
+\subsection{String/Decimal/Integer Conversion}
+\begin{figure}
+\begin{center}\small
+\begin{verbatim}
+b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
+b = (d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF);
+b = (d & 0x0000FFFF) + 10000 * (d >> 16);
+\end{verbatim}
+\end{center}
+\caption{BCD to Integer Reference Algorithm}
+\label{BCD2int}
+\end{figure}
+
+\begin{figure}
+\begin{center}\small
+\begin{verbatim}
+c10 = simd<16>:const(10);
+c100 = simd<16>:const(100);
+c10000 = simd<32>:const(10000);
+b = simd<8>::add(simd<8>::mult(d, c10), d);
+b = simd<16>::add(simd<16>::mult(b, c100), b);
+b = simd<32>::add(simd<32>::mult(b, c10000), b);
+\end{verbatim}
+\end{center}
+\caption{IDISA Parity Implementation}
+\label{ID-BCD2int}
+\end{figure}
+
+Just as DNA sequences represent an important use case for
+SWAR operations on 2-bit fields, packed sequences of
+decimal or hexadecimal digits represent a common use case
+for 4-bit fields. These representations can be used
+both as an intermediate form in numeric string to integer
+conversion and as a direct representation for
+packed binary coded decimal.
+
+Figure \ref{BCD2int} shows a three-step inductive
+doubling implementation for conversion of 32-bit packed BCD
+values to integer form. The 32-bit value consists
+of 8 4-bit decimal digits. Pairs of digits are
+first combined by multiplying the higher digit
+of the pair by 10 and adding. Pairs of these
+two-digit results are then further combined by
+multiplying the value of the higher of the two-digit
+results by 100 and adding. The final step is
+to combine four-digit results by multiplying the
+higher one by 10000 and adding. Overall, 20
+operations are required for this implementation
+as well as the corresponding SWAR implementation
+for sets of 32-bit fields. Preloading of 6 constants
+into registers for repeated use can reduce the number of
+operations to 14 at the cost of significant register
+pressure.
+
+The IDISA implementation of this algorithm is shown
+in Figure \ref{ID-BCD2int}. This implementation
+shows an interesting variation in the use of
+half-operand modifiers, with only one operand
+of each of the addition and multiplication operations
+modified at each level. Overall, this implementation
+requires 9 operations, or 6 operations with 3
+preloaded constants. This represents more than a 2X
+reduction in instruction count as well as a 2X reduction
+in register pressure.
+
+
+
+
+\subsection{Systematic Support for Horizontal SIMD Operations}
In SIMD parlance, {\em horizontal} operations are
@@ -1046,4 +1252,22 @@
in the context of particular architectures is a potential
area for further work.
+
+
+\begin{figure*}
+\begin{center}
+\begin{tabular}{|c||c|c|c|}
+\hline
+Kernel & Altivec ops & IDISA ops & Speed-up \\ \hline
+pop\_count<32> & & 5n & 3X \\ \hline
+
+bit\_reverse<32> & & & \\ \hline
+Gray2binary<32> & 10n & 5n & 2X\\ \hline
+\end{tabular}
+\end{center}
+\label{perftable}
+\caption{Performance Results}
+\end{figure*}
+
+
\section{Implementation}
@@ -1164,4 +1388,7 @@
+
+
+
\section{Conclusions}