Changeset 234 for docs

Dec 15, 2008, 6:34:17 AM (11 years ago)

Added non bitstream examples.

2 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r233 r234  
    899899left or a right justified 8-element result by an appropriate shift operation.
    901 \begin{figure}
    916916\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
    917 \end{figure}
    919919The observation about how two $n$-bit central deletion results can
    967 \section{Systematic Support for Horizontal SIMD Operations}
     968\section{Beyond Parallel Bit Streams}
     970It can be argued that text processing with parallel bit
     971streams by itself is not sufficiently important to justify
     972the circuit complexity of IDISA.  In this section, then
     973we move on to consider further applications that may
     974benefit from IDISA features.  These include additional
     975basic inductive doubling
     976algorithms such as parity, and least power-of-2 ceiling.
     977Further applications for narrow field widths are discussed
     978as well, such as packed DNA representations using 2-bit
     979field widths and packed decimal representations using 4-bit
     980field widths. Most significantly, however, we show that IDISA offers
     981a fully general solution to the problem of supporting
     982{\em horizontal} SWAR operations.
     984\subsection{Inductive Doubling with Logic Operations}
     989y = x ^ (x >> 1);
     990y = y ^ (y >> 2);
     991y = y ^ (y >> 4);
     992y = y ^ (y >> 8);
     993y = y ^ (y >>16);
     994y = y & 1;
     997\caption{Parity Reference Algorithm}
     1004x = x - 1;
     1006x = x | (x >> 1);
     1008x = x | (x >> 2);
     1010x = x | (x >> 4);
     1012x = x | (x >> 8);
     1014x = x | (x >>16);
     1016x = x + 1;
     1019\caption{Power-of-2 Ceiling Reference Algorithm}
     1026y = simd<2>::xor<l,h>(x, x);
     1027y = simd<4>::xor<l,h>(y, y);
     1028y = simd<8>::xor<l,h>(y, y);
     1029y = simd<16>::xor<l,h>(y, y);
     1030y = simd<32>::xor<l,h>(y, y);
     1033\caption{IDISA Parity Implementation}
     1040x = simd<32>::sub(x, simd<32>::const(1));
     1041x = simd<2>::or<x, h>(x, x);
     1042x = simd<4>::or<x, h>(x, x);
     1043x = simd<8>::or<x, h>(x, x);
     1044x = simd<16>::or<x, h>(x, x);
     1045x = simd<32>::or<x, h>(x, x);
     1046x = simd<32>::add(x, simd<32>::const(1));
     1049\caption{IDISA Power-of-2 Ceiling Implementation}
     1053Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
     1054code for two further inductive doubling examples using
     1055logical operators as the combining feature.  In the
     1056first of these, the ``exclusive or'' operation is applied
     1057to all bits in a 32-bit value to determine parity.
     1058Parity has important applications for error-correcting
     1059codes such as the various Hamming codes for detecting
     1060and correcting numbers of bit errors dependent on the
     1061number of parity bits added.  Warren's version uses
     106211 operations for parity of a single 32-bit value;
     1063a straightforward SWAR adaptation also uses 11 operations
     1064for the parity of a set of 32-bit fields.
     1066Figure \ref{ID-parity} shows the IDISA parity implementation
     1067with only 5 operations required.  This represents slightly
     1068more than a 2X improvement.  The improvement is less than
     10693X seen in other cases because one of the operands need
     1070not be modified before applying the exclusive-or operation.
     1072Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
     1073code for the least power-of-2 ceiling of a 32-bit value.  The
     1074technique is to employ inductive doubling to fill in one bits at
     1075all bit positions after the first one bit in the original value to
     1076first produce a value of the form $2^n-1$.  This is then incremented
     1077to determine the power of 2 ceiling.  This function and
     1078its straightforward adaptation for SWAR application on a
     1079set of 32-bit fields requires 12 operations.   The
     1080corresponding IDISA implementation of Figure \ref{ID-clp2}
     1081requires 7 operations, just under a 2X improvement.
     1083\subsection{Packed DNA Representation}
     1085DNA sequences are often represented as strings consisting
     1086of the four nucleotide codes A, C, G and T.  Internally,
     1087these sequences are frequently represented in internal
     1088form as packed sequences of 2-bit values.  The IDISA
     1089\verb#simd<8>:pack# and \verb#simd<4>:pack# operations
     1090allow these packed representations to be quickly computed
     1091from byte-oriented string values by two steps of inductive
     1092halving.   Similarly, conversion back to string form
     1093can use two steps of inductive merging.  Without direct
     1094support for these pack and merge operations, the SWAR
     1095implementations of these conversions require the cost
     1096of explicit masking and shifting in combination with
     1097the 16-bit to 8-bit packing and 8-bit to 16-bit
     1098merging operations supported by existing SWAR facilities
     1099on commodity processors.
     1101\subsection{Bit Reverse}
     1103Include or omit?
     1105\subsection{String/Decimal/Integer Conversion}
     1109b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
     1110b = (d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF);
     1111b = (d & 0x0000FFFF) + 10000 * (d >> 16);
     1114\caption{BCD to Integer Reference Algorithm}
     1121c10 = simd<16>:const(10);
     1122c100 = simd<16>:const(100);
     1123c10000 = simd<32>:const(10000);
     1124b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
     1125b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
     1126b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
     1129\caption{IDISA Parity Implementation}
     1133Just as DNA sequences represent an important use case for
     1134SWAR operations on 2-bit fields, packed sequences of
     1135decimal or hexadecimal digits represent a common use case
     1136for 4-bit fields.  These representations can be used
     1137both as an intermediate form in numeric string to integer
     1138conversion and as a direct representation for
     1139packed binary coded decimal.
     1141Figure \ref{BCD2int} shows a three-step inductive
     1142doubling implementation for conversion of 32-bit packed BCD
     1143values to integer form.  The 32-bit value consists
     1144of 8 4-bit decimal digits.  Pairs of digits are
     1145first combined by multiplying the higher digit
     1146of the pair by 10 and adding.  Pairs of these
     1147two-digit results are then further combined by
     1148multiplying the value of the higher of the two-digit
     1149results by 100 and adding.  The final step is
     1150to combine four-digit results by multiplying the
     1151higher one by 10000 and adding.  Overall, 20
     1152operations are required for this implementation
     1153as well as the corresponding SWAR implementation
     1154for sets of 32-bit fields.  Preloading of 6 constants
     1155into registers for repeated use can reduce the number of
     1156operations to 14 at the cost of significant register
     1159The IDISA implementation of this algorithm is shown
     1160in Figure \ref{ID-BCD2int}.  This implementation
     1161shows an interesting variation in the use of
     1162half-operand modifiers, with only one operand
     1163of each of the addition and multiplication operations
     1164modified at each level.  Overall, this implementation
     1165requires 9 operations, or 6 operations with 3
     1166preloaded constants.  This represents more than a 2X
     1167reduction in instruction count as well as a 2X reduction
     1168in register pressure.
     1173\subsection{Systematic Support for Horizontal SIMD Operations}
    9691175In SIMD parlance, {\em horizontal} operations are
    10461252in the context of particular architectures is a potential
    10471253area for further work.
     1260Kernel & Altivec ops & IDISA ops & Speed-up  \\ \hline
     1261pop\_count<32> &  & 5n & 3X  \\ \hline
     1263bit\_reverse<32> & & & \\ \hline
     1264Gray2binary<32> & 10n & 5n &  2X\\ \hline
     1268\caption{Performance Results}
Note: See TracChangeset for help on using the changeset viewer.