Changeset 234 for docs/ASPLOS09


Ignore:
Timestamp:
Dec 15, 2008, 6:34:17 AM (11 years ago)
Author:
cameron
Message:

Added non bitstream examples.

Location:
docs/ASPLOS09
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/ASPLOS09/asplos094-cameron.tex

    r233 r234  
    899899left or a right justified 8-element result by an appropriate shift operation.
    900900
    901 \begin{figure}
     901\begin{figure*}
    902902\begin{center}
    903903\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
     
    915915\label{centraldelstep}
    916916\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
    917 \end{figure}
     917\end{figure*}
    918918
    919919The observation about how two $n$-bit central deletion results can
     
    965965
    966966
    967 \section{Systematic Support for Horizontal SIMD Operations}
     967
     968\section{Beyond Parallel Bit Streams}
     969
     970It can be argued that text processing with parallel bit
     971streams by itself is not sufficiently important to justify
     972the circuit complexity of IDISA.  In this section, then
     973we move on to consider further applications that may
     974benefit from IDISA features.  These include additional
     975basic inductive doubling
     976algorithms such as parity, and least power-of-2 ceiling.
     977Further applications for narrow field widths are discussed
     978as well, such as packed DNA representations using 2-bit
     979field widths and packed decimal representations using 4-bit
     980field widths. Most significantly, however, we show that IDISA offers
     981a fully general solution to the problem of supporting
     982{\em horizontal} SWAR operations.
     983
     984\subsection{Inductive Doubling with Logic Operations}
     985
     986\begin{figure}
     987\begin{center}\small
     988\begin{verbatim}
     989y = x ^ (x >> 1);
     990y = y ^ (y >> 2);
     991y = y ^ (y >> 4);
     992y = y ^ (y >> 8);
     993y = y ^ (y >>16);
     994y = y & 1;
     995\end{verbatim}
     996\end{center}
     997\caption{Parity Reference Algorithm}
     998\label{HD-parity}
     999\end{figure}
     1000
     1001\begin{figure}
     1002\begin{center}\small
     1003\begin{verbatim}
     1004x = x - 1;
     1005
     1006x = x | (x >> 1);
     1007
     1008x = x | (x >> 2);
     1009
     1010x = x | (x >> 4);
     1011
     1012x = x | (x >> 8);
     1013
     1014x = x | (x >>16);
     1015
     1016x = x + 1;
     1017\end{verbatim}
     1018\end{center}
     1019\caption{Power-of-2 Ceiling Reference Algorithm}
     1020\label{HD-clp2}
     1021\end{figure}
     1022
     1023\begin{figure}
     1024\begin{center}\small
     1025\begin{verbatim}
     1026y = simd<2>::xor<l,h>(x, x);
     1027y = simd<4>::xor<l,h>(y, y);
     1028y = simd<8>::xor<l,h>(y, y);
     1029y = simd<16>::xor<l,h>(y, y);
     1030y = simd<32>::xor<l,h>(y, y);
     1031\end{verbatim}
     1032\end{center}
     1033\caption{IDISA Parity Implementation}
     1034\label{ID-parity}
     1035\end{figure}
     1036
     1037\begin{figure}
     1038\begin{center}\small
     1039\begin{verbatim}
     1040x = simd<32>::sub(x, simd<32>::const(1));
     1041x = simd<2>::or<x, h>(x, x);
     1042x = simd<4>::or<x, h>(x, x);
     1043x = simd<8>::or<x, h>(x, x);
     1044x = simd<16>::or<x, h>(x, x);
     1045x = simd<32>::or<x, h>(x, x);
     1046x = simd<32>::add(x, simd<32>::const(1));
     1047\end{verbatim}
     1048\end{center}
     1049\caption{IDISA Power-of-2 Ceiling Implementation}
     1050\label{ID-clp2}
     1051\end{figure}
     1052
     1053Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
     1054code for two further inductive doubling examples using
     1055logical operators as the combining feature.  In the
     1056first of these, the ``exclusive or'' operation is applied
     1057to all bits in a 32-bit value to determine parity.
     1058Parity has important applications for error-correcting
     1059codes such as the various Hamming codes for detecting
     1060and correcting numbers of bit errors dependent on the
     1061number of parity bits added.  Warren's version uses
     106211 operations for parity of a single 32-bit value;
     1063a straightforward SWAR adaptation also uses 11 operations
     1064for the parity of a set of 32-bit fields.
     1065
     1066Figure \ref{ID-parity} shows the IDISA parity implementation
     1067with only 5 operations required.  This represents slightly
     1068more than a 2X improvement.  The improvement is less than
     10693X seen in other cases because one of the operands need
     1070not be modified before applying the exclusive-or operation.
     1071
     1072Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
     1073code for the least power-of-2 ceiling of a 32-bit value.  The
     1074technique is to employ inductive doubling to fill in one bits at
     1075all bit positions after the first one bit in the original value to
     1076first produce a value of the form $2^n-1$.  This is then incremented
     1077to determine the power of 2 ceiling.  This function and
     1078its straightforward adaptation for SWAR application on a
     1079set of 32-bit fields requires 12 operations.   The
     1080corresponding IDISA implementation of Figure \ref{ID-clp2}
     1081requires 7 operations, just under a 2X improvement.
     1082
     1083\subsection{Packed DNA Representation}
     1084
     1085DNA sequences are often represented as strings consisting
     1086of the four nucleotide codes A, C, G and T.  Internally,
     1087these sequences are frequently represented in internal
     1088form as packed sequences of 2-bit values.  The IDISA
     1089\verb#simd<8>:pack# and \verb#simd<4>:pack# operations
     1090allow these packed representations to be quickly computed
     1091from byte-oriented string values by two steps of inductive
     1092halving.   Similarly, conversion back to string form
     1093can use two steps of inductive merging.  Without direct
     1094support for these pack and merge operations, the SWAR
     1095implementations of these conversions require the cost
     1096of explicit masking and shifting in combination with
     1097the 16-bit to 8-bit packing and 8-bit to 16-bit
     1098merging operations supported by existing SWAR facilities
     1099on commodity processors.
     1100
     1101\subsection{Bit Reverse}
     1102
     1103Include or omit?
     1104
     1105\subsection{String/Decimal/Integer Conversion}
     1106\begin{figure}
     1107\begin{center}\small
     1108\begin{verbatim}
     1109b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
     1110b = (d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF);
     1111b = (d & 0x0000FFFF) + 10000 * (d >> 16);
     1112\end{verbatim}
     1113\end{center}
     1114\caption{BCD to Integer Reference Algorithm}
     1115\label{BCD2int}
     1116\end{figure}
     1117
     1118\begin{figure}
     1119\begin{center}\small
     1120\begin{verbatim}
     1121c10 = simd<16>:const(10);
     1122c100 = simd<16>:const(100);
     1123c10000 = simd<32>:const(10000);
     1124b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
     1125b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
     1126b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
     1127\end{verbatim}
     1128\end{center}
     1129\caption{IDISA Parity Implementation}
     1130\label{ID-BCD2int}
     1131\end{figure}
     1132
     1133Just as DNA sequences represent an important use case for
     1134SWAR operations on 2-bit fields, packed sequences of
     1135decimal or hexadecimal digits represent a common use case
     1136for 4-bit fields.  These representations can be used
     1137both as an intermediate form in numeric string to integer
     1138conversion and as a direct representation for
     1139packed binary coded decimal.
     1140
     1141Figure \ref{BCD2int} shows a three-step inductive
     1142doubling implementation for conversion of 32-bit packed BCD
     1143values to integer form.  The 32-bit value consists
     1144of 8 4-bit decimal digits.  Pairs of digits are
     1145first combined by multiplying the higher digit
     1146of the pair by 10 and adding.  Pairs of these
     1147two-digit results are then further combined by
     1148multiplying the value of the higher of the two-digit
     1149results by 100 and adding.  The final step is
     1150to combine four-digit results by multiplying the
     1151higher one by 10000 and adding.  Overall, 20
     1152operations are required for this implementation
     1153as well as the corresponding SWAR implementation
     1154for sets of 32-bit fields.  Preloading of 6 constants
     1155into registers for repeated use can reduce the number of
     1156operations to 14 at the cost of significant register
     1157pressure.
     1158
     1159The IDISA implementation of this algorithm is shown
     1160in Figure \ref{ID-BCD2int}.  This implementation
     1161shows an interesting variation in the use of
     1162half-operand modifiers, with only one operand
     1163of each of the addition and multiplication operations
     1164modified at each level.  Overall, this implementation
     1165requires 9 operations, or 6 operations with 3
     1166preloaded constants.  This represents more than a 2X
     1167reduction in instruction count as well as a 2X reduction
     1168in register pressure.
     1169
     1170
     1171
     1172
     1173\subsection{Systematic Support for Horizontal SIMD Operations}
    9681174
    9691175In SIMD parlance, {\em horizontal} operations are
     
    10461252in the context of particular architectures is a potential
    10471253area for further work.
     1254
     1255
     1256\begin{figure*}
     1257\begin{center}
     1258\begin{tabular}{|c||c|c|c|}
     1259\hline
     1260Kernel & Altivec ops & IDISA ops & Speed-up  \\ \hline
     1261pop\_count<32> &  & 5n & 3X  \\ \hline
     1262
     1263bit\_reverse<32> & & & \\ \hline
     1264Gray2binary<32> & 10n & 5n &  2X\\ \hline
     1265\end{tabular}
     1266\end{center}
     1267\label{perftable}
     1268\caption{Performance Results}
     1269\end{figure*}
     1270
     1271
    10481272
    10491273\section{Implementation}
     
    11641388
    11651389
     1390
     1391
     1392
    11661393\section{Conclusions}
    11671394
Note: See TracChangeset for help on using the changeset viewer.