Changeset 234 for docs/ASPLOS09
 Timestamp:
 Dec 15, 2008, 6:34:17 AM (11 years ago)
 Location:
 docs/ASPLOS09
 Files:

 2 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r233 r234 899 899 left or a right justified 8element result by an appropriate shift operation. 900 900 901 \begin{figure }901 \begin{figure*} 902 902 \begin{center} 903 903 \begin{tabular}{ccccccccc} … … 915 915 \label{centraldelstep} 916 916 \caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction} 917 \end{figure }917 \end{figure*} 918 918 919 919 The observation about how two $n$bit central deletion results can … … 965 965 966 966 967 \section{Systematic Support for Horizontal SIMD Operations} 967 968 \section{Beyond Parallel Bit Streams} 969 970 It can be argued that text processing with parallel bit 971 streams by itself is not sufficiently important to justify 972 the circuit complexity of IDISA. In this section, then 973 we move on to consider further applications that may 974 benefit from IDISA features. These include additional 975 basic inductive doubling 976 algorithms such as parity, and least powerof2 ceiling. 977 Further applications for narrow field widths are discussed 978 as well, such as packed DNA representations using 2bit 979 field widths and packed decimal representations using 4bit 980 field widths. Most significantly, however, we show that IDISA offers 981 a fully general solution to the problem of supporting 982 {\em horizontal} SWAR operations. 983 984 \subsection{Inductive Doubling with Logic Operations} 985 986 \begin{figure} 987 \begin{center}\small 988 \begin{verbatim} 989 y = x ^ (x >> 1); 990 y = y ^ (y >> 2); 991 y = y ^ (y >> 4); 992 y = y ^ (y >> 8); 993 y = y ^ (y >>16); 994 y = y & 1; 995 \end{verbatim} 996 \end{center} 997 \caption{Parity Reference Algorithm} 998 \label{HDparity} 999 \end{figure} 1000 1001 \begin{figure} 1002 \begin{center}\small 1003 \begin{verbatim} 1004 x = x  1; 1005 1006 x = x  (x >> 1); 1007 1008 x = x  (x >> 2); 1009 1010 x = x  (x >> 4); 1011 1012 x = x  (x >> 8); 1013 1014 x = x  (x >>16); 1015 1016 x = x + 1; 1017 \end{verbatim} 1018 \end{center} 1019 \caption{Powerof2 Ceiling Reference Algorithm} 1020 \label{HDclp2} 1021 \end{figure} 1022 1023 \begin{figure} 1024 \begin{center}\small 1025 \begin{verbatim} 1026 y = simd<2>::xor<l,h>(x, x); 1027 y = simd<4>::xor<l,h>(y, y); 1028 y = simd<8>::xor<l,h>(y, y); 1029 y = simd<16>::xor<l,h>(y, y); 1030 y = simd<32>::xor<l,h>(y, y); 1031 \end{verbatim} 1032 \end{center} 1033 \caption{IDISA Parity Implementation} 1034 \label{IDparity} 1035 \end{figure} 1036 1037 \begin{figure} 1038 \begin{center}\small 1039 \begin{verbatim} 1040 x = simd<32>::sub(x, simd<32>::const(1)); 1041 x = simd<2>::or<x, h>(x, x); 1042 x = simd<4>::or<x, h>(x, x); 1043 x = simd<8>::or<x, h>(x, x); 1044 x = simd<16>::or<x, h>(x, x); 1045 x = simd<32>::or<x, h>(x, x); 1046 x = simd<32>::add(x, simd<32>::const(1)); 1047 \end{verbatim} 1048 \end{center} 1049 \caption{IDISA Powerof2 Ceiling Implementation} 1050 \label{IDclp2} 1051 \end{figure} 1052 1053 Figures \ref{HDparity} and \ref{HDclp2} show Warren's 1054 code for two further inductive doubling examples using 1055 logical operators as the combining feature. In the 1056 first of these, the ``exclusive or'' operation is applied 1057 to all bits in a 32bit value to determine parity. 1058 Parity has important applications for errorcorrecting 1059 codes such as the various Hamming codes for detecting 1060 and correcting numbers of bit errors dependent on the 1061 number of parity bits added. Warren's version uses 1062 11 operations for parity of a single 32bit value; 1063 a straightforward SWAR adaptation also uses 11 operations 1064 for the parity of a set of 32bit fields. 1065 1066 Figure \ref{IDparity} shows the IDISA parity implementation 1067 with only 5 operations required. This represents slightly 1068 more than a 2X improvement. The improvement is less than 1069 3X seen in other cases because one of the operands need 1070 not be modified before applying the exclusiveor operation. 1071 1072 Using the ``or'' logical operation, Figure \ref{HDclp2} shows Warren's 1073 code for the least powerof2 ceiling of a 32bit value. The 1074 technique is to employ inductive doubling to fill in one bits at 1075 all bit positions after the first one bit in the original value to 1076 first produce a value of the form $2^n1$. This is then incremented 1077 to determine the power of 2 ceiling. This function and 1078 its straightforward adaptation for SWAR application on a 1079 set of 32bit fields requires 12 operations. The 1080 corresponding IDISA implementation of Figure \ref{IDclp2} 1081 requires 7 operations, just under a 2X improvement. 1082 1083 \subsection{Packed DNA Representation} 1084 1085 DNA sequences are often represented as strings consisting 1086 of the four nucleotide codes A, C, G and T. Internally, 1087 these sequences are frequently represented in internal 1088 form as packed sequences of 2bit values. The IDISA 1089 \verb#simd<8>:pack# and \verb#simd<4>:pack# operations 1090 allow these packed representations to be quickly computed 1091 from byteoriented string values by two steps of inductive 1092 halving. Similarly, conversion back to string form 1093 can use two steps of inductive merging. Without direct 1094 support for these pack and merge operations, the SWAR 1095 implementations of these conversions require the cost 1096 of explicit masking and shifting in combination with 1097 the 16bit to 8bit packing and 8bit to 16bit 1098 merging operations supported by existing SWAR facilities 1099 on commodity processors. 1100 1101 \subsection{Bit Reverse} 1102 1103 Include or omit? 1104 1105 \subsection{String/Decimal/Integer Conversion} 1106 \begin{figure} 1107 \begin{center}\small 1108 \begin{verbatim} 1109 b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F); 1110 b = (d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF); 1111 b = (d & 0x0000FFFF) + 10000 * (d >> 16); 1112 \end{verbatim} 1113 \end{center} 1114 \caption{BCD to Integer Reference Algorithm} 1115 \label{BCD2int} 1116 \end{figure} 1117 1118 \begin{figure} 1119 \begin{center}\small 1120 \begin{verbatim} 1121 c10 = simd<16>:const(10); 1122 c100 = simd<16>:const(100); 1123 c10000 = simd<32>:const(10000); 1124 b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d); 1125 b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b); 1126 b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b); 1127 \end{verbatim} 1128 \end{center} 1129 \caption{IDISA Parity Implementation} 1130 \label{IDBCD2int} 1131 \end{figure} 1132 1133 Just as DNA sequences represent an important use case for 1134 SWAR operations on 2bit fields, packed sequences of 1135 decimal or hexadecimal digits represent a common use case 1136 for 4bit fields. These representations can be used 1137 both as an intermediate form in numeric string to integer 1138 conversion and as a direct representation for 1139 packed binary coded decimal. 1140 1141 Figure \ref{BCD2int} shows a threestep inductive 1142 doubling implementation for conversion of 32bit packed BCD 1143 values to integer form. The 32bit value consists 1144 of 8 4bit decimal digits. Pairs of digits are 1145 first combined by multiplying the higher digit 1146 of the pair by 10 and adding. Pairs of these 1147 twodigit results are then further combined by 1148 multiplying the value of the higher of the twodigit 1149 results by 100 and adding. The final step is 1150 to combine fourdigit results by multiplying the 1151 higher one by 10000 and adding. Overall, 20 1152 operations are required for this implementation 1153 as well as the corresponding SWAR implementation 1154 for sets of 32bit fields. Preloading of 6 constants 1155 into registers for repeated use can reduce the number of 1156 operations to 14 at the cost of significant register 1157 pressure. 1158 1159 The IDISA implementation of this algorithm is shown 1160 in Figure \ref{IDBCD2int}. This implementation 1161 shows an interesting variation in the use of 1162 halfoperand modifiers, with only one operand 1163 of each of the addition and multiplication operations 1164 modified at each level. Overall, this implementation 1165 requires 9 operations, or 6 operations with 3 1166 preloaded constants. This represents more than a 2X 1167 reduction in instruction count as well as a 2X reduction 1168 in register pressure. 1169 1170 1171 1172 1173 \subsection{Systematic Support for Horizontal SIMD Operations} 968 1174 969 1175 In SIMD parlance, {\em horizontal} operations are … … 1046 1252 in the context of particular architectures is a potential 1047 1253 area for further work. 1254 1255 1256 \begin{figure*} 1257 \begin{center} 1258 \begin{tabular}{cccc} 1259 \hline 1260 Kernel & Altivec ops & IDISA ops & Speedup \\ \hline 1261 pop\_count<32> & & 5n & 3X \\ \hline 1262 1263 bit\_reverse<32> & & & \\ \hline 1264 Gray2binary<32> & 10n & 5n & 2X\\ \hline 1265 \end{tabular} 1266 \end{center} 1267 \label{perftable} 1268 \caption{Performance Results} 1269 \end{figure*} 1270 1271 1048 1272 1049 1273 \section{Implementation} … … 1164 1388 1165 1389 1390 1391 1392 1166 1393 \section{Conclusions} 1167 1394
Note: See TracChangeset
for help on using the changeset viewer.