Changeset 248 for docs/ASPLOS09
 Timestamp:
 Dec 26, 2008, 6:17:28 AM (10 years ago)
 Location:
 docs/ASPLOS09
 Files:

 2 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r247 r248 186 186 register $r$, using bigendian numbering. 187 187 188 Throughout this paper, we focus on 189 SWAR implementations using the following {\em 190 threeregister model}, although it is also possible to define 191 IDISA extensions for a tworegister model of the SSE instruction set. 188 192 Let \verb:simd<n>: represent the class of SIMD operations defined 189 193 on fields of size $n$ using C++ template syntax. Given a … … 234 238 doubling algorithms on bit streams, the provision of 235 239 a core set of operations at field widths of 2 and 4 as 236 well as the more traditional field wi tdhs of 8, 16 and 32240 well as the more traditional field widths of 8, 16 and 32 237 241 is critical for elegant and efficient expression of the 238 242 algorithms. In essence, inductive doubling algorithms … … 569 573 \subsection{Bit Gathering Algorithm} 570 574 571 \begin{figure}[tbh]572 \begin{center}573 \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}574 \caption{Serial to Parallel Transposition Using BitGathering}575 \label{gather}576 \end{center}577 \end{figure}575 % \begin{figure}[tbh] 576 % \begin{center} 577 % \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf} 578 % \caption{Serial to Parallel Transposition Using BitGathering} 579 % \label{gather} 580 % \end{center} 581 % \end{figure} 578 582 One straightforward algorithm for implementing the transposition process 579 583 takes advantage of SWAR bit gathering operations that exist 580 584 on some architectures. This operation gathers one bit per byte 581 585 from a particular position within each byte of a SIMD register. 582 For example, the {\tt pmovmskb} operation of the Intel MMX and583 SSE instruction set s forms an 8bit (MMX) or 16bit (SSE)mask586 For example, the {\tt pmovmskb} operation of the Intel 587 SSE instruction set forms a 16bit mask 584 588 consisting of the high bit of each byte. Similarly, the 585 589 {\tt \verb:si_gbb:} operation of the synergistic processing units of the 586 590 Cell Broadband Engine gathers together the low bit of each byte 587 from the SIMD register. Figure \ref{gather} illustrates the 588 bit gathering process. 589 590 For each bit stream of output, the bit gather algorithm requires 591 one gather operation for each of the 8 input registers, 592 giving 64 bit gather operations in all. In addition, for seven 593 of the eight bit positions, it is necessary to shift the bits 594 of each input register into the conventional position for 595 gathering. A total of 56 shift operations are required for this 596 task. Finally, the result of each bit gather operation must 597 be properly inserted into the output stream. If the first 598 gather operation for a stream can be used to also initialize 599 the output register, there will then need to be 7 insert 600 operations for the results of the remaining gather operations 601 for this stream, making 56 insert operations in all. 602 The insert step may be more complex than a single operation 603 in some cases, but we use one operation per insert as a lower bound. 604 Thus, the bit gather algorithm requires 605 at least 176 operations to perform the transposition task. 591 from the SIMD register. 592 % Figure \ref{gather} illustrates the 593 % bit gathering process. 594 595 Using bit gathering, each bit stream of output is assembled 16 positions 596 at a time. Bits from the input register must be shifted into 597 position, the gather operation performed and the result inserted 598 into position in the output register. For the 8 streams, this 599 requires at least 22 operations for 16 positions, or 176 operations 600 to complete the transposition task. 606 601 607 602 \subsection{BytePack Algorithm} 608 603 609 A m uch more efficient transposition algorithm on commodity604 A more efficient transposition algorithm on commodity 610 605 SWAR architectures involves three 611 606 stages of binary division transformation. This is similar … … 998 993 \section{Beyond Parallel Bit Streams} 999 994 1000 It can be argued that text processing with parallel bit 1001 streams by itself is not sufficiently important to justify 1002 the circuit complexity of IDISA. In this section, then 1003 we move on to consider further applications that may 1004 benefit from IDISA features. These include additional 1005 basic inductive doubling 1006 algorithms such as parity, and least powerof2 ceiling. 1007 Further applications for narrow field widths are discussed 1008 as well, such as packed DNA representations using 2bit 1009 field widths and packed decimal representations using 4bit 1010 field widths. Most significantly, however, we show that IDISA offers 1011 a fully general solution to the problem of supporting 995 IDISA has a variety of applications in domains beyond 996 text processing with parallel bit streams. We present 997 a number of examples in this section, including, 998 most significantly, a full general solution to the problem of supporting 1012 999 {\em horizontal} SWAR operations. 1013 1000 1014 \subsection{ Inductive Doubling with Logic Operations}1001 \subsection{Parity} 1015 1002 1016 1003 \begin{figure} … … 1032 1019 \begin{center}\small 1033 1020 \begin{verbatim} 1034 x = x  1;1035 x = x  (x >> 1);1036 x = x  (x >> 2);1037 x = x  (x >> 4);1038 x = x  (x >> 8);1039 x = x  (x >>16);1040 x = x + 1;1041 \end{verbatim}1042 \end{center}1043 \caption{Powerof2 Ceiling Reference Algorithm}1044 \label{HDclp2}1045 \end{figure}1046 1047 \begin{figure}1048 \begin{center}\small1049 \begin{verbatim}1050 1021 y = simd<2>::xor<h,l>(x, x); 1051 1022 y = simd<4>::xor<h,l>(y, y); … … 1059 1030 \end{figure} 1060 1031 1061 \begin{figure} 1062 \begin{center}\small 1063 \begin{verbatim} 1064 x = simd<32>::sub(x, simd<32>::const(1)); 1065 x = simd<2>::or<x, h>(x, x); 1066 x = simd<4>::or<x, h>(x, x); 1067 x = simd<8>::or<x, h>(x, x); 1068 x = simd<16>::or<x, h>(x, x); 1069 x = simd<32>::or<x, h>(x, x); 1070 x = simd<32>::add(x, simd<32>::const(1)); 1071 \end{verbatim} 1072 \end{center} 1073 \caption{IDISA Powerof2 Ceiling Implementation} 1074 \label{IDclp2} 1075 \end{figure} 1076 1077 Figures \ref{HDparity} and \ref{HDclp2} show Warren's 1078 code for two further inductive doubling examples using 1079 logical operators as the combining feature. In the 1080 first of these, the ``exclusive or'' operation is applied 1081 to all bits in a 32bit value to determine parity. 1032 Figures \ref{HDparity} shows Warren's for parity 1033 employing 11 operations, giving rise to a straightforward 1034 RefA implementation also using 11 operations. 1082 1035 Parity has important applications for errorcorrecting 1083 1036 codes such as the various Hamming codes for detecting 1084 1037 and correcting numbers of bit errors dependent on the 1085 number of parity bits added. Warren's version uses 1086 11 operations for parity of a single 32bit value; 1087 a straightforward SWAR adaptation also uses 11 operations 1088 for the parity of a set of 32bit fields. 1038 number of parity bits added. 1089 1039 1090 1040 Figure \ref{IDparity} shows the IDISA parity implementation … … 1093 1043 3X seen in other cases because one of the operands need 1094 1044 not be modified before applying the exclusiveor operation. 1095 1096 Using the ``or'' logical operation, Figure \ref{HDclp2} shows Warren's1097 code for the least powerof2 ceiling of a 32bit value. The1098 technique is to employ inductive doubling to fill in one bits at1099 all bit positions after the first one bit in the original value to1100 first produce a value of the form $2^n1$. This is then incremented1101 to determine the power of 2 ceiling. This function and1102 its straightforward adaptation for SWAR application on a1103 set of 32bit fields requires 12 operations. The1104 corresponding IDISA implementation of Figure \ref{IDclp2}1105 requires 7 operations, just under a 2X improvement.1106 1045 1107 1046 \subsection{Packed DNA Representation} … … 1125 1064 \subsection{Bit Reverse} 1126 1065 1127 Include or omit? 1066 Bit reverse is an important operation needed in a number 1067 of low level codecs. Following Warren's inductive 1068 doubling implementation using masks and shifts \cite{HackersDelight}, 1069 a RefA implementation on 32bit fields requires 28 1070 operations, while a straightforward IDISAA implementation 1071 using \verb#simd<n>::rotli# at each inductive doubling 1072 level requires only 5 operations. 1128 1073 1129 1074 \subsection{String/Decimal/Integer Conversion} … … 1193 1138 1194 1139 1195 1140 \subsection{Pixel Packing} 1141 1142 Pixel packing from 32 bit fields into a 5:5:5 representation 1143 is a further application of parallel bit deletion. 1196 1144 1197 1145 \subsection{Systematic Support for Horizontal SIMD Operations} … … 1537 1485 In the case of transposition algorithms to and from parallel bit stream 1538 1486 form, the architecture has been shown to make possible 1539 straightforward inductive doubling algorithms with the 1540 lowest total number of operations that can be achieved by 1541 any possible 3register SWAR architecture. 1487 straightforward inductive doubling algorithms with a 3X 1488 speedup over the best known versions on permutecapable 1489 reference architectures, achieving the lowest total number 1490 of operations of any possible 3register SWAR architecture. 1542 1491 1543 1492 Applications of IDISA in other areas have also been … … 1565 1514 support in further ways. For example, it may be possible 1566 1515 to develop a pipelined architecture supporting two or three 1567 steps of inductive doubling in a single operation. 1516 steps of inductive doubling in a single operation. 1568 1517 1569 1518 %\appendix
Note: See TracChangeset
for help on using the changeset viewer.