 Timestamp:
 Dec 24, 2008, 7:41:11 AM (11 years ago)
 Location:
 docs/ASPLOS09
 Files:

 2 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r244 r245 151 151 152 152 The remainder of this paper is organized as follows. 153 154 153 The second section of this paper introduces IDISA and the 155 SIMD notation used throughout this paper. A brief comparison of 156 IDISA features with existing SIMD instruction 157 sets of commodity processors such as the Intel SSE 158 instruction set and the Power PC Altivec instruction set 159 is also made. 160 161 The third section then discusses the evaluation methodology 162 for IDISA. Two reference architectures are briefly described as 163 well as criteria for evaluation IDISA against these architectures. 164 154 SIMD notation used throughout this paper. The third 155 section moves on to discuss an evaluation methodology 156 for IDISA in comparison to two reference architectures 157 motivated by the SSE and Altivec instruction sets. 165 158 The fourth section provides a short first example of 166 159 the inductive doubling principle in action through 167 the case of population count. Although this operation 168 is not a strong determinant of performance for parallel bit 169 stream applications, it is nevertheless an operation needed 170 frequently enough in the general computing milieux to find 171 its way into some instruction set architectures, typically 172 at one particular field width. 173 %By way of comparison, the 174 %inductive doubling architecture sacrifices some 175 %performance at the chosen field width, while offering a more 176 %general solution with frequently better performance at 177 %other field widths. 178 179 The fifth section then moves on to consider the performancecritical 180 and key task of conversion between serial byte streams and parallel 181 bit streams. A first algorithm that uses the existing SIMD 182 operations common to SSE and Altivec is shown, requiring 72 183 operations to transform 128 bytes of data using the threeregister 184 instruction form. We then move on to consider how the task may 185 be simplified using IDISA to 186 require a mere 24 operations. As well as providing a 3X speedup, 187 it is also argued that the version using the inductive doubling 188 architecture is considerably simpler and easier to program. 189 190 The sixth section then briefly considers the inverse transposition 191 process, converting parallel bit stream data back into byte 192 streams. Again, an algorithm to carry out this task requires 193 72 straightline SIMD operations in the Altivec threeregister 194 form, but is reduced to a simpler 24 operations using IDISA. 195 196 The seventh section then goes on to consider the problem of 197 parallel bit deletion. This operation is performancecritical 198 to any applications that require filtering or 199 editing operations on strings using the parallel bit stream 200 algorithms. For example, it is fundamental to the 201 highspeed UTF8 to UTF16 transcoding algorithm that is 202 often a critical component in XML parsing. In this 203 section, an inductive doubling algorithm based on the 204 concept of a central deletion result is described and 205 shown to have much better performance than a parallelprefix 206 alternative. 207 160 the case of population count. Sections 5 through 7 161 then address the application of IDISA to core 162 algorithms in text processing with parallel bit 163 streams. 208 164 The eighth section then considers the potential role of 209 165 IDISA in supporting applications beyond parallel bit streams. 210 Additional examples from several domains are presented. 211 Perhaps most importantly, however, the role of IDISA 212 in supporting a fully systematic treatment of {\em horizontal} 213 SWAR operations for any application area is discussed. 214 215 An implementation model for IDISA is then considered 216 in section 9 of the paper, focusing on a pipelined 217 SIMD architecture featuring a modified operand fetch stage. 218 A gatecount analysis of one feasible implementation is 219 provided as well as a discussion of the implementation 220 of required extensions to handle 2bit and 4bit fields not 221 commonly supported on existing SIMD architectures. Design 222 tradeoffs are also considered focusing the potential removal of 223 various {\em ad hoc} instructions on existing processors in favor of 224 more general alternatives provided through IDISA. 225 226 The tenth section concludes the paper with a summary of results 227 and discussion of areas for future work. 166 Section 9 addresses IDISA implementation while 167 Section 10 concludes the paper with a summary of 168 results and directions for further work. 228 169 229 170 … … 1357 1298 \section{Implementation} 1358 1299 1359 We have carried implementation work for IDISA in three 1360 ways. First, we have constructed libraries that 1361 implement the IDISA instructions by template and/or macro 1362 expansion for each of MMX, SSE, Altivec, and SPU platforms. 1363 Second, we have developed a model implementation 1364 involving a modified operand fetch component 1365 of a pipelined SIMD processor. Third, we have written 1366 and evaluated Verilog HDL description of this model 1367 implementation. 1300 IDISA may be implemented as a software 1301 abstraction on top of existing SWAR architectures or 1302 directly in hardware. In this section, we briefly 1303 discuss implementation of IDISA libraries before 1304 moving on to consider hardware design. Although 1305 a full realization of IDISA in hardware is beyond our 1306 current capabilities, our goal is to develop a sufficiently 1307 detailed design to assess the costs of IDISA implementation 1308 in terms of the incremental complexity over the RefA 1309 and RefB architectures. The cost factors we consider, then, 1310 are the implementation of the halfoperand modifiers, and 1311 the extension of core operations to the 2bit, 4bit and 1312 128bit field widths. In each case, we also discuss 1313 design tradeoffs. 1368 1314 1369 1315 \subsection{IDISA Libraries} … … 1507 1453 or 2558 gates in total. 1508 1454 1455 The gatelevel complexity of halfoperand modifiers as described is nontrivial, 1456 but modest. However, one possible design tradeoff is to 1457 differentiate the two operands, permitting a high halfoperand 1458 modifier to be used only with the first operand and a lowmodifier 1459 to be used only with the second operand. This would exclude 1460 \verb#<h,h># and \verb#<l,l># modifier combinations and also 1461 certain combinations for noncommutative core operations. 1462 The principal 1463 consequence for the applications considered here would be with 1464 respect to the pack operations in forward transposition, but it 1465 may be possible to address this through SIEU circuitry. 1466 If this approach were to be taken, the gate complexity of 1467 halfoperand modification would be reduced by slightly more than half. 1468 1469 \subsection{2Bit and 4Bit Field Widths} 1470 1471 Beyond the halfoperand modifiers, extension of core SWAR 1472 operations to 2bit and 4bit field widths is critical to 1473 inductive doubling support. The principal operations 1474 that need to be supported in this way are addition, pack, merge 1475 merge, and rotate. 1476 1477 Addition for 4bit fields in a 128bit SWAR processor may be implemented 1478 as a modification to that for 8bit fields by incorporating logic to 1479 disable carry propagation at the 16 midfield boundaries. For 2bit 1480 fields, disabling carry propagation at 32 additional boundaries suffices, 1481 although it may be simpler to directly implement the simple logic 1482 of 2bit adders. 1483 1484 Pack and merge require bit selection logic for each field width. 1485 A straightforward implementation model for each operation 1486 uses 128 2input and gates to select the desired bits from the 1487 operand registers and another 128 2input or gates to feed 1488 these results into the destination register. 1489 1490 Rotation for 2bit fields involves simple logic for selecting 1491 between the 2 bits of each field of the operand being 1492 rotated on the basis of the low bit of each field of the 1493 rotation count. Rotation for 4bit fields is more complex, 1494 but can also be based on 1of4 selection circuitry 1495 involving the low 2 bits of the rotation count fields. 1496 1497 \subsection{128Bit Field Widths} 1498 1499 For completeness, the IDISA model requires core operations 1500 to be implemented at the full register width, as well 1501 as powerof2 partitions. This may be problematic for 1502 operations such as addition due to the inherent delays 1503 in 128bit carry propagation. However, the primary 1504 role of 128 bit operations in inductive doubling 1505 is to combine two 64bit fields using \verb#<h,l># 1506 operand combinations. In view of this, it may be 1507 reasonable to define hardware support for such 1508 combinations to be based on 64bit logic, with support 1509 for 128bit logic implemented through firmware or software. 1510 1511 \subsection{Final Notes and Further Tradeoffs} 1512 1513 In order to present IDISA as a concept for design extension of 1514 any SWAR architecture, our discussion of gatelevel implementation 1515 is necessarily abstract. Additional circuitry is sure to be 1516 required, for example, in implementation of SIDU. However, 1517 in context of the 128bit reference architectures studied, 1518 our analysis suggests realistic IDISA implementation well 1519 within a 10,000 gate cost. 1520 1521 However, the additional circuitry required may be offset 1522 by elimination of specialpurpose instructions found in 1523 existing processors that could instead be implemented through efficient 1524 IDISA sequences. These include examples such as population 1525 count, count leading and/or trailing zeroes and parity. 1526 They also include specialized horizontal SIMD operations. 1527 Thus, design tradeoffs can be made with the potential of 1528 reducing the chip area devoted to special purpose instructions 1529 in favor of more general IDISA features. 1509 1530 1510 1531 \section{Conclusions} … … 1515 1536 and a realization of that principle in the form of 1516 1537 IDISA, a modified SWAR instruction set architecture. 1517 IDISA features support for SWAR operations at all powerof21538 IDISA offers support for SWAR operations at all powerof2 1518 1539 field widths, including 2bit and 4bit widths, in particular, 1519 1540 as well as halfoperand modifiers and pack and merge operations … … 1527 1548 have been examined and shown to benefit from dramatic 1528 1549 reductions in instruction count compared to the best 1529 known algorithms on existing architectures. In the case 1530 of transposition algorithms to and from parallel bit stream 1550 known algorithms on reference architectures. This 1551 includes both a reference architecture modeled on 1552 the SWAR capabilities of the SSE family as well as 1553 an architecture incorporating the powerful permute 1554 or shuffle capabilities found in Altivec or Cell BE processors. 1555 In the case of transposition algorithms to and from parallel bit stream 1531 1556 form, the architecture has been shown to make possible 1532 1557 straightforward inductive doubling algorithms with the … … 1548 1573 a customized operand fetch unit to implement the halfoperand 1549 1574 modifier logic. Gatelevel implementation of this unit 1550 has been analyzed and showed to be quite reasonable. 1551 1552 Many specialpurpose operations that are found in existing 1553 processors could instead be implemented through efficient 1554 IDISA sequences. These include examples such as population 1555 count, count leading and/or trailing zeroes and parity. 1556 They also include specialized horizontal SIMD operations. 1557 Thus, design tradeoffs can be made with the potential of 1558 reducing the chip area devoted to special purpose instructions 1559 in favor of more general IDISA features. 1560 1561 Other tradeoffs may be possible in IDISA implementation itself. 1562 Full support of IDISA features to the largest field widths 1563 are not necessary in many cases. For example, a given 128bit 1564 SIMD facility may support IDISA features only up to 32bit 1565 or 64bit fields, similar to the current Altivec and SSE 1566 architectures, respectively. It may also be possible to 1567 reduce the complexity of operand fetch circuitry by a factor 1568 of two by dedicating one operand to a possible high halfoperand 1569 modifier and the other to a possible low halfoperand modifier. 1575 and operations at the 2bit and 4bit field widths have 1576 been analyzed and found to be quite reasonable within 1577 a 10,000 gate bound. Design tradeoffs to reduce the cost 1578 have also been discussed, possibly even leading to a 1579 net complexity reduction through elimination of instructions 1580 that implement specialcase versions of inductive doubling. 1570 1581 1571 1582 Future research may consider the extension of inductive doubling
Note: See TracChangeset
for help on using the changeset viewer.