 Timestamp:
 Dec 16, 2008, 8:48:39 AM (11 years ago)
 Location:
 docs/ASPLOS09
 Files:

 2 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r238 r239 452 452 of $a$ by the rotation count in the corresponding field of $b$. 453 453 A threeregister \verb#simd<1>::if(a, b, c)# bitwise logical 454 operator implements the logic $a & b  ~a& c$, patterned454 operator implements the logic $a \& b  ~a \& c$, patterned 455 455 after the Altivec \verb:vec_sel: operation. Finally, 456 456 a \verb#simd<8>::permute(a, b, c)# selects an arbitrary … … 496 496 primary concern for performance improvement of these applications. 497 497 498 The additional circuity complexity to realize IDISAA and 499 IDISAB designs over their reference models will be 500 addressed in the penultimate section. That discussion 501 will focus primarily on the complexity of implementing 502 halfoperand modifier logic, but will also address 503 the extension of the core operations to operate on 504 2bit, 4bit and 128bit fields, as well. 505 498 506 \section{Population Count} 499 507 … … 529 537 in practice, consider the problem of {\em population count}: determining 530 538 the number of one bits within a particular bit field. It is important 531 enough for such operations as 532 calculating Hamming distance to be includedas a builtin instruction539 enough for such operations as calculating Hamming distance to be included 540 as a builtin instruction 533 541 on some processors. For example, the SPU of the Cell Broadband Engine 534 542 has a SIMD population count instruction \verb:si_cntb: for simultaneously … … 566 574 Each binary operator is replaced by a corresponding binary 567 575 SIMD operation. Without the IDISA features, a 568 straightforward SWARimplementation of population count for576 straightforward RefA implementation of population count for 569 577 32bit fields thus employs 10 operations to load or generate 570 578 mask constants, 10 bitwiseand operations, 5 shifts and 5 adds for a … … 573 581 20 operations, 5 of which are required to generate mask constants. 574 582 At the cost of register pressure, it is possible that these constants 575 could be kept preloaded in long vector processing. 576 In any case, the IDISA implementation requires only 5 operations 577 to carry out the work requiring 15 to 20 operations with the 578 optimized version using standard SWAR operations. IDISA thus offers 579 offers a 3X to 4X improvement in instruction count as well as 580 a reduction in register pressure. 581 582 The pattern illustrated by population count is typical. An 583 inductive doubling algorithm of $n$ steps typically applies 584 two mask or shift operations at each step for each of the 585 two operands being combined in the step. If the combination 583 could be kept preloaded in long vector processing. In accord 584 with our evaluation model, the RefB cost is thus 15 operations. 585 As the IDISA implementation requires no constants at all, 586 both the IDISAA and IDISAB cost is 5 operations. 587 At our assumed one CPU cycle per instruction model, IDISAA 588 offers a 4X improvement over RefA, while IDISAB offers a 3X 589 improvement over its comparator. 590 591 The pattern illustrated by population count is typical. 592 An inductive doubling algorithm of $n$ steps typically applies 593 mask or shift operations at each step for each of the 594 two operands being combined in the step. If there is 595 one such operation for each operand, and the combination 586 596 can be implemented in a single SWAR operation, then a total 587 of $3n$ operations are the minimum required for the SWAR 588 algorithm with IDISA features. An IDISA implementation 589 typically eliminates the explicit mask and shift operations 597 of $3n$ operations are the required in a RefB implementation, 598 and possibly $4n$ for a RefA implementation including the 599 cost of loading masks. IDISAA and IDISAB implementations 600 typically eliminate the explicit mask and shift operations 590 601 through appropriate halfoperand modifiers, reducing the 591 total instruction count to $n$. Thus a 3X improvement592 is typical.602 total instruction count to $n$. Thus a 3X to 4X improvement 603 obtains in these cases. 593 604 594 605 \section{Transposition to Parallel Bit Streams}
Note: See TracChangeset
for help on using the changeset viewer.