Changeset 230 for docs/ASPLOS09/asplos094cameron.tex
 Timestamp:
 Dec 10, 2008, 1:16:12 PM (11 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r229 r230 490 490 implementation for a 32bit integer {\tt x} adapted from 491 491 Warren \cite{HackersDelight}, while 492 Figure \ref{inductivepopcount} shows the corresponding SWAR 493 implementation for a vector of 32bit fields using the inductive doubling 494 instruction set architecture. Each implementation employs 492 Figure \ref{inductivepopcount} shows the corresponding IDISA 493 implementation for a vector of 32bit fields. Each implementation employs 495 494 five steps of inductive doubling to produce population counts 496 495 within 32 bit fields. The traditional implementation employs 497 496 explicit masking and shifting operations, while these 498 497 operations are implicit within the semantics of the inductive 499 doubling instructions usedin Figure \ref{inductivepopcount}.498 doubling instructions shown in Figure \ref{inductivepopcount}. 500 499 In each implementation, the first step determines the 501 500 the population counts within 2bit fields … … 509 508 With the substitution of longer mask constants replicated for four 510 509 32bit fields, the implementation of Figure \ref{HDpop} can be 511 easily adapted to SWAR processing using 128bit registers. 512 Such an implementation requires 10 operations to load or generate 510 directly adapted to SWAR processing using 128bit registers. 511 Each binary operator is replaced by a corresponding binary 512 SIMD operation. Without the IDISA features, a 513 straightforward SWAR implementation of population count for 514 32bit fields thus employs 10 operations to load or generate 513 515 mask constants, 10 bitwiseand operations, 5 shifts and 5 adds for a 514 total of 30 operations to complete the task in comparison 515 to a mere However, Warren further refines the 516 implementation to an optimized version requiring only 20 operations, 517 5 of which are required to generate mask constants. At the cost 518 of register pressure, it is possible that these constants 519 could be kept preloaded. In any case, the IDISA implementation 516 total of 30 operations to complete the task. Employing 517 optimization identified by Warren, this can be reduced to 518 20 operations, 5 of which are required to generate mask constants. 519 At the cost of register pressure, it is possible that these constants 520 could be kept preloaded in long vector processing. 521 In any case, the IDISA implementation requires only 5 operations 522 to carry out the work requiring 15 to 20 operations with the 523 optimized version using standard SWAR operations. IDISA thus offers 520 524 offers a 3X to 4X improvement in instruction count as well as 521 525 a reduction in register pressure. 522 526 523 524 527 The pattern illustrated by population count is typical. An 528 inductive doubling algorithm of $n$ steps typically applies 529 two mask or shift operations at each step for each of the 530 two operands being combined in the step. If the combination 531 can be implemented in a single SWAR operation, then a total 532 of $3n$ operations are the minimum required for the SWAR 533 algorithm with IDISA features. An IDISA implementation 534 typically eliminates the explicit mask and shift operations 535 through appropriate halfoperand modifiers, reducing the 536 total instruction count to $n$. Thus a 3X improvement 537 is typical. 525 538 526 539 \section{Transposition to Parallel Bit Streams}
Note: See TracChangeset
for help on using the changeset viewer.