# Changeset 239 for docs

Ignore:
Timestamp:
Dec 16, 2008, 8:48:39 AM (11 years ago)
Message:

Pop count evaluation

Location:
docs/ASPLOS09
Files:
2 edited

### Legend:

Unmodified
 r238 of $a$ by the rotation count in the corresponding field of $b$. A three-register \verb#simd<1>::if(a, b, c)# bitwise logical operator implements the logic $a & b | ~a & c$, patterned operator implements the logic $a \& b | ~a \& c$, patterned after the Altivec \verb:vec_sel: operation.  Finally, a \verb#simd<8>::permute(a, b, c)# selects an arbitrary primary concern for performance improvement of these applications. The additional circuity complexity to realize IDISA-A and IDISA-B designs over their reference models will be addressed in the penultimate section.  That discussion will focus primarily on the complexity of implementing half-operand modifier logic, but will also address the extension of the core operations to operate on 2-bit, 4-bit and 128-bit fields, as well. \section{Population Count} in practice, consider the problem of {\em population count}: determining the number of one bits within a particular bit field.  It is important enough for such operations as calculating Hamming distance to be included as a built-in instruction enough for such operations as calculating Hamming distance to be included as a built-in instruction on some processors.  For example, the SPU of the Cell Broadband Engine has a SIMD population count instruction \verb:si_cntb: for simultaneously Each binary operator is replaced by a corresponding binary SIMD operation.   Without the IDISA features, a straightforward SWAR implementation of population count for straightforward RefA implementation of population count for 32-bit fields thus employs 10 operations to load or generate mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a 20 operations, 5 of which are required to generate mask constants. At the cost of register pressure, it is possible that these constants could be kept preloaded in long vector processing. In any case, the IDISA implementation requires only 5 operations to carry out the work requiring 15 to 20 operations with the optimized version using standard SWAR operations.   IDISA thus offers offers a 3X to 4X improvement in instruction count as well as a reduction in register pressure. The pattern illustrated by population count is typical.  An inductive doubling algorithm of $n$ steps typically applies two mask or shift operations at each step for each of the two operands being combined in the step.   If the combination could be kept preloaded in long vector processing.  In accord with our evaluation model, the RefB cost is thus 15 operations. As the IDISA implementation requires no constants at all, both the IDISA-A and IDISA-B cost is 5 operations. At our assumed one CPU cycle per instruction model, IDISA-A offers a 4X improvement over RefA, while IDISA-B offers a 3X improvement over its comparator. The pattern illustrated by population count is typical. An inductive doubling algorithm of $n$ steps typically applies mask or shift operations at each step for each of the two operands being combined in the step.  If there is one such operation for each operand, and the combination can be implemented in a single SWAR operation, then a total of $3n$ operations are the minimum required for the SWAR algorithm with IDISA features.   An IDISA implementation typically eliminates the explicit mask and shift operations of $3n$ operations are the required in a RefB implementation, and possibly $4n$ for a RefA implementation including the cost of loading masks.  IDISA-A and IDISA-B implementations typically eliminate the explicit mask and shift operations through appropriate half-operand modifiers, reducing the total instruction count to $n$.   Thus a 3X improvement is typical. total instruction count to $n$.   Thus a 3X to 4X improvement obtains in these cases. \section{Transposition to Parallel Bit Streams}