# Changeset 230 for docs/ASPLOS09/asplos094-cameron.tex

Ignore:
Timestamp:
Dec 10, 2008, 1:16:12 PM (11 years ago)
Message:

Further improvements to population count section.

File:
1 edited

### Legend:

Unmodified
 r229 implementation for a 32-bit integer {\tt x} adapted from Warren \cite{HackersDelight}, while Figure \ref{inductivepopcount} shows the corresponding SWAR implementation for a vector of 32-bit fields using the inductive doubling instruction set architecture.  Each implementation employs Figure \ref{inductivepopcount} shows the corresponding IDISA implementation for a vector of 32-bit fields.  Each implementation employs five steps of inductive doubling to produce population counts within 32 bit fields.  The traditional implementation employs explicit masking and shifting operations, while these operations are implicit within the semantics of the inductive doubling instructions used in Figure \ref{inductivepopcount}. doubling instructions shown in Figure \ref{inductivepopcount}. In each implementation, the first step determines the the population counts within 2-bit fields With the substitution of longer mask constants replicated for four 32-bit fields, the implementation of Figure \ref{HD-pop} can be easily adapted to SWAR processing using 128-bit registers. Such an implementation requires 10 operations to load or generate directly adapted to SWAR processing using 128-bit registers. Each binary operator is replaced by a corresponding binary SIMD operation.   Without the IDISA features, a straightforward SWAR implementation of population count for 32-bit fields thus employs 10 operations to load or generate mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a total of 30 operations to complete the task in comparison to a mere   However, Warren further refines the implementation to an optimized version requiring only 20 operations, 5 of which are required to generate mask constants.  At the cost of register pressure, it is possible that these constants could be kept preloaded.  In any case, the IDISA implementation total of 30 operations to complete the task.   Employing optimization identified by Warren, this can be reduced to 20 operations, 5 of which are required to generate mask constants. At the cost of register pressure, it is possible that these constants could be kept preloaded in long vector processing. In any case, the IDISA implementation requires only 5 operations to carry out the work requiring 15 to 20 operations with the optimized version using standard SWAR operations.   IDISA thus offers offers a 3X to 4X improvement in instruction count as well as a reduction in register pressure. The pattern illustrated by population count is typical.  An inductive doubling algorithm of $n$ steps typically applies two mask or shift operations at each step for each of the two operands being combined in the step.   If the combination can be implemented in a single SWAR operation, then a total of $3n$ operations are the minimum required for the SWAR algorithm with IDISA features.   An IDISA implementation typically eliminates the explicit mask and shift operations through appropriate half-operand modifiers, reducing the total instruction count to $n$.   Thus a 3X improvement is typical. \section{Transposition to Parallel Bit Streams}