Changeset 230 for docs


Ignore:
Timestamp:
Dec 10, 2008, 1:16:12 PM (11 years ago)
Author:
cameron
Message:

Further improvements to population count section.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/ASPLOS09/asplos094-cameron.tex

    r229 r230  
    490490implementation for a 32-bit integer {\tt x} adapted from
    491491Warren \cite{HackersDelight}, while
    492 Figure \ref{inductivepopcount} shows the corresponding SWAR
    493 implementation for a vector of 32-bit fields using the inductive doubling
    494 instruction set architecture.  Each implementation employs
     492Figure \ref{inductivepopcount} shows the corresponding IDISA
     493implementation for a vector of 32-bit fields.  Each implementation employs
    495494five steps of inductive doubling to produce population counts
    496495within 32 bit fields.  The traditional implementation employs
    497496explicit masking and shifting operations, while these
    498497operations are implicit within the semantics of the inductive
    499 doubling instructions used in Figure \ref{inductivepopcount}.
     498doubling instructions shown in Figure \ref{inductivepopcount}.
    500499In each implementation, the first step determines the
    501500the population counts within 2-bit fields
     
    509508With the substitution of longer mask constants replicated for four
    51050932-bit fields, the implementation of Figure \ref{HD-pop} can be
    511 easily adapted to SWAR processing using 128-bit registers.
    512 Such an implementation requires 10 operations to load or generate
     510directly adapted to SWAR processing using 128-bit registers.
     511Each binary operator is replaced by a corresponding binary
     512SIMD operation.   Without the IDISA features, a
     513straightforward SWAR implementation of population count for
     51432-bit fields thus employs 10 operations to load or generate
    513515mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
    514 total of 30 operations to complete the task in comparison
    515 to a mere   However, Warren further refines the
    516 implementation to an optimized version requiring only 20 operations,
    517 5 of which are required to generate mask constants.  At the cost
    518 of register pressure, it is possible that these constants
    519 could be kept preloaded.  In any case, the IDISA implementation
     516total of 30 operations to complete the task.   Employing
     517optimization identified by Warren, this can be reduced to
     51820 operations, 5 of which are required to generate mask constants.
     519At the cost of register pressure, it is possible that these constants
     520could be kept preloaded in long vector processing.
     521In any case, the IDISA implementation requires only 5 operations
     522to carry out the work requiring 15 to 20 operations with the
     523optimized version using standard SWAR operations.   IDISA thus offers
    520524offers a 3X to 4X improvement in instruction count as well as
    521525a reduction in register pressure.
    522526
    523 
    524 
     527The pattern illustrated by population count is typical.  An
     528inductive doubling algorithm of $n$ steps typically applies
     529two mask or shift operations at each step for each of the
     530two operands being combined in the step.   If the combination
     531can be implemented in a single SWAR operation, then a total
     532of $3n$ operations are the minimum required for the SWAR
     533algorithm with IDISA features.   An IDISA implementation
     534typically eliminates the explicit mask and shift operations
     535through appropriate half-operand modifiers, reducing the
     536total instruction count to $n$.   Thus a 3X improvement
     537is typical.
    525538
    526539\section{Transposition to Parallel Bit Streams}
Note: See TracChangeset for help on using the changeset viewer.