Changeset 239 for docs/ASPLOS09


Ignore:
Timestamp:
Dec 16, 2008, 8:48:39 AM (10 years ago)
Author:
cameron
Message:

Pop count evaluation

Location:
docs/ASPLOS09
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/ASPLOS09/asplos094-cameron.tex

    r238 r239  
    452452of $a$ by the rotation count in the corresponding field of $b$.
    453453A three-register \verb#simd<1>::if(a, b, c)# bitwise logical
    454 operator implements the logic $a & b | ~a & c$, patterned
     454operator implements the logic $a \& b | ~a \& c$, patterned
    455455after the Altivec \verb:vec_sel: operation.  Finally,
    456456a \verb#simd<8>::permute(a, b, c)# selects an arbitrary
     
    496496primary concern for performance improvement of these applications.
    497497
     498The additional circuity complexity to realize IDISA-A and
     499IDISA-B designs over their reference models will be
     500addressed in the penultimate section.  That discussion
     501will focus primarily on the complexity of implementing
     502half-operand modifier logic, but will also address
     503the extension of the core operations to operate on
     5042-bit, 4-bit and 128-bit fields, as well.
     505
    498506\section{Population Count}
    499507
     
    529537in practice, consider the problem of {\em population count}: determining
    530538the number of one bits within a particular bit field.  It is important
    531 enough for such operations as
    532 calculating Hamming distance to be included as a built-in instruction
     539enough for such operations as calculating Hamming distance to be included
     540as a built-in instruction
    533541on some processors.  For example, the SPU of the Cell Broadband Engine
    534542has a SIMD population count instruction \verb:si_cntb: for simultaneously
     
    566574Each binary operator is replaced by a corresponding binary
    567575SIMD operation.   Without the IDISA features, a
    568 straightforward SWAR implementation of population count for
     576straightforward RefA implementation of population count for
    56957732-bit fields thus employs 10 operations to load or generate
    570578mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
     
    57358120 operations, 5 of which are required to generate mask constants.
    574582At the cost of register pressure, it is possible that these constants
    575 could be kept preloaded in long vector processing.
    576 In any case, the IDISA implementation requires only 5 operations
    577 to carry out the work requiring 15 to 20 operations with the
    578 optimized version using standard SWAR operations.   IDISA thus offers
    579 offers a 3X to 4X improvement in instruction count as well as
    580 a reduction in register pressure.
    581 
    582 The pattern illustrated by population count is typical.  An
    583 inductive doubling algorithm of $n$ steps typically applies
    584 two mask or shift operations at each step for each of the
    585 two operands being combined in the step.   If the combination
     583could be kept preloaded in long vector processing.  In accord
     584with our evaluation model, the RefB cost is thus 15 operations.
     585As the IDISA implementation requires no constants at all,
     586both the IDISA-A and IDISA-B cost is 5 operations.
     587At our assumed one CPU cycle per instruction model, IDISA-A
     588offers a 4X improvement over RefA, while IDISA-B offers a 3X
     589improvement over its comparator.
     590
     591The pattern illustrated by population count is typical.
     592An inductive doubling algorithm of $n$ steps typically applies
     593mask or shift operations at each step for each of the
     594two operands being combined in the step.  If there is
     595one such operation for each operand, and the combination
    586596can be implemented in a single SWAR operation, then a total
    587 of $3n$ operations are the minimum required for the SWAR
    588 algorithm with IDISA features.   An IDISA implementation
    589 typically eliminates the explicit mask and shift operations
     597of $3n$ operations are the required in a RefB implementation,
     598and possibly $4n$ for a RefA implementation including the
     599cost of loading masks.  IDISA-A and IDISA-B implementations
     600typically eliminate the explicit mask and shift operations
    590601through appropriate half-operand modifiers, reducing the
    591 total instruction count to $n$.   Thus a 3X improvement
    592 is typical.
     602total instruction count to $n$.   Thus a 3X to 4X improvement
     603obtains in these cases.
    593604
    594605\section{Transposition to Parallel Bit Streams}
Note: See TracChangeset for help on using the changeset viewer.