Changeset 245

Dec 24, 2008, 7:41:11 AM (11 years ago)

Section 9 completion; conclusions; tighten intro.

2 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r244 r245  
    152152The remainder of this paper is organized as follows.
    154153The second section of this paper introduces IDISA and the
    155 SIMD notation used throughout this paper.  A brief comparison of
    156 IDISA features with existing SIMD instruction
    157 sets of commodity processors such as the Intel SSE
    158 instruction set and the Power PC Altivec instruction set
    159 is also made.
    161 The third section then discusses the evaluation methodology
    162 for IDISA.  Two reference architectures are briefly described as
    163 well as criteria for evaluation IDISA against these architectures.
     154SIMD notation used throughout this paper.  The third
     155section moves on to discuss an evaluation methodology
     156for IDISA in comparison to two reference architectures
     157motivated by the SSE and Altivec instruction sets.
    165158The fourth section provides a short first example of
    166159the inductive doubling principle in action through
    167 the case of population count.   Although this operation
    168 is not a strong determinant of performance for parallel bit
    169 stream applications, it is nevertheless an operation needed
    170 frequently enough in the general computing milieux to find
    171 its way into some instruction set architectures, typically
    172 at one particular field width. 
    173 %By way of comparison, the
    174 %inductive doubling architecture sacrifices some
    175 %performance at the chosen field width, while offering a more
    176 %general solution with frequently better performance at
    177 %other field widths.
    179 The fifth section then moves on to consider the performance-critical
    180 and key task of conversion between serial byte streams and parallel
    181 bit streams.  A first algorithm that uses the existing SIMD
    182 operations common to SSE and Altivec is shown, requiring 72
    183 operations to transform 128 bytes of data using the three-register
    184 instruction form.  We then move on to consider how the task may
    185 be simplified using IDISA to
    186 require a mere 24 operations.  As well as providing a 3X speed-up,
    187 it is also argued that the version using the inductive doubling
    188 architecture is considerably simpler and easier to program.
    190 The sixth section then briefly considers the inverse transposition
    191 process, converting parallel bit stream data back into byte
    192 streams.  Again, an algorithm to carry out this task requires
    193 72 straight-line SIMD operations in the Altivec three-register
    194 form, but is reduced to a simpler 24 operations using IDISA.
    196 The seventh section then goes on to consider the problem of
    197 parallel bit deletion.  This operation is performance-critical
    198 to any applications that require filtering or
    199 editing operations on strings using the parallel bit stream
    200 algorithms.   For example, it is fundamental to the
    201 high-speed UTF-8 to UTF-16 transcoding algorithm that is
    202 often a critical component in XML parsing.  In this
    203 section, an inductive doubling algorithm based on the
    204 concept of a central deletion result is described and
    205 shown to have much better performance than a parallel-prefix
    206 alternative.
     160the case of population count.  Sections 5 through 7
     161then address the application of IDISA to core
     162algorithms in text processing with parallel bit
    208164The eighth section then considers the potential role of
    209165IDISA in supporting applications beyond parallel bit streams.
    210 Additional examples from several domains are presented.
    211 Perhaps most importantly, however, the role of IDISA
    212 in supporting a fully systematic treatment of {\em horizontal}
    213 SWAR operations for any application area is discussed.
    215 An implementation model for IDISA is then considered
    216 in section 9 of the paper, focusing on a pipelined
    217 SIMD architecture featuring a modified operand fetch stage.
    218 A gate-count analysis of one feasible implementation is
    219 provided as well as a discussion of the implementation
    220 of required extensions to handle 2-bit and 4-bit fields not
    221 commonly supported on existing SIMD architectures.   Design
    222 tradeoffs are also considered focusing the potential removal of
    223 various {\em ad hoc} instructions on existing processors in favor of
    224 more general alternatives provided through IDISA.
    226 The tenth section concludes the paper with a summary of results
    227 and discussion of areas for future work.
     166Section 9 addresses IDISA implementation while
     167Section 10 concludes the paper with a summary of
     168results and directions for further work.
    1359 We have carried implementation work for IDISA in three
    1360 ways.  First, we have constructed libraries that
    1361 implement the IDISA instructions by template and/or macro
    1362 expansion for each of MMX, SSE, Altivec, and SPU platforms.
    1363 Second, we have developed a model implementation
    1364 involving a modified operand fetch component
    1365 of a pipelined SIMD processor.  Third, we have written
    1366 and evaluated Verilog HDL description of this model
    1367 implementation.
     1300IDISA may be implemented as a software
     1301abstraction on top of existing SWAR architectures or
     1302directly in hardware.  In this section, we briefly
     1303discuss implementation of IDISA libraries before
     1304moving on to consider hardware design.  Although
     1305a full realization of IDISA in hardware is beyond our
     1306current capabilities, our goal is to develop a sufficiently
     1307detailed design to assess the costs of IDISA implementation
     1308in terms of the incremental complexity over the RefA
     1309and RefB architectures.  The cost factors we consider, then,
     1310are the implementation of the half-operand modifiers, and
     1311the extension of core operations to the 2-bit, 4-bit and
     1312128-bit field widths.  In each case, we also discuss
     1313design tradeoffs.
    13691315\subsection{IDISA Libraries}
    15071453or 2558 gates in total.
     1455The gate-level complexity of half-operand modifiers as described is nontrivial,
     1456but modest.   However, one possible design tradeoff is to
     1457differentiate the two operands, permitting a high half-operand
     1458modifier to be used only with the first operand and a low-modifier
     1459to be used only with the second operand.  This would exclude
     1460\verb#<h,h># and \verb#<l,l># modifier combinations and also
     1461certain combinations for noncommutative core operations. 
     1462The principal
     1463consequence for the applications considered here would be with
     1464respect to the pack operations in forward transposition, but it
     1465may be possible to address this through SIEU circuitry.
     1466If this approach were to be taken, the gate complexity of
     1467half-operand modification would be reduced by slightly more than half.
     1469\subsection{2-Bit and 4-Bit Field Widths}
     1471Beyond the half-operand modifiers, extension of core SWAR
     1472operations to 2-bit and 4-bit field widths is critical to
     1473inductive doubling support.  The principal operations
     1474that need to be supported in this way are addition, pack, merge
     1475merge, and rotate. 
     1477Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
     1478as a modification to that for 8-bit fields by incorporating logic to
     1479disable carry propagation at the 16 midfield boundaries.  For 2-bit
     1480fields, disabling carry propagation at 32 additional boundaries suffices,
     1481although it may be simpler to directly implement the simple logic
     1482of 2-bit adders.
     1484Pack and merge require bit selection logic for each field width.
     1485A straightforward implementation model for each operation
     1486uses 128 2-input and gates to select the desired bits from the
     1487operand registers and another 128 2-input or gates to feed
     1488these results into the destination register.
     1490Rotation for 2-bit fields involves simple logic for selecting
     1491between the 2 bits of each field of the operand being
     1492rotated on the basis of the low bit of each field of the
     1493rotation count.  Rotation for 4-bit fields is more complex,
     1494but can also be based on 1-of-4 selection circuitry
     1495involving the low 2 bits of the rotation count fields.
     1497\subsection{128-Bit Field Widths}
     1499For completeness, the IDISA model requires core operations
     1500to be implemented at the full register width, as well
     1501as power-of-2 partitions.  This may be problematic for
     1502operations such as addition due to the inherent delays
     1503in 128-bit carry propagation.  However, the primary
     1504role of 128 bit operations in inductive doubling
     1505is to combine two 64-bit fields using \verb#<h,l>#
     1506operand combinations.  In view of this, it may be
     1507reasonable to define hardware support for such
     1508combinations to be based on 64-bit logic, with support
     1509for 128-bit logic implemented through firmware or software.
     1511\subsection{Final Notes and Further Tradeoffs}
     1513In order to present IDISA as a concept for design extension of
     1514any SWAR architecture, our discussion of gate-level implementation
     1515is necessarily abstract.  Additional circuitry is sure to be
     1516required, for example, in implementation of SIDU.  However,
     1517in context of the 128-bit reference architectures studied,
     1518our analysis suggests realistic IDISA implementation well
     1519within a 10,000 gate cost.
     1521However, the additional circuitry required may be offset
     1522by elimination of special-purpose instructions found in
     1523existing processors that could instead be implemented through efficient
     1524IDISA sequences.  These include examples such as population
     1525count, count leading and/or trailing zeroes and parity.
     1526They also include specialized horizontal SIMD operations.
     1527Thus, design tradeoffs can be made with the potential of
     1528reducing the chip area devoted to special purpose instructions
     1529in favor of more general IDISA features.
    15151536and a realization of that principle in the form of
    15161537IDISA, a modified SWAR instruction set architecture.
    1517 IDISA features support for SWAR operations at all power-of-2
     1538IDISA offers support for SWAR operations at all power-of-2
    15181539field widths, including 2-bit and 4-bit widths, in particular,
    15191540as well as half-operand modifiers and pack and merge operations
    15271548have been examined and shown to benefit from dramatic
    15281549reductions in instruction count compared to the best
    1529 known algorithms on existing architectures.  In the case
    1530 of transposition algorithms to and from parallel bit stream
     1550known algorithms on reference architectures.  This
     1551includes both a reference architecture modeled on
     1552the SWAR capabilities of the SSE family as well as
     1553an architecture incorporating the powerful permute
     1554or shuffle capabilities found in Altivec or Cell BE processors.
     1555In the case of transposition algorithms to and from parallel bit stream
    15311556form, the architecture has been shown to make possible
    15321557straightforward inductive doubling algorithms with the
    15481573a customized operand fetch unit to implement the half-operand
    15491574modifier logic.  Gate-level implementation of this unit
    1550 has been analyzed and showed to be quite reasonable.
    1552 Many special-purpose operations that are found in existing
    1553 processors could instead be implemented through efficient
    1554 IDISA sequences.  These include examples such as population
    1555 count, count leading and/or trailing zeroes and parity.
    1556 They also include specialized horizontal SIMD operations.
    1557 Thus, design tradeoffs can be made with the potential of
    1558 reducing the chip area devoted to special purpose instructions
    1559 in favor of more general IDISA features.
    1561 Other tradeoffs may be possible in IDISA implementation itself.
    1562 Full support of IDISA features to the largest field widths
    1563 are not necessary in many cases.   For example, a given 128-bit
    1564 SIMD facility may support IDISA features only up to 32-bit
    1565 or 64-bit fields, similar to the current Altivec and SSE
    1566 architectures, respectively.   It may also be possible to
    1567 reduce the complexity of operand fetch circuitry by a factor
    1568 of two by dedicating one operand to a possible high half-operand
    1569 modifier and the other to a possible low half-operand modifier.
     1575and operations at the 2-bit and 4-bit field widths have
     1576been analyzed and found to be quite reasonable within
     1577a 10,000 gate bound.  Design tradeoffs to reduce the cost
     1578have also been discussed, possibly even leading to a
     1579net complexity reduction through elimination of instructions
     1580that implement special-case versions of inductive doubling.
    15711582Future research may consider the extension of inductive doubling
Note: See TracChangeset for help on using the changeset viewer.