Changeset 238

Dec 16, 2008, 7:15:17 AM (10 years ago)

Intro and evaluation methodology section

1 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r237 r238  
    150150as IDISA: inductive doubling instruction set architecture.
    152 To assess the benefits of IDISA features in comparison to
    153 those found in existing SIMD support on commodity
    154 processors, we will focus on an idealized three-register
    155 model of SIMD instruction sets and evaluation of
    156 parallel bit stream and other computational kernels with
    157 respect to these features.  The three-register model is
    158 based on the general approach to binary SIMD operations
    159 as ones that operate on the contents of two source operand
    160 registers to produce a value to be written back to a
    161 single destination operand register.  This three-register
    162 model is directly used by the Altivec SIMD instructions
    163 of the Power PC, for example.  On the Intel SSE platform,
    164 the three-register model is used as a programmer's
    165 interface for the C-language intrinsics, while the
    166 underlying instructions use a two-register model with
    167 destructive updating.  The computational kernels we study consist
    168 of 100\% branch-free code operating on registers, without
    169 memory access   For such kernels, we use straight-line
    170 instruction count as the performance metric of interest,
    171 assuming that pipelined processors are well-designed
    172 to handle latencies. 
    175152The remainder of this paper is organized as follows.
    182159is also made.
    184 The third section provides a short first example of
     161The third section then discusses the evaluation methodology
     162for IDISA.  Two reference architectures are briefly described as
     163well as criteria for evaluation IDISA against these architectures.
     165The fourth section provides a short first example of
    185166the inductive doubling principle in action through
    186167the case of population count.   Although this operation
    189170frequently enough in the general computing milieux to find
    190171its way into some instruction set architectures, typically
    191 at one particular field width.  By way of comparison, the
    192 inductive doubling architecture sacrifices some
    193 performance at the chosen field width, while offering a more
    194 general solution with frequently better performance at
    195 other field widths.
    197 The fourth section then moves on to consider the performance-critical
     172at one particular field width. 
     173%By way of comparison, the
     174%inductive doubling architecture sacrifices some
     175%performance at the chosen field width, while offering a more
     176%general solution with frequently better performance at
     177%other field widths.
     179The fifth section then moves on to consider the performance-critical
    198180and key task of conversion between serial byte streams and parallel
    199181bit streams.  A first algorithm that uses the existing SIMD
    206188architecture is considerably simpler and easier to program.
    208 The fifth section then briefly considers the inverse transposition
     190The sixth section then briefly considers the inverse transposition
    209191process, converting parallel bit stream data back into byte
    210192streams.  Again, an algorithm to carry out this task requires
    212194form, but is reduced to a simpler 24 operations using IDISA.
    214 The sixth section then goes on to consider the problem of
     196The seventh section then goes on to consider the problem of
    215197parallel bit deletion.  This operation is performance-critical
    216198to any applications that require filtering or
    226 The seventh section considers the issue of
    227 horizontal SIMD operations, that is, operations for combining
    228 sets of adjacent fields within individual SIMD registers rather than
    229 corresponding fields within sets of registers.  While existing
    230 SIMD instruction set architectures tend to only support a few
    231 ad hoc horizontal combinations, IDISA is shown to provide a systematic
    232 means for efficient horizontal combinations of any kind.
     208The eighth section then considers the potential role of
     209IDISA in supporting applications beyond parallel bit streams.
     210Additional examples from several domains are presented.
     211Perhaps most importantly, however, the role of IDISA
     212in supporting a fully systematic treatment of {\em horizontal}
     213SWAR operations for any application area is discussed.
    234215An implementation model for IDISA is then considered
    235 in section 8 of the paper, focusing on a pipelined
     216in section 9 of the paper, focusing on a pipelined
    236217SIMD architecture featuring a modified operand fetch stage.
    237218A gate-count analysis of one feasible implementation is
    243224more general alternatives provided through IDISA.
    245 The ninth section concludes the paper with a summary of results
     226The tenth section concludes the paper with a summary of results
    246227and discussion of areas for future work.
    428409on SSE.
    430 This completes the description of the proposed inductive
    431 doubling architecture.  As described, many of the features
    432 are already available with the SIMD facilities of
     411This completes the description of IDISA.  As described, many of the
     412features are already available with the SIMD facilities of
    433413existing commodity processors.  The extensions enumerated
    434414here are relatively straightforward.  The innovation
    438418can dramatically reduce instruction count in core parallel bit
    439419stream algorithms, with a factor of 3 reduction being typical.
    440 The next section goes on to illustrate a first simple example
    441 for the task of population count.
     421\section{Evaluation Methodology}
     423IDISA represents a set of instruction set features that
     424could potentially be added to any SWAR processor.  The goal
     425in this paper is to evaluate these features independent
     426of any artifacts that may be due to any particular realization,
     427while still considering realistic models based on existing
     428commodity instruction set architectures.  For the purpose
     429of IDISA evaluation, then, we define two reference architectures.
     430For concreteness, IDISA and the two reference architectures will
     431each be considered as 128-bit processors employing the
     432three-register SWAR model defined in the previous
     435Reference architecture A (RefA) consists of a limited register
     436processor providing a set of core binary operations defined
     437for 8, 16, 32 and 64 bit fields.  The core binary operations
     438will be assumed to be those defined by the SSE instruction
     439set for 16-bit fields.  In addition, we assume that
     440shift immediate operations for each field width exist,
     441e.g., \verb#simd<8>::srli<1>(x)# for a right logical
     442shift of each 8-bit field by 1.  We also assume that
     443a constant load operation \verb#simd::const<n>(c)#
     444loads the constant value $c$ into each $n$ bit field.
     446Reference architecture B (RefB) consists of a register-rich
     447processor incorporating all the operations of reference
     448architecture A as well as the following additional facilities
     449inspired by the Altivec instruction set.
     450For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
     451logical instruction \verb#simd<n>::rotl(a,b)#  rotates each field
     452of $a$ by the rotation count in the corresponding field of $b$.
     453A three-register \verb#simd<1>::if(a, b, c)# bitwise logical
     454operator implements the logic $a & b | ~a & c$, patterned
     455after the Altivec \verb:vec_sel: operation.  Finally,
     456a \verb#simd<8>::permute(a, b, c)# selects an arbitrary
     457permutation of bytes from the concatenation of $a$ and $b$ based
     458on the set of indices in $c$.
     460Two versions of IDISA are assessed against these reference
     461architectures as follows.  IDISA-A has all the facilities
     462of RefA extended with half-operand modifiers and all core
     463operations at field widths of 2, 4 and 128.  IDISA-B is
     464similarly defined and extended based on RefB.  Algorithms
     465for both RefA and IDISA-A are assessed assuming that
     466any required constants must be loaded as needed; this
     467reflects the limited register assumption.  On the other,
     468assessment for both RefB and IDISA-B will make the
     469assumption that sufficiently many registers exist that
     470constants can be kept preloaded.
     472In each case, the processors are assumed to be well-designed
     473pipeline processors with a throughput of one SWAR instruction
     474each processor cycle for straight-line code free of memory
     475access.  Such processors are certainly feasible with a
     476suitable investment in circuitry, although practical designs
     477may make sacrifices to achieve close to optimal throughput
     478with less circuitry.  However, the assumption makes for
     479straightforward performance evaluation based on instruction
     480count for straight-line computational kernels.  Furthermore,
     481the assumption also eliminates artifacts due to possibly different
     482latencies in reference and IDISA architectures.  Because
     483the same assumption is made for reference and IDISA
     484architectures, determination of the additional circuit
     485complexity due to IDISA features is unaffected by the
     488In the remainder of this paper, then, IDISA-A and IDISA-B
     489models are evaluated against their respective reference
     490architectures on straight-line computational kernels
     491used in parallel bit stream processing and other applications.
     492As XML and other sequential text processing applications
     493tend to use memory in an efficient streaming model, the
     494applications tend to be compute-bound rather than IO-bound.
     495Thus, the focus on computational kernels addresses the
     496primary concern for performance improvement of these applications.
    443498\section{Population Count}
    445 \begin{figure}
    469 \caption{Ideal SIMD Population Count}
     524\caption{IDISA Population Count}
Note: See TracChangeset for help on using the changeset viewer.