# Changeset 238

Ignore:
Timestamp:
Dec 16, 2008, 7:15:17 AM (10 years ago)
Message:

Intro and evaluation methodology section

File:
1 edited

### Legend:

Unmodified
 r237 as IDISA: inductive doubling instruction set architecture. To assess the benefits of IDISA features in comparison to those found in existing SIMD support on commodity processors, we will focus on an idealized three-register model of SIMD instruction sets and evaluation of parallel bit stream and other computational kernels with respect to these features.  The three-register model is based on the general approach to binary SIMD operations as ones that operate on the contents of two source operand registers to produce a value to be written back to a single destination operand register.  This three-register model is directly used by the Altivec SIMD instructions of the Power PC, for example.  On the Intel SSE platform, the three-register model is used as a programmer's interface for the C-language intrinsics, while the underlying instructions use a two-register model with destructive updating.  The computational kernels we study consist of 100\% branch-free code operating on registers, without memory access   For such kernels, we use straight-line instruction count as the performance metric of interest, assuming that pipelined processors are well-designed to handle latencies. The remainder of this paper is organized as follows. is also made. The third section provides a short first example of The third section then discusses the evaluation methodology for IDISA.  Two reference architectures are briefly described as well as criteria for evaluation IDISA against these architectures. The fourth section provides a short first example of the inductive doubling principle in action through the case of population count.   Although this operation frequently enough in the general computing milieux to find its way into some instruction set architectures, typically at one particular field width.  By way of comparison, the inductive doubling architecture sacrifices some performance at the chosen field width, while offering a more general solution with frequently better performance at other field widths. The fourth section then moves on to consider the performance-critical at one particular field width. %By way of comparison, the %inductive doubling architecture sacrifices some %performance at the chosen field width, while offering a more %general solution with frequently better performance at %other field widths. The fifth section then moves on to consider the performance-critical and key task of conversion between serial byte streams and parallel bit streams.  A first algorithm that uses the existing SIMD architecture is considerably simpler and easier to program. The fifth section then briefly considers the inverse transposition The sixth section then briefly considers the inverse transposition process, converting parallel bit stream data back into byte streams.  Again, an algorithm to carry out this task requires form, but is reduced to a simpler 24 operations using IDISA. The sixth section then goes on to consider the problem of The seventh section then goes on to consider the problem of parallel bit deletion.  This operation is performance-critical to any applications that require filtering or alternative. The seventh section considers the issue of horizontal SIMD operations, that is, operations for combining sets of adjacent fields within individual SIMD registers rather than corresponding fields within sets of registers.  While existing SIMD instruction set architectures tend to only support a few ad hoc horizontal combinations, IDISA is shown to provide a systematic means for efficient horizontal combinations of any kind. The eighth section then considers the potential role of IDISA in supporting applications beyond parallel bit streams. Additional examples from several domains are presented. Perhaps most importantly, however, the role of IDISA in supporting a fully systematic treatment of {\em horizontal} SWAR operations for any application area is discussed. An implementation model for IDISA is then considered in section 8 of the paper, focusing on a pipelined in section 9 of the paper, focusing on a pipelined SIMD architecture featuring a modified operand fetch stage. A gate-count analysis of one feasible implementation is more general alternatives provided through IDISA. The ninth section concludes the paper with a summary of results The tenth section concludes the paper with a summary of results and discussion of areas for future work. on SSE. This completes the description of the proposed inductive doubling architecture.  As described, many of the features are already available with the SIMD facilities of This completes the description of IDISA.  As described, many of the features are already available with the SIMD facilities of existing commodity processors.  The extensions enumerated here are relatively straightforward.  The innovation can dramatically reduce instruction count in core parallel bit stream algorithms, with a factor of 3 reduction being typical. The next section goes on to illustrate a first simple example for the task of population count. \section{Evaluation Methodology} IDISA represents a set of instruction set features that could potentially be added to any SWAR processor.  The goal in this paper is to evaluate these features independent of any artifacts that may be due to any particular realization, while still considering realistic models based on existing commodity instruction set architectures.  For the purpose of IDISA evaluation, then, we define two reference architectures. For concreteness, IDISA and the two reference architectures will each be considered as 128-bit processors employing the three-register SWAR model defined in the previous section. Reference architecture A (RefA) consists of a limited register processor providing a set of core binary operations defined for 8, 16, 32 and 64 bit fields.  The core binary operations will be assumed to be those defined by the SSE instruction set for 16-bit fields.  In addition, we assume that shift immediate operations for each field width exist, e.g., \verb#simd<8>::srli<1>(x)# for a right logical shift of each 8-bit field by 1.  We also assume that a constant load operation \verb#simd::const(c)# loads the constant value $c$ into each $n$ bit field. Reference architecture B (RefB) consists of a register-rich processor incorporating all the operations of reference architecture A as well as the following additional facilities inspired by the Altivec instruction set. For each of the 8, 16, 32 and 64 bit widths, a binary rotate left logical instruction \verb#simd::rotl(a,b)#  rotates each field of $a$ by the rotation count in the corresponding field of $b$. A three-register \verb#simd<1>::if(a, b, c)# bitwise logical operator implements the logic $a & b | ~a & c$, patterned after the Altivec \verb:vec_sel: operation.  Finally, a \verb#simd<8>::permute(a, b, c)# selects an arbitrary permutation of bytes from the concatenation of $a$ and $b$ based on the set of indices in $c$. Two versions of IDISA are assessed against these reference architectures as follows.  IDISA-A has all the facilities of RefA extended with half-operand modifiers and all core operations at field widths of 2, 4 and 128.  IDISA-B is similarly defined and extended based on RefB.  Algorithms for both RefA and IDISA-A are assessed assuming that any required constants must be loaded as needed; this reflects the limited register assumption.  On the other, assessment for both RefB and IDISA-B will make the assumption that sufficiently many registers exist that constants can be kept preloaded. In each case, the processors are assumed to be well-designed pipeline processors with a throughput of one SWAR instruction each processor cycle for straight-line code free of memory access.  Such processors are certainly feasible with a suitable investment in circuitry, although practical designs may make sacrifices to achieve close to optimal throughput with less circuitry.  However, the assumption makes for straightforward performance evaluation based on instruction count for straight-line computational kernels.  Furthermore, the assumption also eliminates artifacts due to possibly different latencies in reference and IDISA architectures.  Because the same assumption is made for reference and IDISA architectures, determination of the additional circuit complexity due to IDISA features is unaffected by the assumption. In the remainder of this paper, then, IDISA-A and IDISA-B models are evaluated against their respective reference architectures on straight-line computational kernels used in parallel bit stream processing and other applications. As XML and other sequential text processing applications tend to use memory in an efficient streaming model, the applications tend to be compute-bound rather than IO-bound. Thus, the focus on computational kernels addresses the primary concern for performance improvement of these applications. \section{Population Count} \begin{figure} \begin{figure}[h] \begin{center}\small \begin{verbatim} \end{verbatim} \end{center} \caption{Ideal SIMD Population Count} \caption{IDISA Population Count} \label{inductivepopcount} \end{figure}