# Changeset 228 for docs/ASPLOS09/asplos094-cameron.tex

Timestamp:
Dec 8, 2008, 12:50:08 PM (11 years ago)
Message:

IDISA implementation: libraries; model.

File:
1 edited

r227
\section{Implementation}

We have constructed libraries that provide simulated implementations of the inductive doubling architecture on each of the MMX, SSE, Altivec, and SPU platforms and have used these libraries in the implementation of each of the parallel bit stream algorithms discussed herein. This implementation work has been successful in validating the basic concepts underlying the inductive doubling instruction set architecture.

Implementation of the architecture on chip is beyond the scope of our present resources and capabilities.  However, the principal requirements are implementation of the various operations at all power-of-2 field widths and implementation of half-operand modifiers.

Implementation of SIMD operations at additional field widths involves design trade-offs with respect to transistor counts, available opcode space, and the potential value of the new operations to SIMD programmers.  From the perspective of parallel bit stream programming, the primary need is for SIMD integer, shift, pack and merge operations at field widths of 2, 4 and 8, as well as the field width of 1, where it makes sense (e.g., with merge operations).  In support of the general concept of inductive doubling architecture, SIMD operations at large field widths (64, 128) are also called for, but these operations cannot be justified on the basis of parallel bit stream programming.

Implementation of half-operand modifiers can logically be carried out with additional circuitry attached to the register fetch units of a pipelined processor.  This circuitry would require control signals from the instruction decode unit to identify the field widths of operands and the particular half-operand modifier to be applied, if any.  The additional logic required for instruction decode and for operand modification as part of the operand fetch process is expected to be reasonably modest.

Full assessment of implementation issues is an important area for future work.
We have carried out implementation work for IDISA in three ways.  First, we have constructed libraries that implement the IDISA instructions by template and/or macro expansion for each of the MMX, SSE, Altivec, and SPU platforms. Second, we have developed a model implementation involving a modified operand fetch component of a pipelined SIMD processor.  Third, we have written and evaluated a Verilog HDL description of this model implementation.

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template and macro libraries has been useful in developing and assessing the correctness of many of the algorithms presented here.  Although these implementations do not deliver the performance benefits associated with direct hardware implementation of IDISA, they have been quite useful in providing a practical means for portable implementation of parallel bit stream algorithms on multiple SWAR architectures.

However, one additional facility has also proven necessary for portability of parallel bit stream algorithms across big-endian and little-endian architectures: the notion of shift-forward and shift-back operations. In essence, shift forward means shift to the left on little-endian systems and shift to the right on big-endian systems, while shift back has the reverse interpretation.  Although this concept is unrelated to inductive doubling, its inclusion with the IDISA libraries has provided a suitable basis for portable SIMD implementations of parallel bit stream algorithms.

Beyond this, the IDISA libraries have the additional benefit of allowing the implementation of inductive doubling algorithms at a higher level of abstraction, without the need for the programmer to code the underlying shift and mask operations.

\subsection{IDISA Model}

Figure \ref{pipeline-model} shows a model architecture for a pipelined SIMD processor implementing IDISA. The SIMD Register File (SRF) provides a file of $R = 2^A$ registers, each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch Unit (IFU) are forwarded for decoding to the SIMD Instruction Decode Unit (SIDU).  This unit decodes the instruction to produce signals identifying the source and destination operand registers, the half-operand modifiers, the field width specification and the SIMD operation to be applied.

The SIDU supplies the source register information and the half-operand modifier information to the SIMD Operand Fetch Unit (SOFU). For each source operand, the SIDU provides an $A$-bit register address and two 1-bit signals $h$ and $l$ indicating the value of the decoded half-operand modifiers for this operand. At most one of these signals may be 1; both are 0 if no modifier is specified. In addition, the SIDU supplies decoded field width information to both the SOFU and the SIMD Instruction Execute Unit (SIEU). The SIDU also supplies decoded SIMD opcode information to the SIEU and a decoded $A$-bit register address for the destination register to the SIMD Result Write Back Unit (SRWBU).

The SOFU is the key component of the IDISA model that differs from its counterpart in a traditional SWAR processor.  For each of the two $A$-bit source register addresses, the SOFU is first responsible for fetching the raw operand values from the SRF. Then, before supplying operand values to the SIEU, the SOFU applies the half-operand modification logic as specified by the $h$, $l$, and field-width signals.  The possibly modified operand values are then provided to the SIEU for carrying out the SIMD operations. A detailed model of SOFU logic is described in the following subsection.

The SIEU differs from similar execution units in current commodity processors primarily by providing SIMD operations at each field width $n=2^k$ for $0 \leq k \leq K$.  This involves additional circuitry for field widths not supported in existing processors.
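To make the phrase ``SIMD operations at each field width $n=2^k$'' concrete, the following is a minimal behavioural sketch in Python of a field-wise addition at an arbitrary power-of-2 field width. It is only an illustrative software model: the function name `simd_add`, the use of Python integers as stand-ins for $N$-bit registers, and the default $N = 128$ are assumptions made for this example, not part of the IDISA libraries or the Verilog description.

```python
def simd_add(a, b, n, N=128):
    """Add two N-bit register values field by field, modulo 2**n per field.

    a, b -- register contents as non-negative Python integers (hypothetical model)
    n    -- field width, a power of 2 with 1 <= n <= N
    N    -- register width in bits, a power of 2
    """
    if n < 1 or n > N or (n & (n - 1)) != 0:
        raise ValueError("field width must be a power of 2 between 1 and N")
    field = (1 << n) - 1          # mask selecting a single n-bit field
    result = 0
    for shift in range(0, N, n):  # visit each of the N/n fields
        total = ((a >> shift) & field) + ((b >> shift) & field)
        result |= (total & field) << shift  # discard the carry out of the field
    return result
```

With an 8-bit register for brevity, `simd_add(0xFF, 0x01, 2, N=8)` wraps only the lowest 2-bit field, whereas `simd_add(0xFF, 0x01, 8, N=8)` wraps the whole register. At $n = 1$ the operation reduces to bitwise exclusive-or.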
For inductive doubling algorithms in support of parallel bit streams, the principal need is for additional circuitry to support 2-bit and 4-bit field widths.  This circuitry is generally less complicated than that for larger fields.  Support for these field widths has other applications as well.  For example, DNA sequences are frequently represented using packed sequences of 2-bit codes for the four possible nucleotides \cite{}, while the need for accurate financial calculation has seen a resurgence of the 4-bit packed BCD format for decimal floating point \cite{}.

When execution of the SWAR instruction is completed, the result value is provided to the SRWBU to update the value stored in the SRF at the address specified by the $A$-bit destination operand.

\subsection{Operand Fetch Unit Logic}
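As a rough behavioural sketch of the operand modification step performed by the SOFU, the Python model below takes a raw register value together with the $h$, $l$, and field-width signals and returns the value passed on to the SIEU. It assumes a particular reading of the modifiers that is not spelled out in the text above: $h$ is taken to select the high half of each $n$-bit field, shifted down into the low half, and $l$ is taken to select the low half of each field. The names `low_half_mask` and `sofu_modify` and the default $N = 128$ are hypothetical and chosen only for illustration.

```python
def low_half_mask(n, N):
    """Mask with the low n/2 bits of every n-bit field set, over N bits."""
    low_half = (1 << (n // 2)) - 1
    mask = 0
    for shift in range(0, N, n):
        mask |= low_half << shift
    return mask


def sofu_modify(value, n, h, l, N=128):
    """Apply an (assumed) half-operand modifier to a raw N-bit operand.

    h, l -- the two 1-bit modifier signals; at most one may be 1
    n    -- decoded field width, a power of 2
    """
    if h and l:
        raise ValueError("at most one of h and l may be 1")
    if not (h or l):
        return value & ((1 << N) - 1)   # no modifier: pass operand through
    if n < 2 or n > N or (n & (n - 1)) != 0:
        raise ValueError("half-operand modifiers need a power-of-2 width >= 2")
    mask = low_half_mask(n, N)
    if h:
        # Assumed h semantics: high half of each field moved to the low half.
        return (value >> (n // 2)) & mask
    # Assumed l semantics: keep only the low half of each field.
    return value & mask
```

For example, with an 8-bit register and 4-bit fields, the value `0xB4` (fields `1011` and `0100`) becomes `0x21` under $h$ and `0x30` under $l$. In hardware the same effect would be a fixed shift and mask selected by the field-width signal rather than a loop.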