Version 4 (modified by cameron, 9 years ago) (diff)


IDISA Toolkit Project

Introduction to IDISA

Although there are now a great many defined SIMD instruction set architectures such as Altivec, VIS, SSE, AVX, in widespread use, there is no widely accepted low-level programming model for cross-platform SIMD programming.

An early attempt to define portable SIMD instructions was that of Fisher and Dietz, who coined the term SWAR (SIMD Within a Register).

Randall J. Fisher and Henry G. Dietz, Compiling for SIMD Within a Register Lecture Notes in Computer Science 1656: Languages and Compilers for Parallel Computing Springer, Berlin, 1999, pp. 290-305.

IDISA (Inductive Doubling Instruction Set Architecture) is our uniform programming model for high-performance SIMD (Single Instruction Multiple Data) programming on multiple computing platforms. While all modern processor families support SIMD instruction sets, the specific instructions available vary considerably from platform to platform. Furthermore, the instruction set designs for each platform tend to involve relatively ad hoc combinations of operations, field widths and vertical or horizontal SIMD processing models. In contrast, the IDISA architecture provides a simple, general model with uniform treatment of SIMD operations at all power-of-2 field widths as support for fully general vertical and horizontal SIMD programming.

IDISA Vertical SIMD Syntax

The following IDISA notation in C++ template syntax present a fully general structure for vertical SIMD operations for any given basic binary operation on power-of-2 field widths. Let n = 2k be the field width in bits. Let f be a basic binary operation defined on n bit quantities producing an n bit result. Let N be the SIMD vector size in bits where N = 2K. Then v=simd<n>::f(a,b) denotes the general pattern for a vertical SIMD operation yielding an output SIMD vector v, given two input SIMD vectors a and b. For each field vi of v, the value computed is f(ai, bi). For example, given 128-bit SIMD vectors, simd<8>::add(a,b) represents the simultaneous addition of sixteen 8-bit fields.

IDISA Horizontal SIMD Syntax

A slight variant of this notation provides a fully general structure for horizontal SIMD operations combining pairs of adjacent fields. Each n-bit field x is viewed as a pair of adjacent n/2 bit fields h(x) for the high n/2 bits of x and l(x) for the low n/2 bits of x. Given binary operation f on n bit fields and an N bit vector a, v=simd_hl<n>::f(a) denotes the application of f in the horizontal combination of all sets of adjacent fields of a, such that vi,=f(h(ai), l(ai)). Thus, simd_hl<16>::add(x) denotes the 16-bit addition of all pairs of adjacent 8-bit fields of x to produce a vector of 16-bit results.

The horizontal combinations under IDISA are also designed to support inductive doubling: the repeated transitions from n/2 to n bit fields widths. For example, the horizontal combination of sets of four adjacent 8-bit fields of a vector x into 32-bit sums can be expressed in two IDISA steps: simd_hl<32>::add(simd_hl<16>:add(x)). Similarly horizontal combinations of eight fields require 3 steps. Note also that important basic operations such as population count and parity on n = 2k bit fields can be expressed using $k$ step inductive doubling algorithms. For example, population count within 8-bit fields of a vector v is computed by simd_hl<8>::add(simd_hl<4>:add(simd_hl<2>:add(v))), while parity within 4-bit fields is computed by simd_hl<4>:xor(simd_hl<2>:xor(v)).


The IDISA toolkit project is to support the use of IDISA as a standard programming model for portable SIMD programming. The project has the following components.

IDISA Generator Kit

The IDISA generator kit is used to generate IDISA implementations for given source language/compiler/architecture combinations. For example, we could generate an IDISA language consist of a C library using GCC vector conventions for the Power PC Altivec instruction set, or a C++ library using MSVC conventions for the Intel SSE2 instruction set. However, it should also have the flexibility for non-SIMD implementations such as implementation of a Python library using Python conventions for operations on unbounded bitstreams.

The generator kit should include optimization technology to ensure that the best possible IDISA implementation is realized for any given platform.

Note that there are lots of potential tricks. For example, consider the implementation of simd_hl<2>::add(a), where addition is natively supported for only larger field widths. A direct implementation requires 1 shift, two mask and one add operation.

simd<16>::add(simd<16>::srli(a, 1) & simd<2>:constant(1), a & simd<2>:constant(1))

But one of the masks can be eliminated by taking advantage of the properties of 2-bit subtraction simd<16>::sub(a, simd<16>::srli(a, 1) & simd<2>:constant(1))

IDISA Test Generator

The test generator complements the generator kit by producing a comprehensive test suite for correctness testing of IDISA implementations.

IDISA Compile-Time Specialization Kit

The compile-time specialization kit is used to provide optimized implementations of IDISA under known static properties of operand values. For example, if it is known that the high bit of each 4-bit field in registers a and b is zero, then a simd<4>::add(a,b) operation with no direct implementation on a particular platform can be realized by a wider-width operation that is, such as simd<16>::add(a,b) on most platforms.

IDISA Reverse Instruction Optimizer.

Various processor architectures provide combined SIMD operations that correspond to sequences of IDISA instructions. For example, the Intel PSADBW performs a packed sum of absolute differences corresponding to the following 5 IDISA operations.

t1 = simd<8>::abs(simd<8>::sub(a,b))
psadbw = simd<64>::add(simd<32>::add(simd<16>::add(t1)))

The reverse instruction optimizer uses knowledge of these available optimized forms to generate optimized implementations where appropriate IDISA instruction sequences may be found. Note that the recognition may involve special case logic: psadbw can be efficiently used for the 8-field horizontal addition: simd<64>::add(simd<32>::add(simd<16>::add(x))) using psadbw(x, 0).