IDISA Toolkit Project

Introduction to IDISA

Although a great many SIMD instruction set architectures, such as Altivec, VIS, SSE and AVX, are now in widespread use, there is no widely accepted low-level programming model for cross-platform SIMD programming.

An early attempt to define portable SIMD instructions was that of Fisher and Dietz, who coined the term SWAR (SIMD Within a Register).

Randall J. Fisher and Henry G. Dietz, "Compiling for SIMD Within a Register," in Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science 1656, Springer, Berlin, 1999, pp. 290-305.

IDISA (Inductive Doubling Instruction Set Architecture) is our uniform programming model for high-performance SIMD (Single Instruction Multiple Data) programming on multiple computing platforms. While all modern processor families support SIMD instruction sets, the specific instructions available vary considerably from platform to platform. Furthermore, the instruction set designs for each platform tend to involve relatively ad hoc combinations of operations, field widths and vertical or horizontal SIMD processing models. In contrast, the IDISA architecture provides a simple, general model with uniform treatment of SIMD operations at all power-of-2 field widths, as well as support for fully general vertical and horizontal SIMD programming.

IDISA Vertical Operations: Template Class simd<w>

The following IDISA notation in C++ template syntax presents a fully general structure for vertical SIMD operations for any given basic binary operation on power-of-2 field widths. Let w = 2^k be the field width in bits. Let f be a basic binary operation defined on w-bit quantities producing a w-bit result. Let W be the SIMD vector size in bits, where W = 2^K. Then v = simd<w>::f(a,b) denotes the general pattern for a vertical SIMD operation yielding an output SIMD vector v, given two input SIMD vectors a and b. For each field v_i of v, the value computed is f(a_i, b_i). For example, given 128-bit SIMD vectors, simd<8>::add(a,b) represents the simultaneous addition of sixteen 8-bit fields.
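To make the vertical pattern concrete, here is a minimal scalar sketch of simd<w>::add. It assumes a simplified model in which a "SIMD vector" is a single 64-bit word (W = 64) rather than a hardware register; the Vec alias and the field_mask helper are illustrative names, not part of any IDISA library.

```cpp
#include <cstdint>

// Model assumption: a SIMD vector is one 64-bit word (W = 64).
using Vec = std::uint64_t;

template <unsigned w>
struct simd {
    static_assert(w >= 1 && w <= 64 && (w & (w - 1)) == 0,
                  "field width must be a power of 2 no larger than W");

    // Mask selecting the low w bits of a word (hypothetical helper).
    static constexpr Vec field_mask() { return ~Vec{0} >> (64 - w); }

    // v = simd<w>::add(a, b): each field v_i = (a_i + b_i) mod 2^w,
    // computed one field at a time so no carry crosses a field boundary.
    static constexpr Vec add(Vec a, Vec b) {
        Vec v = 0;
        for (unsigned pos = 0; pos < 64; pos += w) {
            Vec ai = (a >> pos) & field_mask();
            Vec bi = (b >> pos) & field_mask();
            v |= ((ai + bi) & field_mask()) << pos;
        }
        return v;
    }
};
```

In this model, simd<8>::add(0x01FF, 0x0101) wraps the low byte and yields 0x0200, while simd<16>::add on the same inputs lets the carry propagate and yields 0x0300. The field width alone determines where carries stop.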

See the list of IDISA Vertical operations for the individual operations and their semantics.

IDISA Horizontal Packing Operations: Template Class hsimd<w>

A slight variant of this notation provides a general structure for horizontal SIMD operations with packing. In operating on vectors of w-bit fields, these operations generally produce results in narrower fields, typically w/2.

See the list of IDISA Horizontal Packing operations for the individual operations and their semantics.

IDISA Expansion Operations: Template Class esimd<w>

IDISA expansion operations use basic operations that double the width of data fields. Let g be a basic binary operation on w-bit fields that produces 2w-bit results. Given W-bit vectors a and b of w-bit fields, applying g to all corresponding fields of a and b yields an overall 2W-bit result, represented as the concatenation of two W-bit vectors esimd<w>::gh(a, b) and esimd<w>::gl(a, b), as follows.

  • esimd<w>::gh(a, b) = concatenation of g(a_i, b_i) for 1 <= i <= W/(2w)
  • esimd<w>::gl(a, b) = concatenation of g(a_i, b_i) for W/(2w)+1 <= i <= W/w
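The two definitions above can be sketched with g taken to be unsigned multiplication. This is a scalar model under stated assumptions: a vector is a single 64-bit word (W = 64), field 1 is taken to be the most significant field (so the "h" form covers the upper half of the inputs), and the names mulh, mull and mul_half are chosen here to follow the gh/gl pattern rather than taken from an IDISA operation list.

```cpp
#include <cstdint>

// Model assumption: a SIMD vector is one 64-bit word (W = 64).
using Vec = std::uint64_t;

template <unsigned w>
struct esimd {
    static_assert(w >= 1 && w <= 32 && (w & (w - 1)) == 0,
                  "field width must be a power of 2 no larger than W/2");

    static constexpr Vec field_mask() { return ~Vec{0} >> (64 - w); }

    // Multiply the W/(2w) fields found in one half of a and b, starting at
    // bit offset `base`, and concatenate the 2w-bit products (hypothetical helper).
    static constexpr Vec mul_half(Vec a, Vec b, unsigned base) {
        Vec v = 0;
        for (unsigned k = 0; k < 64 / (2 * w); ++k) {
            Vec ai = (a >> (base + k * w)) & field_mask();
            Vec bi = (b >> (base + k * w)) & field_mask();
            v |= (ai * bi) << (k * 2 * w);
        }
        return v;
    }

    // Assumption: the "h" form takes the upper-half fields, the "l" form the lower half.
    static constexpr Vec mulh(Vec a, Vec b) { return mul_half(a, b, 32); }
    static constexpr Vec mull(Vec a, Vec b) { return mul_half(a, b, 0); }
};
```

For example, esimd<8>::mull applied to two vectors whose lowest byte is 0xFF produces 0xFE01 in the lowest 16-bit field, which would not fit in the original 8-bit field.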

See the list of IDISA Expansion operations for the individual operations and their semantics.

IDISA Field Movement Operations: Template Class mvmd<w>

This class contains operations that copy and/or move the contents of fields to different locations within vectors, while otherwise leaving contents unchanged.

See the list of IDISA Field Movement operations for the individual operations and their semantics.

IDISA Full Register Operations: Nontemplate Class bitblock

This class contains operations that work with the contents of SIMD registers as undivided bitblocks.

See the list of IDISA Bit Block operations for the individual operations and their semantics.


The IDISA toolkit project aims to support the use of IDISA as a standard programming model for portable SIMD programming. The project has the following components.

IDISA Generator Kit

The IDISA generator kit is used to generate IDISA implementations for given source language/compiler/architecture combinations. For example, we could generate an IDISA implementation consisting of a C library using GCC vector conventions for the PowerPC Altivec instruction set, or a C++ library using MSVC conventions for the Intel SSE2 instruction set. However, it should also have the flexibility to produce non-SIMD implementations, such as a Python library using Python conventions for operations on unbounded bitstreams.

The generator kit should include optimization technology to ensure that the best possible IDISA implementation is realized for any given platform.

Note that there are lots of potential tricks; one case occurs with simd<16>::pack. As another example, consider the implementation of simd<2>::add_hl(a), where addition is natively supported only at larger field widths. A direct implementation requires one shift, two mask operations and one add operation.

simd<2>::add_hl(a) = simd<16>::add(simd<16>::srli(a, 1) & simd<2>::constant(1), a & simd<2>::constant(1))

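But one of the masks can be eliminated by taking advantage of the properties of 2-bit subtraction: a 2-bit field with high bit h and low bit l has the value 2h + l, so subtracting h leaves h + l, and no borrow can occur because 2h + l >= h.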

simd<2>::add_hl(a) = simd<16>::sub(a, simd<16>::srli(a, 1) & simd<2>::constant(1))
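The equivalence of the two formulations can be checked exhaustively. The sketch below models a single 16-bit field holding eight 2-bit fields (the simd<16> operations act on each 16-bit field independently, so one field suffices); the function names and the kLowBits constant are illustrative only.

```cpp
#include <cstdint>

// simd<2>::constant(1) restricted to one 16-bit field: 01 repeated eight times.
constexpr std::uint16_t kLowBits = 0x5555;

// Direct form: one shift, two masks, one add.
constexpr std::uint16_t add_hl_direct(std::uint16_t a) {
    return static_cast<std::uint16_t>(((a >> 1) & kLowBits) + (a & kLowBits));
}

// Reduced form: one shift, one mask, one subtract.
constexpr std::uint16_t add_hl_sub(std::uint16_t a) {
    return static_cast<std::uint16_t>(a - ((a >> 1) & kLowBits));
}

// Reference: add the high bit and low bit of each 2-bit field individually.
constexpr std::uint16_t add_hl_reference(std::uint16_t a) {
    std::uint16_t v = 0;
    for (unsigned pos = 0; pos < 16; pos += 2) {
        unsigned hi = (a >> (pos + 1)) & 1u;
        unsigned lo = (a >> pos) & 1u;
        v = static_cast<std::uint16_t>(v | ((hi + lo) << pos));
    }
    return v;
}

// Compare all three forms over every possible 16-bit input.
inline bool add_hl_forms_agree() {
    for (std::uint32_t x = 0; x <= 0xFFFFu; ++x) {
        const auto a = static_cast<std::uint16_t>(x);
        if (add_hl_direct(a) != add_hl_reference(a)) return false;
        if (add_hl_sub(a) != add_hl_reference(a)) return false;
    }
    return true;
}
```

As a spot check, the input 0xE4 (fields 11, 10, 01, 00) gives 0x94 (fields 10, 01, 01, 00) under both forms.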

IDISA Test Generator

The test generator complements the generator kit by producing a comprehensive test suite for correctness testing of IDISA implementations.

IDISA Compile-Time Specialization Kit

The compile-time specialization kit is used to provide optimized implementations of IDISA under known static properties of operand values. For example, if it is known that the high bit of each 4-bit field in registers a and b is zero, then a simd<4>::add(a,b) operation with no direct implementation on a particular platform can be realized by a wider operation that is directly implemented, such as simd<16>::add(a,b) on most platforms. This works because each field sum is then at most 14 and cannot carry into the neighboring field.
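A small scalar sketch of this specialization follows, modeling one 16-bit field that holds four 4-bit fields. The function names are hypothetical, and the precondition is checked by a helper here rather than proven statically as the specialization kit would require.

```cpp
#include <cstdint>

// Reference simd<4>::add on the four 4-bit fields of one 16-bit word:
// each field sum is reduced mod 16, so carries never cross fields.
constexpr std::uint16_t add4_reference(std::uint16_t a, std::uint16_t b) {
    std::uint16_t v = 0;
    for (unsigned pos = 0; pos < 16; pos += 4) {
        unsigned sum = ((a >> pos) & 0xFu) + ((b >> pos) & 0xFu);
        v = static_cast<std::uint16_t>(v | ((sum & 0xFu) << pos));
    }
    return v;
}

// Specialized form: a plain 16-bit add.
// Precondition (assumed, not checked): the high bit of every 4-bit field
// of a and b is zero.
constexpr std::uint16_t add4_specialized(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}

// The static property the specialization relies on.
constexpr bool high_bits_clear(std::uint16_t x) {
    return (x & 0x8888u) == 0;
}
```

With a = b = 0x7777 the precondition holds and both forms give 0xEEEE. With a = b = 0x0008 the precondition fails, and the wide add leaks a carry into the next field (0x0010 instead of 0x0000), which is why the static property must be known before the substitution is made.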

Another case is implementing the IDISA nonsaturating packl using the saturating pack found with SSE, for example. In this case, the default definition requires masking:

template<> inline SIMD_type simd<16>::packl(SIMD_type r1, SIMD_type r2) {
  // Clear the high byte of each 16-bit field so that the saturating pack cannot saturate.
  return _mm_packus_epi16(simd_andc(r2, simd<16>::himask()), simd_andc(r1, simd<16>::himask()));
}

But if we know that the high byte of each 16-bit field of r1 and r2 is zero, then the masks are not required. This might be the case, for example, if we have a run of UTF-16 code units in the ASCII range.
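The role of the mask can be illustrated with a scalar model of a single 16-bit lane. The sketch assumes the documented behavior of _mm_packus_epi16 (each signed 16-bit input is saturated to the unsigned 8-bit range); all function names below are illustrative, not IDISA or SSE identifiers.

```cpp
#include <cstdint>

// Interpret a 16-bit pattern as a signed value, as the saturating pack does.
constexpr int as_signed16(std::uint16_t x) {
    return x >= 0x8000u ? static_cast<int>(x) - 0x10000 : static_cast<int>(x);
}

// Scalar model of one lane of the saturating pack: clamp to [0, 255].
constexpr std::uint8_t packus_lane(std::uint16_t x) {
    const int s = as_signed16(x);
    if (s < 0) return 0;
    if (s > 255) return 255;
    return static_cast<std::uint8_t>(s);
}

// Default packl: mask off the high byte first, so the clamp never triggers.
constexpr std::uint8_t packl_masked(std::uint16_t x) {
    return packus_lane(static_cast<std::uint16_t>(x & 0x00FFu));
}

// Specialized packl: no mask.
// Precondition (assumed): the high byte of x is zero.
constexpr std::uint8_t packl_unmasked(std::uint16_t x) {
    return packus_lane(x);
}
```

For the ASCII code unit 0x0041 both forms return 0x41. For 0x0141, which violates the precondition, the masked form returns the low byte 0x41 while the unmasked form saturates to 0xFF.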

IDISA Reverse Instruction Optimizer

Various processor architectures provide combined SIMD operations that correspond to sequences of IDISA instructions. For example, the Intel PSADBW instruction performs a packed sum of absolute differences, corresponding to the following 5 IDISA operations.

t1 = simd<8>::abs(simd<8>::sub(a,b))
psadbw = simd<64>::add_hl(simd<32>::add_hl(simd<16>::add_hl(t1)))

The reverse instruction optimizer uses knowledge of these combined operations to generate optimized implementations wherever matching IDISA instruction sequences are found. Note that the recognition may involve special-case logic: for example, the 8-field horizontal addition simd<64>::add_hl(simd<32>::add_hl(simd<16>::add_hl(x))) can be implemented efficiently as psadbw(x, 0).
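The special case can be sketched in scalar form. The model below treats a vector as one 64-bit word, which corresponds to a single 64-bit lane of PSADBW (on 128-bit registers the instruction sums each 64-bit half separately). The function names are illustrative, and add_hl is written here as a free function template rather than as a member of simd<w>.

```cpp
#include <cstdint>

// Model assumption: one 64-bit word stands for one 64-bit lane of a SIMD register.
using Vec = std::uint64_t;

// simd<w>::add_hl(a): in each w-bit field, add the high w/2 bits to the low w/2 bits.
template <unsigned w>
constexpr Vec add_hl(Vec a) {
    static_assert(w >= 2 && w <= 64 && (w & (w - 1)) == 0,
                  "field width must be a power of 2 between 2 and 64");
    const Vec half_mask = ~Vec{0} >> (64 - w / 2);
    Vec v = 0;
    for (unsigned pos = 0; pos < 64; pos += w) {
        Vec lo = (a >> pos) & half_mask;
        Vec hi = (a >> (pos + w / 2)) & half_mask;
        v |= (hi + lo) << pos;
    }
    return v;
}

// Scalar model of one PSADBW lane: sum of |a_i - b_i| over eight unsigned bytes.
constexpr Vec psadbw_model(Vec a, Vec b) {
    Vec sum = 0;
    for (unsigned pos = 0; pos < 64; pos += 8) {
        Vec ai = (a >> pos) & 0xFFu;
        Vec bi = (b >> pos) & 0xFFu;
        sum += ai > bi ? ai - bi : bi - ai;
    }
    return sum;
}

// The 8-field horizontal addition written as the IDISA add_hl chain.
constexpr Vec hsum_bytes(Vec x) {
    return add_hl<64>(add_hl<32>(add_hl<16>(x)));
}
```

With x = 0x0102030405060708, both hsum_bytes(x) and psadbw_model(x, 0) evaluate to 0x24 (36), since the absolute difference of an unsigned byte and zero is the byte itself.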