= IDISA Toolkit Project =
== Introduction to IDISA ==
Although there are now a great many defined SIMD instruction
set architectures such as Altivec, VIS, SSE, AVX, in
widespread use, there is no widely accepted low-level
programming model for cross-platform SIMD programming.
An early attempt to define portable SIMD instructions was that
of Fisher and Dietz, who coined the term SWAR (SIMD Within a
Register).
Randall J. Fisher and Henry G. Dietz, Compiling for SIMD Within a Register
Lecture Notes in Computer Science 1656: Languages and Compilers for Parallel Computing
Springer, Berlin, 1999, pp. 290-305.
IDISA (Inductive Doubling Instruction Set Architecture) is our
uniform programming model for high-performance SIMD (Single
Instruction Multiple Data) programming on multiple computing
platforms. While all modern processor families support
SIMD instruction sets, the specific instructions available
vary considerably from platform to platform. Furthermore,
the instruction set designs for each platform tend to
involve relatively ad hoc combinations of operations,
field widths and vertical or horizontal SIMD processing
models. In contrast, the IDISA architecture provides
a simple, general model with uniform treatment of SIMD
operations at all power-of-2 field widths as support
for fully general vertical and horizontal SIMD programming.
=== IDISA Vertical Operations ===
The following IDISA notation in C++ template syntax present
a fully general structure for vertical SIMD
operations for any given basic binary operation on power-of-2 field
widths. Let ''n'' = 2^''k''^ be the field width in bits.
Let ''f'' be a basic binary operation defined on ''n'' bit quantities
producing an ''n'' bit result. Let ''N'' be the SIMD vector
size in bits where ''N'' = 2^''K''^.
Then ''v''=simd<''n''>::''f''(''a'',''b'') denotes the general pattern for a
vertical SIMD operation yielding an output SIMD vector v,
given two input SIMD vectors ''a'' and ''b''. For each
field ''v'',,''i'',, of ''v'', the value computed is ''f''(''a,,i,,'', ''b,,i,,'').
For example, given 128-bit SIMD vectors,
simd<8>::add(a,b) represents the simultaneous addition of
sixteen 8-bit fields.
See the list of [wiki:IDISA_Vertical IDISA Vertical] operations for the
individual operations and their semantics.
=== IDISA Horizontal Operations ===
A slight variant of this notation provides a fully general
structure for horizontal SIMD operations combining pairs
of adjacent fields.
Given binary operation ''f'' on ''n'' bit fields and two ''N'' bit
vectors ''a'' and ''b'', let ''c'' be the ''2N'' bit concatenation of ''a'' and ''b''.
Then ''v''=hsimd<''n''>::''f''(''a'', ''b'') denotes the application of ''f''
in the horizontal combination of all sets of adjacent ''n'' bit fields of ''c'' such that
''v,,i,,,''=''f''(''c,,2i,,'', ''c,,2i+1,,'').
See the list of [wiki:IDISA_Horizontal IDISA Horizontal Packing] operations for the
individual operations and their semantics.
=== IDISA Expanding Operations ===
See the list of [wiki:IDISA_Expanding IDISA Expanding] operations for the
individual operations and their semantics.
=== IDISA Field Movement Operations ===
See the list of [wiki:IDISA_Movement IDISA Field Movement] operations for the
individual operations and their semantics.
== Project ==
The IDISA toolkit project is to support the use of IDISA as
a standard programming model for portable SIMD programming.
The project has the following components.
=== IDISA Generator Kit ===
The IDISA generator kit is used to generate IDISA
implementations for given source language/compiler/architecture
combinations. For example, we could generate an IDISA
language consist of a C library using GCC vector conventions
for the Power PC Altivec instruction set, or a C++ library
using MSVC conventions for the Intel SSE2 instruction set.
However, it should also have the flexibility for non-SIMD implementations
such as implementation of a Python library using Python
conventions for operations on unbounded bitstreams.
The generator kit should include optimization technology to
ensure that the best possible IDISA implementation is realized for
any given platform.
Note that there are lots of potential tricks. Another case occurs with the simd<16>::pack
1. For example,
consider the implementation of simd<2>::add_hl(a), where
addition is natively supported for only larger field widths.
A direct implementation requires 1 shift, two mask and one add
operation.
simd<2>::add_hl(a) = simd<16>::add(simd<16>::srli(a, 1) & simd<2>::constant(1), a & simd<2>:constant(1))
But one of the masks can be eliminated by taking advantage
of the properties of 2-bit subtraction.
simd<2>::add_hl(a) = simd<16>::sub(a, simd<16>::srli(a, 1) & simd<2>::constant(1))
=== IDISA Test Generator ===
The test generator complements the generator kit by producing
a comprehensive test suite for correctness testing of IDISA
implementations.
=== IDISA Compile-Time Specialization Kit ===
The compile-time
specialization kit is used to provide optimized implementations
of IDISA under known static properties of operand values.
For example, if it is known that the high bit of each 4-bit
field in registers a and b is zero, then a simd<4>::add(a,b)
operation with no direct implementation on a particular
platform can be realized by a wider-width operation that is,
such as simd<16>::add(a,b) on most platforms.
Another case is implementing the IDISA nonsaturating
packl using the saturating pack found with SSE, for example.
In this case, the default definition requires masking:
template<>
inline SIMD_type simd<16>::packl(SIMD_type r1, SIMD_type r2) {
return _mm_packus_epi16(simd_andc(r2, simd<16>::himask()), simd_andc(r1, simd<16>::himask()));
}
But, if we know that the high byte of each of r1 and r2 are
zero, then the masks are not required. This might be the case,
for example, if we have a run of UTF-16 code units in the
ASCII range.
=== IDISA Reverse Instruction Optimizer. ===
Various processor
architectures provide combined SIMD operations that correspond
to sequences of IDISA instructions. For example, the Intel
PSADBW performs a packed sum of absolute differences corresponding
to the following 5 IDISA operations.
t1 = simd<8>::abs(simd<8>::sub(a,b))[[BR]]
psadbw = simd<64>::add(simd<32>::add(simd<16>::add(t1)))[[BR]]
The reverse instruction optimizer uses knowledge of these
available optimized forms to generate optimized implementations
where appropriate IDISA instruction sequences may be found.
Note that the recognition may involve special case logic:
psadbw can be efficiently used for the 8-field horizontal
addition: simd<64>::add(simd<32>::add(simd<16>::add(x)))
using psadbw(x, 0).