Version 2 (modified by cameron, 3 years ago) (diff) |
---|

# The ShuffleVector Project

This project investigates code generation for the LLVM shufflevector operation, particularly in the case that the shuffle mask is a compile-time constant.

For example, the shuffle mask pattern for a shuffle vector could be just a byte swap, for example.

%v3 = shufflevector <8 x i8> %v1, <8 x i8> undef, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6> ; yields <8 x i8>

Transforming this to {{{ %t0 = bitcast %v1 to i64 @llvm.bswap.i64(i64 %t0) }}} may allow efficient implementation on an architecture supporting byte swap, but not shuffle.

Generalizing this pattern, we may have arbitrary rotations expressed using shuffle masks. For example, consider the shufflevector of 4-bit fields:

%v3 = shufflevector <8 x i4> %v1, <8 x i4> undef, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4> ; yields <8 x i8>

Shuffles on 4-bit fields are generally not supported by SIMD instruction sets, but this one can be implemented by transforming to 16-bit vector shift operations.

%t0 = bitcast %v1 to <2 x i16> %t1 = shl %t0, <2 x i16> <i16 12, i16 12> %t2 = lshr4 %t0, <2 x i16> <i16 4, i16 4> %v3 = xor %t1, %t2

Can these examples be turned into general rules that systematically capture these special cases?

## Vectorized Sequential Code

If there is no known fully parallel implementation of a particular case, it may still be possible to partially parallelize by making a vectorized sequential loop.

## Shuffling Bit-By-Bit

There are many interesting possibilities with shuffle masks that perform bit-by-bit selection.

Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors.
For example the IDISA operation `simd<1>::mergel(v1, v2)` could be expressed as a shuffle vector
operation.

shufflevector <128 x i1> %v1, <128 x i1> %v2, <128 x i1> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, ...>

An architecture may not support shuffle at the bit-level, but bit-level merge can be implemented using four byte-level shuffles combined with shifting and bitwise logic.

## Parallel Extract and Parallel Deposit

Intel's Haswell Architecture has two new 64-bit instructions `pext` and `pdep` that can
be used for flexible extraction and placement of bits. Code generation for `shufflevector` in
terms of these operations is certainly worth study.