Version 10 (modified by cameron, 5 years ago) (diff) |
---|

# The ShuffleVector Project

This project investigates code generation for the LLVM shufflevector operation, particularly in the case that the shuffle mask is a compile-time constant.

For example, the shuffle mask pattern for a shuffle vector could be just a byte swap, for example.

%v3 = shufflevector <8 x i8> %v1, <8 x i8> undef, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1> ; yields <8 x i8>

Transforming this to `%t0 = bitcast %v1 to i64 @llvm.bswap.i64(i64 %t0) `
may allow efficient implementation
on an architecture supporting byte swap, but not shuffle.

Generalizing this pattern, we may have arbitrary rotations expressed using shuffle masks. For example, consider the shufflevector of 4-bit fields:

%v3 = shufflevector <8 x i4> %v1, <8 x i4> undef, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4> ; yields <8 x i8>

Shuffles on 4-bit fields are generally not supported by SIMD instruction sets, but this one can be implemented by transforming to 16-bit vector shift operations.

%t0 = bitcast %v1 to <2 x i16> %t1 = shl %t0, <2 x i16> <i16 12, i16 12> %t2 = lshr %t0, <2 x i16> <i16 4, i16 4> %v3 = xor %t1, %t2

Can these examples be turned into general rules that systematically capture these special cases?

The Shuffle Pattern Library attempts to do this.

## Lai ShengZhang?'s GitHub

https://github.com/laishzh/LLVM_ShuffleVector_Optimizer

## Vectorized Sequential Code

If there is no known fully parallel implementation of a particular case, it may still be possible to partially parallelize by making a vectorized sequential loop.

## Shuffling Bit-By-Bit

There are many interesting possibilities with shuffle masks that perform bit-by-bit selection. See the BitShuffle subproject.