Version 6 (modified by cameron, 4 years ago) (diff) |
---|

# The ShuffleVector Project

This project investigates code generation for the LLVM shufflevector operation, particularly in the case that the shuffle mask is a compile-time constant.

For example, the shuffle mask pattern for a shuffle vector could be just a byte swap, for example.

%v3 = shufflevector <8 x i8> %v1, <8 x i8> undef, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1> ; yields <8 x i8>

Transforming this to `%t0 = bitcast %v1 to i64 @llvm.bswap.i64(i64 %t0) `
may allow efficient implementation
on an architecture supporting byte swap, but not shuffle.

Generalizing this pattern, we may have arbitrary rotations expressed using shuffle masks. For example, consider the shufflevector of 4-bit fields:

%v3 = shufflevector <8 x i4> %v1, <8 x i4> undef, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4> ; yields <8 x i8>

Shuffles on 4-bit fields are generally not supported by SIMD instruction sets, but this one can be implemented by transforming to 16-bit vector shift operations.

%t0 = bitcast %v1 to <2 x i16> %t1 = shl %t0, <2 x i16> <i16 12, i16 12> %t2 = lshr %t0, <2 x i16> <i16 4, i16 4> %v3 = xor %t1, %t2

Can these examples be turned into general rules that systematically capture these special cases?

## Vectorized Sequential Code

If there is no known fully parallel implementation of a particular case, it may still be possible to partially parallelize by making a vectorized sequential loop.

## Shuffling Bit-By-Bit

There are many interesting possibilities with shuffle masks that perform bit-by-bit selection.

Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors.
For example the IDISA operation `simd<1>::mergel(v1, v2)` could be expressed as a shuffle vector
operation.

shufflevector <128 x i1> %v1, <128 x i1> %v2, <128 x i1> <i32 0, i32 128, i32 1, i32 129, i32 2, i32 130, i32 3, i32 131, ...>

An architecture may not support shuffle at the bit-level, but bit-level merge can be implemented using four byte-level shuffles combined with shifting and bitwise logic.

## Parallel Extract and Parallel Deposit

Intel's Haswell Architecture has two new `pext` and `pdep` that can
be used for flexible extraction and placement of bits.
These instructions work with either 32-bit or
64-bit general registers.

`pext` extracts bits from marked positions, producing a
compressed result.

For example, given an 8-bit value `hgfedcba` and an
8-bit extraction mask `01100101`, four bits would be
extracted to produce the result `0000gfca`.

`pext` operates in reverse, depositing bits at marked
positions, placing 0 elsewhere.

For example, given an 8-bit value `hgfedcba` and an
8-bit deposit mask `01100101`, four bits would be
extracted to produce the result `0dc00b0a`.

In these examples, bit patterns have been shown in the little-endian order used by in the x86 architecture.

Both operations can be modeled by shufflevector.

For `pext`, the first argument is the bit vector from
which bits are to be extracted, the second argument
is a vector of all 0 bits, and the selection
vector has the indexes of desired bits. For the
8-bit extraction mask, `01100101`, the extraction
positions of the four bits are 6,5,2,0 (little-endian
bit numbering), so the 8-bit vector could be
`<0, 2, 5, 6, 8, 8, 8, 8>`
Here the selector 8 specifies the position of a
bit in the second vector, known to be 0.

For `pdep`, we again have the bit vector from which bits
are to be taken as the first argument, a constant 0
vector as the second argument, and an index vector
for deposits. In the example, the vector for deposits is
`<0, 2, 5, 6, 8, 8, 8, 8>`

The code generation problem is thus to recognize these
shufflevector patterns and generate the appropriate `pdep` or
`pext` calls.

The `pext` requirements are:

- the two input vector arguments are the same type, either<32 x i1> or <64 x i1>
- the second input vector argument is a constant 0 vector
- the selection pattern consists of two parts: a strictly increasing sequence of positions from the first argument (<0, 2, 5, 6> in the example), followed by a selectors from the second argument (<8, 8, 8, 8> in the example).

For `pdep`, only the selection pattern differs. In this case, the selection pattern must consist of interspersed
selectors (<0, 1, 2, ...>) from the first vector and arbitrary selectors from the second vector.