Version 6 (modified by cameron, 3 years ago) (diff)


The ShuffleVector Project

This project investigates code generation for the LLVM shufflevector operation, particularly in the case that the shuffle mask is a compile-time constant.

For example, the shuffle mask pattern for a shuffle vector could be just a byte swap, for example.

%v3 = shufflevector <8 x i8> %v1, <8 x i8> undef,
                    <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>  ; yields <8 x i8>

Transforming this to %t0 = bitcast %v1 to i64 @llvm.bswap.i64(i64 %t0) may allow efficient implementation on an architecture supporting byte swap, but not shuffle.

Generalizing this pattern, we may have arbitrary rotations expressed using shuffle masks. For example, consider the shufflevector of 4-bit fields:

%v3 = shufflevector <8 x i4> %v1, <8 x i4> undef,
              <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4>  ; yields <8 x i8>

Shuffles on 4-bit fields are generally not supported by SIMD instruction sets, but this one can be implemented by transforming to 16-bit vector shift operations.

%t0 = bitcast %v1 to <2 x i16>
%t1 = shl %t0, <2 x i16> <i16 12, i16 12> 
%t2 = lshr %t0, <2 x i16> <i16 4, i16 4> 
%v3 = xor %t1, %t2

Can these examples be turned into general rules that systematically capture these special cases?

Vectorized Sequential Code

If there is no known fully parallel implementation of a particular case, it may still be possible to partially parallelize by making a vectorized sequential loop.

Shuffling Bit-By-Bit

There are many interesting possibilities with shuffle masks that perform bit-by-bit selection.

Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors. For example the IDISA operation simd<1>::mergel(v1, v2) could be expressed as a shuffle vector operation.

shufflevector <128 x i1> %v1, <128 x i1> %v2,
                 <128 x i1> <i32 0, i32 128, i32 1, i32 129, i32 2, i32 130, i32 3, i32 131, ...> 

An architecture may not support shuffle at the bit-level, but bit-level merge can be implemented using four byte-level shuffles combined with shifting and bitwise logic.

Parallel Extract and Parallel Deposit

Intel's Haswell Architecture has two new pext and pdep that can be used for flexible extraction and placement of bits. These instructions work with either 32-bit or 64-bit general registers.

pext extracts bits from marked positions, producing a compressed result.

For example, given an 8-bit value hgfedcba and an 8-bit extraction mask 01100101, four bits would be extracted to produce the result 0000gfca.

pext operates in reverse, depositing bits at marked positions, placing 0 elsewhere.

For example, given an 8-bit value hgfedcba and an 8-bit deposit mask 01100101, four bits would be extracted to produce the result 0dc00b0a.

In these examples, bit patterns have been shown in the little-endian order used by in the x86 architecture.

Both operations can be modeled by shufflevector.

For pext, the first argument is the bit vector from which bits are to be extracted, the second argument is a vector of all 0 bits, and the selection vector has the indexes of desired bits. For the 8-bit extraction mask, 01100101, the extraction positions of the four bits are 6,5,2,0 (little-endian bit numbering), so the 8-bit vector could be <0, 2, 5, 6, 8, 8, 8, 8> Here the selector 8 specifies the position of a bit in the second vector, known to be 0.

For pdep, we again have the bit vector from which bits are to be taken as the first argument, a constant 0 vector as the second argument, and an index vector for deposits. In the example, the vector for deposits is <0, 2, 5, 6, 8, 8, 8, 8>

The code generation problem is thus to recognize these shufflevector patterns and generate the appropriate pdep or pext calls.

The pext requirements are:

  1. the two input vector arguments are the same type, either<32 x i1> or <64 x i1>
  2. the second input vector argument is a constant 0 vector
  3. the selection pattern consists of two parts: a strictly increasing sequence of positions from the first argument (<0, 2, 5, 6> in the example), followed by a selectors from the second argument (<8, 8, 8, 8> in the example).

For pdep, only the selection pattern differs. In this case, the selection pattern must consist of interspersed selectors (<0, 1, 2, ...>) from the first vector and arbitrary selectors from the second vector.