Version 2 (modified by cameron, 6 years ago) (diff)


Bit-By-Bit Shuffle

Bit-by-bit shuffle works with vectors of i1 values.

Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors. For example the IDISA operation simd<1>::mergel(v1, v2) could be expressed as a shuffle vector operation.

shufflevector <128 x i1> %v1, <128 x i1> %v2,
                 <128 x i1> <i32 0, i32 128, i32 1, i32 129, i32 2, i32 130, i32 3, i32 131, ...> 

An architecture may not support shuffle at the bit-level, but bit-level merge can be implemented using four byte-level shuffles combined with shifting and bitwise logic.

Parallel Extract and Parallel Deposit

Intel's Haswell Architecture has two new pext and pdep that can be used for flexible extraction and placement of bits. These instructions work with either 32-bit or 64-bit general registers.

pext extracts bits from marked positions, producing a compressed result.

For example, given an 8-bit value hgfedcba and an 8-bit extraction mask 01100101, four bits would be extracted to produce the result 0000gfca.

pext operates in reverse, depositing bits at marked positions, placing 0 elsewhere.

For example, given an 8-bit value hgfedcba and an 8-bit deposit mask 01100101, four bits would be extracted to produce the result 0dc00b0a.

In these examples, bit patterns have been shown in the little-endian order used by in the x86 architecture.

Shuffle Vector Model for Parallel Deposit and Extract

For pext, the first argument is the bit vector from which bits are to be extracted, the second argument is a vector of all 0 bits, and the selection vector has the indexes of desired bits. For the 8-bit extraction mask, 01100101, the extraction positions of the four bits are 6,5,2,0 (little-endian bit numbering), so the 8-bit vector could be <0, 2, 5, 6, 8, 8, 8, 8> Here the selector 8 specifies the position of a bit in the second vector, known to be 0.

For pdep, we again have the bit vector from which bits are to be taken as the first argument, a constant 0 vector as the second argument, and an index vector for deposits. In the example, the vector for deposits is <0, 2, 5, 6, 8, 8, 8, 8>

The code generation problem is thus to recognize these shufflevector patterns and generate the appropriate pdep or pext calls.

The pext requirements are:

  1. the two input vector arguments are the same type, either <32 x i1> or <64 x i1>
  2. the second input vector argument is a constant 0 vector
  3. the selection pattern consists of two parts: a strictly increasing sequence of positions from the first argument (<0, 2, 5, 6> in the example), followed by a selectors from the second argument (<8, 8, 8, 8> in the example).

For pdep, only the selection pattern differs. In this case, the selection pattern must consist of interspersed selectors (<0, 1, 2, ...>) from the first vector and arbitrary selectors from the second vector.

LLVM Support for pdep and pext

From, we can see that LLVM supports the pdep and pext instructions, but only when directly accessed through the Intel intrinsics.

multiclass bmi_pdep_pext<string mnemonic, RegisterClass RC,
                         X86MemOperand x86memop, Intrinsic Int,
                         PatFrag ld_frag> {
  def rr : I<0xF5, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2),
             !strconcat(mnemonic, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
             [(set RC:$dst, (Int RC:$src1, RC:$src2))]>,
  def rm : I<0xF5, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, x86memop:$src2),
             !strconcat(mnemonic, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
             [(set RC:$dst, (Int RC:$src1, (ld_frag addr:$src2)))]>, VEX_4V;

let Predicates = [HasBMI2] in {
  defm PDEP32 : bmi_pdep_pext<"pdep{l}", GR32, i32mem,
                               int_x86_bmi_pdep_32, loadi32>, T8XD;
  defm PDEP64 : bmi_pdep_pext<"pdep{q}", GR64, i64mem,
                               int_x86_bmi_pdep_64, loadi64>, T8XD, VEX_W;
  defm PEXT32 : bmi_pdep_pext<"pext{l}", GR32, i32mem,
                               int_x86_bmi_pext_32, loadi32>, T8XS;
  defm PEXT64 : bmi_pdep_pext<"pext{q}", GR64, i64mem,
                               int_x86_bmi_pext_64, loadi64>, T8XS, VEX_W;