# Bit-By-Bit Shuffle

Bit-by-bit shuffle works with vectors of i1 values.

Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors.
For example the IDISA operation `simd<1>::mergel(v1, v2)` could be expressed as a shuffle vector
operation.

shufflevector <128 x i1> %v1, <128 x i1> %v2, <128 x i1> <i32 0, i32 128, i32 1, i32 129, i32 2, i32 130, i32 3, i32 131, ...>

An architecture may not support shuffle at the bit-level, but bit-level merge can be implemented using four byte-level shuffles combined with shifting and bitwise logic.

## Parallel Extract and Parallel Deposit

Intel's Haswell Architecture has two new `pext` and `pdep` that can
be used for flexible extraction and placement of bits.
These instructions work with either 32-bit or
64-bit general registers.

`pext` extracts bits from marked positions, producing a
compressed result.

For example, given an 8-bit value `hgfedcba` and an
8-bit extraction mask `01100101`, four bits would be
extracted to produce the result `0000gfca`.

`pext` operates in reverse, depositing bits at marked
positions, placing 0 elsewhere.

For example, given an 8-bit value `hgfedcba` and an
8-bit deposit mask `01100101`, four bits would be
extracted to produce the result `0dc00b0a`.

In these examples, bit patterns have been shown in the little-endian order used by in the x86 architecture.

### Shuffle Vector Model for Parallel Deposit and Extract

For `pext`, the first argument is the bit vector from
which bits are to be extracted, the second argument
is a vector of all 0 bits, and the selection
vector has the indexes of desired bits. For the
8-bit extraction mask, `01100101`, the extraction
positions of the four bits are 6,5,2,0 (little-endian
bit numbering), so the 8-bit vector could be
`<0, 2, 5, 6, 8, 8, 8, 8>`
Here the selector 8 specifies the position of a
bit in the second vector, known to be 0.

For `pdep`, we again have the bit vector from which bits
are to be taken as the first argument, a constant 0
vector as the second argument, and an index vector
for deposits. In the example, the vector for deposits is
`<0, 8, 1, 8, 8, 2, 3, 8>`

The code generation problem is thus to recognize these
shufflevector patterns and generate the appropriate `pdep` or
`pext` calls.

The `pext` requirements are:

- the two input vector arguments are the same type, either <32 x i1> or <64 x i1>
- the second input vector argument is a constant 0 vector
- the selection pattern consists of two parts: a strictly increasing sequence of positions from the first argument (<0, 2, 5, 6> in the example), followed by a selectors from the second argument (<8, 8, 8, 8> in the example).

For `pdep`, only the selection pattern differs. In this case, the selection pattern must consist of consecutive
selectors (<0, 1, 2, ...>) from the first vector interspersed with canoncial selectors for zeroes from the second vector.

### LLVM Support for pdep and pext

From x86InstInfo.td, we can see that LLVM supports the pdep and pext instructions, but only when directly accessed through the Intel intrinsics.

multiclass bmi_pdep_pext<string mnemonic, RegisterClass RC, X86MemOperand x86memop, Intrinsic Int, PatFrag ld_frag> { def rr : I<0xF5, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2), !strconcat(mnemonic, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"), [(set RC:$dst, (Int RC:$src1, RC:$src2))]>, VEX_4V; def rm : I<0xF5, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, x86memop:$src2), !strconcat(mnemonic, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"), [(set RC:$dst, (Int RC:$src1, (ld_frag addr:$src2)))]>, VEX_4V; } let Predicates = [HasBMI2] in { defm PDEP32 : bmi_pdep_pext<"pdep{l}", GR32, i32mem, int_x86_bmi_pdep_32, loadi32>, T8XD; defm PDEP64 : bmi_pdep_pext<"pdep{q}", GR64, i64mem, int_x86_bmi_pdep_64, loadi64>, T8XD, VEX_W; defm PEXT32 : bmi_pdep_pext<"pext{l}", GR32, i32mem, int_x86_bmi_pext_32, loadi32>, T8XS; defm PEXT64 : bmi_pdep_pext<"pext{q}", GR64, i64mem, int_x86_bmi_pext_64, loadi64>, T8XS, VEX_W; }