= Bit-By-Bit Shuffle =
Bit-by-bit shuffle works with vectors of i1 values.
Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors.
For example the IDISA operation {{{simd<1>::mergel(v1, v2)}}} could be expressed as a shuffle vector
operation.
{{{
shufflevector <128 x i1> %v1, <128 x i1> %v2,
<128 x i1>
}}}
An architecture may not support shuffle at the bit-level, but bit-level merge can be
implemented using four byte-level shuffles combined with shifting and bitwise logic.
== Parallel Extract and Parallel Deposit ==
Intel's Haswell Architecture has two new {{{pext}}} and {{{pdep}}} that can
be used for flexible extraction and placement of bits.
These instructions work with either 32-bit or
64-bit general registers.
{{{pext}}} extracts bits from marked positions, producing a
compressed result.
For example, given an 8-bit value {{{hgfedcba}}} and an
8-bit extraction mask {{{01100101}}}, four bits would be
extracted to produce the result {{{0000gfca}}}.
{{{pext}}} operates in reverse, depositing bits at marked
positions, placing 0 elsewhere.
For example, given an 8-bit value {{{hgfedcba}}} and an
8-bit deposit mask {{{01100101}}}, four bits would be
extracted to produce the result {{{0dc00b0a}}}.
In these examples, bit patterns have been shown in
the little-endian order used by in the x86 architecture.
=== Shuffle Vector Model for Parallel Deposit and Extract ===
For {{{pext}}}, the first argument is the bit vector from
which bits are to be extracted, the second argument
is a vector of all 0 bits, and the selection
vector has the indexes of desired bits. For the
8-bit extraction mask, {{{01100101}}}, the extraction
positions of the four bits are 6,5,2,0 (little-endian
bit numbering), so the 8-bit vector could be
{{{<0, 2, 5, 6, 8, 8, 8, 8>}}}
Here the selector 8 specifies the position of a
bit in the second vector, known to be 0.
For {{{pdep}}}, we again have the bit vector from which bits
are to be taken as the first argument, a constant 0
vector as the second argument, and an index vector
for deposits. In the example, the vector for deposits is
{{{<0, 8, 1, 8, 8, 2, 3, 8>}}}
The code generation problem is thus to recognize these
shufflevector patterns and generate the appropriate {{{pdep}}} or
{{{pext}}} calls.
The {{{pext}}} requirements are:
a. the two input vector arguments are the same type, either <32 x i1> or <64 x i1>
a. the second input vector argument is a constant 0 vector
a. the selection pattern consists of two parts: a strictly increasing sequence of positions from the first argument (<0, 2, 5, 6> in the example), followed by a selectors from the second argument (<8, 8, 8, 8> in the example).
For {{{pdep}}}, only the selection pattern differs. In this case, the selection pattern must consist of consecutive
selectors (<0, 1, 2, ...>) from the first vector interspersed with canoncial selectors for zeroes from the second vector.
=== LLVM Support for pdep and pext ===
From x86InstInfo.td, we can see that LLVM supports the pdep and pext instructions,
but only when directly accessed through the Intel intrinsics.
{{{
multiclass bmi_pdep_pext {
def rr : I<0xF5, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2),
!strconcat(mnemonic, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set RC:$dst, (Int RC:$src1, RC:$src2))]>,
VEX_4V;
def rm : I<0xF5, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, x86memop:$src2),
!strconcat(mnemonic, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set RC:$dst, (Int RC:$src1, (ld_frag addr:$src2)))]>, VEX_4V;
}
let Predicates = [HasBMI2] in {
defm PDEP32 : bmi_pdep_pext<"pdep{l}", GR32, i32mem,
int_x86_bmi_pdep_32, loadi32>, T8XD;
defm PDEP64 : bmi_pdep_pext<"pdep{q}", GR64, i64mem,
int_x86_bmi_pdep_64, loadi64>, T8XD, VEX_W;
defm PEXT32 : bmi_pdep_pext<"pext{l}", GR32, i32mem,
int_x86_bmi_pext_32, loadi32>, T8XS;
defm PEXT64 : bmi_pdep_pext<"pext{q}", GR64, i64mem,
int_x86_bmi_pext_64, loadi64>, T8XS, VEX_W;
}
}}}