43 | | Bit interleave can be expressed by a shuffle mask with alternating bits from the two vectors. |
44 | | For example the IDISA operation {{{simd<1>::mergel(v1, v2)}}} could be expressed as a shuffle vector |
45 | | operation. |
46 | | {{{ |
47 | | shufflevector <128 x i1> %v1, <128 x i1> %v2, |
48 | | <128 x i1> <i32 0, i32 128, i32 1, i32 129, i32 2, i32 130, i32 3, i32 131, ...> |
49 | | }}} |
50 | | |
51 | | An architecture may not support shuffle at the bit-level, but bit-level merge can be |
52 | | implemented using four byte-level shuffles combined with shifting and bitwise logic. |
53 | | |
54 | | == Parallel Extract and Parallel Deposit == |
55 | | |
56 | | Intel's Haswell Architecture has two new {{{pext}}} and {{{pdep}}} that can |
57 | | be used for flexible extraction and placement of bits. |
58 | | These instructions work with either 32-bit or |
59 | | 64-bit general registers. |
60 | | |
61 | | {{{pext}}} extracts bits from marked positions, producing a |
62 | | compressed result. |
63 | | |
64 | | For example, given an 8-bit value {{{hgfedcba}}} and an |
65 | | 8-bit extraction mask {{{01100101}}}, four bits would be |
66 | | extracted to produce the result {{{0000gfca}}}. |
67 | | |
68 | | {{{pext}}} operates in reverse, depositing bits at marked |
69 | | positions, placing 0 elsewhere. |
70 | | |
71 | | For example, given an 8-bit value {{{hgfedcba}}} and an |
72 | | 8-bit deposit mask {{{01100101}}}, four bits would be |
73 | | extracted to produce the result {{{0dc00b0a}}}. |
74 | | |
75 | | In these examples, bit patterns have been shown in |
76 | | the little-endian order used by in the x86 architecture. |
77 | | |
78 | | Both operations can be modeled by shufflevector. |
79 | | |
80 | | For {{{pext}}}, the first argument is the bit vector from |
81 | | which bits are to be extracted, the second argument |
82 | | is a vector of all 0 bits, and the selection |
83 | | vector has the indexes of desired bits. For the |
84 | | 8-bit extraction mask, {{{01100101}}}, the extraction |
85 | | positions of the four bits are 6,5,2,0 (little-endian |
86 | | bit numbering), so the 8-bit vector could be |
87 | | {{{<0, 2, 5, 6, 8, 8, 8, 8>}}} |
88 | | Here the selector 8 specifies the position of a |
89 | | bit in the second vector, known to be 0. |
90 | | |
91 | | For {{{pdep}}}, we again have the bit vector from which bits |
92 | | are to be taken as the first argument, a constant 0 |
93 | | vector as the second argument, and an index vector |
94 | | for deposits. In the example, the vector for deposits is |
95 | | {{{<0, 2, 5, 6, 8, 8, 8, 8>}}} |
96 | | |
97 | | The code generation problem is thus to recognize these |
98 | | shufflevector patterns and generate the appropriate {{{pdep}}} or |
99 | | {{{pext}}} calls. |
100 | | |
101 | | The {{{pext}}} requirements are: |
102 | | a. the two input vector arguments are the same type, either<32 x i1> or <64 x i1> |
103 | | a. the second input vector argument is a constant 0 vector |
104 | | a. the selection pattern consists of two parts: a strictly increasing sequence of positions from the first argument (<0, 2, 5, 6> in the example), followed by a selectors from the second argument (<8, 8, 8, 8> in the example). |
105 | | |
106 | | For {{{pdep}}}, only the selection pattern differs. In this case, the selection pattern must consist of interspersed |
107 | | selectors (<0, 1, 2, ...>) from the first vector and arbitrary selectors from the second vector. |
108 | | |