Version 2 (modified by cameron, 5 years ago)

--

Eliminating the SSE2 operations from llvm-re.ll

In this exercise, we explore the possibility of replacing SSE2 operations with equivalent LLVM IR operations in order to make a program more portable (able to be mapped to different architectures).

One approach is to replace each call to an SSE2 operation with a call to a routine that emulates it entirely using LLVM operations. However, this approach may add complexity and hurt performance, because a general-purpose emulation can be much more complex than what a specific use of the operation actually requires.

Most of the SSE2 operations in the llvm-re.ll file can in fact be replaced by a single LLVM operation, plus some rearrangement and bitcasting.

Eliminating Shifts

The various SSE2 shift operations can generally be replaced by a single LLVM lshr or shl operation.

In general, SSE2 shift operations apply a single shift value to all fields in a vector. LLVM shifts are more capable in that they can shift each field by a different amount. To shift every field by the same amount, we simply need a vector whose elements all hold that value.

Here is an example modification to llvm-re.ll.

   %2129 = call <2 x i64> @llvm.x86.sse2.psrli.q(<2 x i64> %2128, i32 32) #2

becomes

   %2129 = lshr <2 x i64> %2128, <i64 32, i64 32>
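The equivalence can be checked with a small reference model. The following Python sketch is illustrative only: the function names are hypothetical, vectors are modeled as lists of unsigned 64-bit integers, and the poison result that LLVM gives for a shift amount of 64 or more is represented by None. It also shows the one case where the two forms differ: psrli.q with a count above 63 yields zero, while lshr by such an amount does not produce a usable value.

```python
MASK64 = (1 << 64) - 1

def psrli_q(v, count):
    """Model of sse2.psrli.q: one count for all lanes; a count above 63 gives 0."""
    return [0 if count > 63 else (x & MASK64) >> count for x in v]

def lshr_v2i64(v, amounts):
    """Model of `lshr <2 x i64>`: one amount per lane; 64 or more is poison (None)."""
    return [None if n >= 64 else (x & MASK64) >> n for x, n in zip(v, amounts)]

def splat(n, lanes=2):
    """Build the constant vector <i64 n, i64 n> used as the shift operand."""
    return [n] * lanes

v = [0x0123456789ABCDEF, 0xFFFFFFFF00000000]
same = psrli_q(v, 32) == lshr_v2i64(v, splat(32))   # the replacement shown above
per_lane = lshr_v2i64(v, [4, 8])                    # something psrli.q cannot express
```

So the rewrite is safe whenever the shift count is a constant below the field width, as it is in the example.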

Eliminating sse2.pmovmskb

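The SSE2 operation sse2.pmovmskb extracts the high bit of each byte of a 16-byte vector, returning an integer whose low 16 bits are those high bits.

The LLVM IR instruction icmp slt can be used to implement this in a fairly straightforward manner: comparing each byte against zero as a signed value yields a <16 x i1> vector whose elements are exactly the high bits, and a bitcast to i16 packs them into an integer. In general, the input may first need to be bitcast to <16 x i8>, but you may not need this if it already has that type.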

Here is an example modification to llvm-re.ll.

   %482 = call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %481) #2
   %483 = trunc i32 %482 to i16

becomes

   %482 = icmp slt <16 x i8> %481, zeroinitializer
   %483 = bitcast <16 x i1> %482 to i16
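As a sanity check on the bit ordering, here is a minimal Python model of both forms. The function names are hypothetical, bytes are modeled as integers in the range 0 to 255, and the bitcast is assumed to follow the little-endian layout used on x86, where vector element 0 becomes bit 0 of the integer.

```python
def pmovmskb(v):
    """Model of sse2.pmovmskb: bit i of the result is the high bit of byte i."""
    result = 0
    for i, b in enumerate(v):
        result |= ((b >> 7) & 1) << i
    return result

def icmp_slt_zero(v):
    """Model of `icmp slt <16 x i8> %v, zeroinitializer` (signed compare per byte)."""
    signed = [b - 256 if b >= 128 else b for b in v]
    return [1 if s < 0 else 0 for s in signed]

def bitcast_v16i1_to_i16(bits):
    """Model of `bitcast <16 x i1> to i16`, assuming element 0 maps to bit 0."""
    return sum(bit << i for i, bit in enumerate(bits))

# Bytes 0, 2 and 15 have their high bit set.
v = [0x80, 0x7F, 0xFF, 0x00] + [0x00] * 11 + [0x80]
mask_sse = pmovmskb(v)
mask_ir = bitcast_v16i1_to_i16(icmp_slt_zero(v))
```

Both paths give the same 16-bit mask, which is why the trunc to i16 in the original code can be dropped.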

Eliminating sse2.packuswb

This operation is used to extract bytes from words. In general, it performs *unsigned saturation*, clamping each 16-bit word to the range 0 to 255, rather than simply extracting the low byte. But in the Parabix application, it is only used when the high byte of each word is already known to be 0, so saturation never changes the value. In this case, a single shufflevector instruction that selects the even-numbered bytes (the low byte of each word on little-endian x86) can be used. Once again, the types may not match, but bitcasting can overcome this.

Here is an example modification to llvm-re.ll.

   %94 = bitcast <2 x i64> %93 to <8 x i16>
   %95 = bitcast <2 x i64> %92 to <8 x i16>
   %96 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %94, <8 x i16> %95) #2

becomes

   %94 = bitcast <2 x i64> %93 to <16 x i8>
   %95 = bitcast <2 x i64> %92 to <16 x i8>
   %96 = shufflevector <16 x i8> %94, <16 x i8> %95, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
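The precondition matters, and a small Python model makes it concrete. This is a sketch with hypothetical function names: words are modeled as unsigned 16-bit integers, and the bitcast to bytes assumes little-endian layout. The two forms agree when every high byte is 0 and disagree otherwise.

```python
def packuswb(a, b):
    """Model of sse2.packuswb: treat each word as signed i16, clamp to 0..255."""
    def saturate(w):
        signed = w - 0x10000 if w >= 0x8000 else w
        return max(0, min(255, signed))
    return [saturate(w) for w in a + b]

def words_to_bytes(words):
    """Model of `bitcast <8 x i16> to <16 x i8>` on little-endian: low byte first."""
    out = []
    for w in words:
        out += [w & 0xFF, (w >> 8) & 0xFF]
    return out

def shuffle_even(a_bytes, b_bytes):
    """Model of shufflevector with mask 0, 2, ..., 30 over both inputs."""
    both = a_bytes + b_bytes
    return [both[i] for i in range(0, 32, 2)]

def pack_by_shuffle(a, b):
    return shuffle_even(words_to_bytes(a), words_to_bytes(b))

# High bytes all zero: the shuffle is an exact replacement.
a = [0x0041, 0x00FF, 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005]
b = [0x0080, 0x007F, 0x0010, 0x0020, 0x0030, 0x0040, 0x0050, 0x0060]
agree = packuswb(a, b) == pack_by_shuffle(a, b)

# A word above 255 or a negative word: saturation and truncation differ.
bad = [0x0141, 0xFFFF] + [0] * 6
zeros = [0] * 8
saturated = packuswb(bad, zeros)[:2]
truncated = pack_by_shuffle(bad, zeros)[:2]
```

An automatic conversion would therefore need to prove, or be told, that the high bytes are zero before substituting the shuffle.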

General Question

Can we build software tools that automatically identify such conversions, replacing SSE2 operations with LLVM IR without degrading performance?

Can there be cases in which performance may actually be improved?
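As a starting point for experimenting with the first question, the shift conversion can be prototyped as a purely textual rewrite. The sketch below is an assumption-laden illustration, not an existing tool: the function name and the regular expression are hypothetical, it handles only sse2.psrli.q calls with a constant count written in the form used in llvm-re.ll, and it leaves counts above 63 alone because lshr is not defined for them. It says nothing about performance, which would still have to be measured.

```python
import re

# Matches lines such as:
#   %2129 = call <2 x i64> @llvm.x86.sse2.psrli.q(<2 x i64> %2128, i32 32) #2
PSRLI_Q = re.compile(
    r"^(?P<indent>\s*)(?P<dst>%[\w.]+) = call <2 x i64> "
    r"@llvm\.x86\.sse2\.psrli\.q\(<2 x i64> (?P<src>%[\w.]+), "
    r"i32 (?P<n>\d+)\)(?: #\d+)?\s*$"
)

def rewrite_line(line):
    """Rewrite one psrli.q call into lshr; return other lines unchanged."""
    m = PSRLI_Q.match(line)
    if m is None or int(m.group("n")) > 63:
        return line
    n = m.group("n")
    return "{}{} = lshr <2 x i64> {}, <i64 {}, i64 {}>".format(
        m.group("indent"), m.group("dst"), m.group("src"), n, n
    )

before = "   %2129 = call <2 x i64> @llvm.x86.sse2.psrli.q(<2 x i64> %2128, i32 32) #2"
after = rewrite_line(before)
```

A more robust tool would work on the parsed IR rather than on text, for example as an LLVM pass, so that it could also handle cases that need type analysis such as the packuswb precondition.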