
Version 3 (modified by cameron, 3 years ago)

Horizontal SIMD Operations with LLVM

Chapter 5, "Horizontal Operations," of Christopher J. Hughes's book Single-Instruction Multiple-Data Execution (Margaret Martonosi, ed., Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2015) gives an excellent introduction to and discussion of horizontal operations.

Broadcasts, Permutes and Shuffles

  • Broadcasts populate all fields of a vector with a single scalar value.
  • Permutes allow each field of an output vector to be taken from specified positions of an input vector.
  • Shuffles allow fields to be chosen from two or more input vectors.
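As a concrete illustration (not from Hughes's book), these three operation classes can be modeled in scalar C over 16-byte vectors; the helper names are ours:

```c
#include <stdint.h>

/* Broadcast: every output field receives the same scalar. */
static void broadcast16(uint8_t out[16], uint8_t a) {
    for (int i = 0; i < 16; i++) out[i] = a;
}

/* Permute: out[i] takes the field of in selected by idx[i]. */
static void permute16(uint8_t out[16], const uint8_t in[16],
                      const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++) out[i] = in[idx[i]];
}

/* Shuffle: indices 0..15 select fields of a, 16..31 select fields of b. */
static void shuffle16(uint8_t out[16], const uint8_t a[16],
                      const uint8_t b[16], const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] < 16) ? a[idx[i]] : b[idx[i] - 16];
}
```

A hardware SIMD unit performs each of these loops in a single instruction; the rest of this page shows how LLVM expresses them and what instructions result.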

LLVM shufflevector

The LLVM shufflevector instruction allows for fully general shuffles to be expressed in LLVM IR.

Broadcasts

Here is an LLVM function to broadcast a byte value to all 16 positions of a 128-bit vector.

define <16 x i8> @broadcastbyte(i8 %a)  {
entry:
  %avec = bitcast i8 %a to <1 x i8>
  %r = shufflevector <1 x i8> %avec, <1 x i8> undef, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
  ret <16 x i8> %r
}

Note that we use an undef second input to indicate that its value does not matter.

Compiling this on an x86 machine with SSE2 SIMD instructions is straightforward with LLVM's llc tool.

llc -filetype=asm  broadcastbyte.ll

The generated assembly code extracted from broadcastbyte.s uses a move plus 4 SSE2 horizontal operations.

broadcastbyte:                          # @broadcastbyte
	.cfi_startproc
# BB#0:                                 # %entry
	movd	%edi, %xmm0
	punpcklbw	%xmm0, %xmm0    # xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
	pshufd	$-60, %xmm0, %xmm0      # xmm0 = xmm0[0,1,0,3]
	pshuflw	$0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0,4,5,6,7]
	pshufhw	$0, %xmm0, %xmm0        # xmm0 = xmm0[0,1,2,3,4,4,4,4]
	retq

However, we can also see what happens if AVX instructions are available.

llc -filetype=asm -mattr=+avx broadcastbyte.ll

Now only a move plus 2 operations are required. The vpxor creates an all-zero vector, which vpshufb then uses as its byte-index operand.

broadcastbyte:                          # @broadcastbyte
	.cfi_startproc
# BB#0:                                 # %entry
	vmovd	%edi, %xmm0
	vpxor	%xmm1, %xmm1, %xmm1
	vpshufb	%xmm1, %xmm0, %xmm0
	retq
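The vpshufb step works because the instruction is a byte-wise table lookup: each byte of the index operand selects a byte of the source (or yields zero when its high bit is set), so an all-zero index replicates source byte 0 into every lane. A scalar C sketch of that semantics:

```c
#include <stdint.h>

/* Scalar model of (v)pshufb: out[i] = src[idx[i] & 15],
   or 0 if the high bit of idx[i] is set. */
static void pshufb_model(uint8_t out[16], const uint8_t src[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}
```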

Horizontal Packing: veven

As a simple example of a horizontal operation that extracts fields from two input vectors, Hughes defines the veven instruction, which selects only the even-numbered elements from each vector. The implementation with the LLVM shufflevector instruction is straightforward.

define <8 x i16> @veven(<8 x i16> %a, <8 x i16> %b)  {
entry:
  %t0 = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14> 
  ret <8 x i16> %t0
}
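In scalar terms, veven gathers the even-indexed elements of %a followed by the even-indexed elements of %b; a plain C model (the helper name is ours):

```c
#include <stdint.h>

/* Scalar model of veven on <8 x i16> inputs:
   out = a[0],a[2],a[4],a[6], b[0],b[2],b[4],b[6]. */
static void veven_model(uint16_t out[8], const uint16_t a[8],
                        const uint16_t b[8]) {
    for (int i = 0; i < 4; i++) {
        out[i]     = a[2 * i];
        out[i + 4] = b[2 * i];
    }
}
```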

Compiling to SSE2 code, we get the following.

veven:                                  # @veven
	.cfi_startproc
# BB#0:                                 # %entry
	pshuflw	$-24, %xmm1, %xmm1      # xmm1 = xmm1[0,2,2,3,4,5,6,7]
	pshufhw	$-24, %xmm1, %xmm1      # xmm1 = xmm1[0,1,2,3,4,6,6,7]
	pshufd	$-24, %xmm1, %xmm1      # xmm1 = xmm1[0,2,2,3]
	pshuflw	$-24, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3,4,5,6,7]
	pshufhw	$-24, %xmm0, %xmm0      # xmm0 = xmm0[0,1,2,3,4,6,6,7]
	pshufd	$-24, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3]
	punpcklqdq	%xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0]
	retq

But there is a better implementation using the SSE2 packuswb instruction!

declare <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16>, <8 x i16>)

define <8 x i16> @veven_sse2(<8 x i16> %a, <8 x i16> %b)  {
entry:
  %a0 = and <8 x i16> %a, bitcast (<1 x i128> <i128 1324055902416102970674609367438786815> to <8 x i16>)  ; 0x00FF in each i16 field
  %b0 = and <8 x i16> %b, bitcast (<1 x i128> <i128 1324055902416102970674609367438786815> to <8 x i16>)  ; 0x00FF in each i16 field
  %r0 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %a0, <8 x i16> %b0)
  %r1 = bitcast <16 x i8> %r0 to <8 x i16>
  ret <8 x i16> %r1
}

By masking off the high byte of each 16-bit field, we avoid the "saturation" of values in the conversion from 16 bits to 8 bits: packuswb clamps any word above 255 to 255, but after masking every field is at most 255, so each low byte passes through unchanged. The generated SSE2 code loads the mask constant once:

.LCPI0_0:
	.short	255                     # 0xff
	.short	255                     # 0xff
	.short	255                     # 0xff
	.short	255                     # 0xff
	.short	255                     # 0xff
	.short	255                     # 0xff
	.short	255                     # 0xff
	.short	255                     # 0xff
	.text
	.globl	veven_sse2
	.align	16, 0x90
	.type	veven_sse2,@function
veven_sse2:                             # @veven_sse2
	.cfi_startproc
# BB#0:                                 # %entry
	movdqa	.LCPI0_0(%rip), %xmm2   # xmm2 = [255,255,255,255,255,255,255,255]
	pand	%xmm2, %xmm0
	pand	%xmm2, %xmm1
	packuswb	%xmm1, %xmm0
	retq
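To see concretely why the masking matters, the saturating conversion that packuswb performs can be modeled in scalar C (a sketch of the instruction's documented behavior, not code from this page):

```c
#include <stdint.h>

/* Convert a signed 16-bit value to an unsigned byte with saturation,
   as packuswb does: below 0 becomes 0, above 255 becomes 255. */
static uint8_t saturate_u8(int16_t v) {
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Scalar model of packuswb: pack a's 8 words, then b's 8 words. */
static void packuswb_model(uint8_t out[16], const int16_t a[8],
                           const int16_t b[8]) {
    for (int i = 0; i < 8; i++) out[i]     = saturate_u8(a[i]);
    for (int i = 0; i < 8; i++) out[i + 8] = saturate_u8(b[i]);
}
```

Without the pand step, a word such as 0x1234 would saturate to 0xFF; with the high byte masked off, its low byte 0x34 survives the pack unchanged.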