Version 1 (modified by cameron, 3 years ago)

--

Horizontal SIMD Operations with LLVM

Chapter 5, "Horizontal Operations," of Christopher J. Hughes's book "Single-Instruction Multiple-Data Execution" (Margaret Martonosi (ed.), Synthesis Lectures on Computer Architecture, Morgan and Claypool, 2015) has an excellent introduction to and discussion of horizontal operations.

Broadcasts, Permutes and Shuffles

  • Broadcasts populate all fields of a vector with a single scalar value.
  • Permutes allow each field of an output vector to be taken from a specified position of a single input vector.
  • Shuffles allow fields to be chosen from two or more input vectors.

LLVM shufflevector

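The LLVM shufflevector instruction allows fully general shuffles of two input vectors to be expressed in LLVM IR. Its third operand is a constant mask of i32 indices, one per output field, that selects fields from the concatenation of the two inputs.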
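
As a minimal sketch of how the permute and shuffle cases described above might be written with shufflevector, the two functions below reverse one vector and interleave the low halves of two vectors. The function names @reverse4 and @interleave_lo and the choice of <4 x i32> are illustrative assumptions, not taken from the original page.

```llvm
; Permute: reverse the 4 fields of a single input vector.
; Mask indices 0..3 select from %v; the second operand is unused, hence undef.
define <4 x i32> @reverse4(<4 x i32> %v) {
entry:
  %r = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i32> %r
}

; Shuffle: interleave the low halves of two input vectors.
; Mask indices 0..3 select from %a and indices 4..7 select from %b,
; so the result is <a0, b0, a1, b1>.
define <4 x i32> @interleave_lo(<4 x i32> %a, <4 x i32> %b) {
entry:
  %r = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x i32> %r
}
```

In both cases the mask must be a compile-time constant, and the result has as many fields as the mask, which need not match the number of fields in the inputs.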

Broadcasts

Here is an LLVM function to broadcast a byte value to 16 positions in a 128-bit vector.

define <16 x i8> @broadcastbyte(i8 %a)  {
entry:
  %avec = bitcast i8 %a to <1 x i8>
  %r = shufflevector <1 x i8> %avec, <1 x i8> undef, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
  ret <16 x i8> %r
}

Because every mask index is 0, only the first input is ever read, so we pass undef as the second input to indicate that its value does not matter.

Compiling this on an x86 machine with SSE2 SIMD instructions is straightforward with LLVM's llc tool.

llc -filetype=asm  broadcastbyte.ll

The generated assembly code extracted from broadcastbyte.s uses a move plus 4 SSE2 horizontal operations.

broadcastbyte:                          # @broadcastbyte
	.cfi_startproc
# BB#0:                                 # %entry
	movd	%edi, %xmm0
	punpcklbw	%xmm0, %xmm0    # xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
	pshufd	$-60, %xmm0, %xmm0      # xmm0 = xmm0[0,1,0,3]
	pshuflw	$0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0,4,5,6,7]
	pshufhw	$0, %xmm0, %xmm0        # xmm0 = xmm0[0,1,2,3,4,4,4,4]
	retq

We can also see what happens when AVX instructions are made available with the -mattr=+avx option.

llc -filetype=asm -mattr=+avx broadcastbyte.ll

Now only a move plus 2 operations are required. The vpxor creates a vector of all zero values, which vpshufb then uses as its index vector, so every output byte is copied from byte 0 of xmm0.

broadcastbyte:                          # @broadcastbyte
	.cfi_startproc
# BB#0:                                 # %entry
	vmovd	%edi, %xmm0
	vpxor	%xmm1, %xmm1, %xmm1
	vpshufb	%xmm1, %xmm0, %xmm0
	retq
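
The broadcast above can also be written without the bitcast to <1 x i8>. The following is a sketch of a commonly used alternative form, assuming an insertelement into an undef vector followed by a shufflevector with an all-zero mask. The function name @broadcastbyte_splat is hypothetical, and the assembly generated for it may differ from the listings above depending on the LLVM version and target features.

```llvm
define <16 x i8> @broadcastbyte_splat(i8 %a) {
entry:
  ; Place %a in field 0 of an otherwise undefined 16-field vector.
  %ins = insertelement <16 x i8> undef, i8 %a, i32 0
  ; An all-zero mask copies field 0 of %ins into every output field.
  %r = shufflevector <16 x i8> %ins, <16 x i8> undef, <16 x i32> zeroinitializer
  ret <16 x i8> %r
}
```

Here zeroinitializer is shorthand for a mask of sixteen i32 0 values, which avoids spelling out the index list by hand.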

Horizontal Packing

Horizontal packing operations