# Changeset 253 for docs/ASPLOS09/asplos094-cameron.tex

Ignore:
Timestamp:
Jan 1, 2009, 4:53:06 PM (10 years ago)
Message:

Switch to SWAR consistently

File:
1 edited

Unmodified
Removed
• ## docs/ASPLOS09/asplos094-cameron.tex

 r252 processors in high-performance text processing applications such as UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression matching.   Direct architectural support for these algorithms in future SIMD matching.   Direct architectural support for these algorithms in future SWAR instruction sets could further increase performance as well as simplifying the programming task. A set of simple SWAR instruction set extensions are proposed significantly reduce instruction count in core parallel bit stream algorithms, often providing a 3X or better improvement.  The extensions are also shown to be useful for SIMD programming in other application areas, including providing a for SWAR programming in other application areas, including providing a systematic treatment for horizontal operations.  An implementation model for these extensions involves relatively simple circuitry added to the operand fetch components in a pipelined processor. these extensions involves relatively simple circuitry added to the operand fetch components in a pipelined processor. \end{abstract} In this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes).  Loading bit stream data into 128-bit SIMD registers, units (bytes).  Loading bit stream data into 128-bit registers, then, allows data from 128 consecutive code units to be represented and processed at once.  Bitwise logic and shift operations, bit scans, In application to UTF-8 to UTF-16 transcoding, a 3X to 25X speed-up is achieved in using parallel bit stream techniques on SIMD-capable uniprocessors is achieved in using parallel bit stream techniques on SWAR-capable uniprocessors employing the SSE or Altivec instruction sets\cite{PPoPP08}. In the broader context of XML parsing, further applications of these further enhancing this route to parallelization of text processing? This paper addresses this question through presentation and analysis of a constructive proposal: a set of SIMD instruction set analysis of a constructive proposal: a set of SWAR instruction set features based on the principle of systematic support for inductive doubling algorithms.  Inductive doubling refers occur in other applications as well.  In related work, efficient automatic interleaving based on power-of-2 strides has been reported quite useful for a number of SIMD kernels \cite{Nuzman}. reported quite useful for a number of SWAR kernels \cite{Nuzman}. The specific features presented herein will be referred to as IDISA: inductive doubling instruction set architecture. The remainder of this paper is organized as follows. The second section of this paper introduces IDISA and the SIMD notation used throughout this paper.  The third SWAR notation used throughout this paper.  The third section moves on to discuss an evaluation methodology for IDISA in comparison to two reference architectures \section{Inductive Doubling Architecture} This section presents an idealized model for a single-instruction multiple-data (SIMD) instruction set architecture designed specifically to support inductive doubling algorithms in the domain of parallel bit stream programming.   The architecture is idealized This section presents an idealized model for a SWAR instruction set architecture designed specifically to support inductive doubling algorithms in the domain of parallel bit stream programming. The architecture is idealized in the sense that we concentrate on only the necessary features for our purpose, without enumerating the additional operations that would be required for SIMD applications in other domains.  The goal is to focus SWAR applications in other domains.  The goal is to focus on the principles of inductive doubling support in a way that can accommodate a variety of realizations as other design constraints are brought to bear on the overall instruction set design.  First we introduce a simple model and notation for SIMD operations in general and then present the four key SWAR operations in general and then present the four key features of an idealized architecture in support of parallel bit stream programming. The idealized architecture supports typical SIMD integer The idealized architecture supports typical SWAR integer operations common to existing commodity architectures such as SSE and Altivec.   The architecture is defined to support SIMD and Altivec.   The architecture is defined to support SWAR operations on registers of size $N=2^K$ bits, for some integer $K$. Typical values of $K$ for commodity processors include $K=6$ for partitioned into $N/n$ fields of width $n=2^k$ bits for some values of $k \leq K$.   Typical values of $k$ used on commodity processors include $k = 3$ for SIMD operations on 8-bit fields (bytes), include $k = 3$ for SWAR operations on 8-bit fields (bytes), $k = 4$ for operations on 16-bit fields and $k = 5$ for operations on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit register $r$, using big-endian numbering. Throughout this paper, we focus on SWAR implementations using the following {\em three-register model}, although it is also possible to define IDISA extensions for a two-register model of the SSE instruction set. Let \verb:simd: represent the class of SIMD operations defined Throughout this paper, we focus on SWAR implementations using a {\em three-register model} involving two input registers and one output register, each of size $N=2^K$ bits. Let \verb:simd: represent the class of SWAR operations defined on fields of size $n$ using C++ template syntax.  Given a binary function $F_n$ on $n$-bit fields, we denote the SIMD binary function $F_n$ on $n$-bit fields, we denote the SWAR version of this operation as \verb#simd::F#.  Given two SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$, registers \verb:a: and \verb:b: holding values $a$ and $b$, respectively, the operation \verb#r=simd::F(a,b)# stores the value $r$ in the register \verb:r: as determined by \end{eqnarray} %\doublespace The SSE and Altivec instruction sets support each of the addition and subtraction operations in SIMD form each of the addition and subtraction operations in SWAR form for 8, 16 and 32-bit fields, while the SSE set also includes the 64-bit forms.  For example, \verb#simd<8>::add# in our we assume are also available.  For example, \verb#simd<1>::add# is equivalent to \verb:simd_xor:, the bitwise exclusive-or operation on SIMD registers. bitwise exclusive-or operation. The second key facility of the inductive doubling architecture is the potential application of half-operand modifiers to the fields of either or both of the operands of a SIMD the fields of either or both of the operands of a SWAR operation.  These modifiers select either the low $n/2$ This completes the description of IDISA.  As described, many of the features are already available with the SIMD facilities of features are already available with the SWAR facilities of existing commodity processors.  The extensions enumerated here are relatively straightforward.  The innovation as a built-in instruction on some processors.  For example, the SPU of the Cell Broadband Engine has a SIMD population count instruction \verb:si_cntb: for simultaneously has a SWAR population count instruction \verb:si_cntb: for simultaneously determining the number of 1 bits within each byte of a 16-byte register. We then go on to show how the transposition problem can be solved using IDISA-A or IDISA-B with a mere 24 three-register SIMD operations.  We also show with a mere 24 three-register SWAR operations.  We also show that this is optimal for any three-register instruction set model. outputs are each SWAR registers of size $N=2^K$ bits. The input consists of $N$ bytes of serial byte data, stored consecutively in eight SIMD registers each holding stored consecutively in eight SWAR registers each holding $N/8$ bytes.  The output consists of eight parallel registers, one each for the eight individual bit positions takes advantage of SWAR bit gathering operations that exist on some architectures.  This operation gathers one bit per byte from a particular position within each byte of a SIMD register. from a particular position within each byte of a register. For example, the {\tt pmovmskb} operation of the Intel SSE instruction set forms a 16-bit mask consisting of the high bit of each byte.  Similarly, the {\tt \verb:si_gbb:} operation of the synergistic processing units of the Cell Broadband Engine gathers together the low bit of each byte from the SIMD register. Cell Broadband Engine gathers together the low bit of each byte. % Figure \ref{gather} illustrates the % bit gathering process. Here we show that the inductive halving algorithm presented in the previous subsection is optimal in the following sense: no other algorithm on any 3-register SIMD architecture can use no other algorithm on any 3-register SWAR architecture can use fewer than 24 operations to transform eight serial registers of byte stream data into eight parallel registers of bit stream data. By 3-register SIMD architecture, we refer to any architecture that uses SIMD instructions consistent with our overall model of By 3-register SWAR architecture, we refer to any architecture that uses SWAR instructions consistent with our overall model of binary operations using two input register operands to produce one output register value. Algorithms for performing the inverse transform mirror those of the forward transform, employing SIMD merge operations of the forward transform, employing SWAR merge operations in place of pack operations.  The best algorithm known to us on the commodity SIMD architectures takes advantage to us on the commodity SWAR architectures takes advantage of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel# operations that are available with each of the SSE and Altivec instruction The total preprocessing cost is $4k$ for $k$ steps of deletion by central result induction versus $4k^2$ for the parallel-prefix method.  Using the computed deletion operation, only a single SIMD rotate operation per bit stream deletion operation, only a single SWAR rotate operation per bit stream per level is needed, in comparison with 4 operations per level for parallel-prefix compress. is a further application of parallel bit deletion. \subsection{Systematic Support for Horizontal SIMD Operations} In SIMD parlance, {\em horizontal} operations are \subsection{Systematic Support for Horizontal SWAR Operations} In SWAR parlance, {\em horizontal} operations are operations which combine values from two or more fields of the same register, in contrast to the normal before or after the horizontal operation and what the nature of the vertical combining operation is. Within this space, commodity SIMD architectures tend Within this space, commodity SWAR architectures tend to support only a very few combinations, without any particular attempt at systematic support for horizontal offsetting the circuit complexity of half-operand modifiers with potential elimination of dedicated logic for some {/ad hoc} horizontal SIMD operations. logic for some {/ad hoc} horizontal SWAR operations. Even if legacy support for these operations is required, it may be possible to provide that support through inductive doubling, its inclusion with the IDISA libraries has provided a suitable basis for portable SIMD implementations of parallel bit stream algorithms. SWAR implementations of parallel bit stream algorithms. Beyond this, the IDISA libraries have the additional benefit of allowing the implementation Figure \ref{pipeline-model} shows a block diagram for a pipelined SIMD processor implementing IDISA. The SIMD Register File (SRF) provides a file of $R = 2^A$ for a pipelined SWAR processor implementing IDISA. The SWAR Register File (SRF) provides a file of $R = 2^A$ registers each of width $N = 2^K$ bits. IDISA instructions identified by the Instruction Fetch Unit (IFU) are forwarded for decoding to the SIMD Unit (IFU) are forwarded for decoding to the SWAR Instruction Decode Unit (SIDU).  This unit decodes the instruction to produce signals identifying the source and destination operand registers, the half-operand modifiers, the field width specification and the SIMD operation field width specification and the SWAR operation to be applied. The SIDU supplies the source register information and the half-operand modifier information to the SIMD Operand Fetch Unit (SOFU). modifier information to the SWAR Operand Fetch Unit (SOFU). For each source operand, the SIDU provides an $A$-bit register address and two 1-bit signals $h$ and $l$ indicating the value The SIDU also supplies decoded field width signals $w_k$ for each field width $2^k$ to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).  Only one of the SWAR Instruction Execute Unit (SIEU).  Only one of the field width signals has the value 1. The SIDU also supplies decoded SIMD opcode information to SIEU and The SIDU also supplies decoded SWAR opcode information to SIEU and a decoded $A$-bit register address for the destination register to the SIMD Result Write Back Unit (SRWBU). the SWAR Result Write Back Unit (SRWBU). The SOFU is the key component of the IDISA model that logic as specified by the $h$, $l$, and field-width signals.  The possibly modified operand values are then provided to the SIEU for carrying out the SIMD operations. provided to the SIEU for carrying out the SWAR operations. A detailed model of SOFU logic is described in the following subsection. The SIEU differs from similar execution units in current commodity processors primarily by providing SIMD operations at each field width SWAR operations at each field width $n=2^k$ for $0 \leq k \leq K$.  This involves additional circuitry for field widths not supported IDISA sequences.  These include examples such as population count, count leading and/or trailing zeroes and parity. They also include specialized horizontal SIMD operations. They also include specialized horizontal SWAR operations. Thus, design tradeoffs can be made with the potential of reducing the chip area devoted to special purpose instructions In considering the architectural support for SIMD text processing using the method of parallel bit streams, SWAR text processing using the method of parallel bit streams, this paper has presented the principle of inductive doubling and a realization of that principle in the form of
Note: See TracChangeset for help on using the changeset viewer.