Changeset 253


Ignore:
Timestamp:
Jan 1, 2009, 4:53:06 PM (11 years ago)
Author:
cameron
Message:

Switch to SWAR consistently

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/ASPLOS09/asplos094-cameron.tex

    r252 r253  
    4242processors in high-performance text processing applications such as
    4343UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
    44 matching.   Direct architectural support for these algorithms in future SIMD
     44matching.   Direct architectural support for these algorithms in future SWAR
    4545instruction sets could further increase performance as well as simplifying the
    4646programming task. A set of simple SWAR instruction set extensions are proposed
     
    4949significantly reduce instruction count in core parallel bit stream algorithms,
    5050often providing a 3X or better improvement.  The extensions are also shown to be useful
    51 for SIMD programming in other application areas, including providing a
     51for SWAR programming in other application areas, including providing a
    5252systematic treatment for horizontal operations.  An implementation model for
    53 these extensions
    54 involves relatively simple circuitry added to the operand fetch components
    55 in a pipelined processor.
     53these extensions involves relatively simple circuitry added to the operand fetch
     54components in a pipelined processor.
    5655\end{abstract}
    5756
     
    8685In this method, byte-oriented character data is first transposed to eight
    8786parallel bit streams, one for each bit position within the character code
    88 units (bytes).  Loading bit stream data into 128-bit SIMD registers,
     87units (bytes).  Loading bit stream data into 128-bit registers,
    8988then, allows data from 128 consecutive code units to be represented and
    9089processed at once.  Bitwise logic and shift operations, bit scans,
     
    9392
    9493In application to UTF-8 to UTF-16 transcoding, a 3X to 25X speed-up
    95 is achieved in using parallel bit stream techniques on SIMD-capable uniprocessors
     94is achieved in using parallel bit stream techniques on SWAR-capable uniprocessors
    9695employing the SSE or Altivec instruction sets\cite{PPoPP08}.
    9796In the broader context of XML parsing, further applications of these
     
    116115further enhancing this route to parallelization of text processing?
    117116This paper addresses this question through presentation and
    118 analysis of a constructive proposal: a set of SIMD instruction set
     117analysis of a constructive proposal: a set of SWAR instruction set
    119118features based on the principle of systematic support for
    120119inductive doubling algorithms.  Inductive doubling refers
     
    128127occur in other applications as well.  In related work,
    129128efficient automatic interleaving based on power-of-2 strides has been
    130 reported quite useful for a number of SIMD kernels \cite{Nuzman}.
     129reported quite useful for a number of SWAR kernels \cite{Nuzman}.
    131130The specific features presented herein will be referred to
    132131as IDISA: inductive doubling instruction set architecture.
     
    134133The remainder of this paper is organized as follows.
    135134The second section of this paper introduces IDISA and the
    136 SIMD notation used throughout this paper.  The third
     135SWAR notation used throughout this paper.  The third
    137136section moves on to discuss an evaluation methodology
    138137for IDISA in comparison to two reference architectures
     
    153152\section{Inductive Doubling Architecture}
    154153
    155 This section presents an idealized model for a single-instruction
    156 multiple-data (SIMD) instruction set architecture designed
    157 specifically to support inductive doubling algorithms in the
    158 domain of parallel bit stream programming.   The architecture is idealized
     154This section presents an idealized model for a SWAR instruction set
     155architecture designed specifically to support inductive doubling
     156algorithms in the domain of parallel bit stream programming.   
     157The architecture is idealized
    159158in the sense that we concentrate on only the necessary features
    160159for our purpose, without enumerating the additional
    161160operations that would be required for
    162 SIMD applications in other domains.  The goal is to focus
     161SWAR applications in other domains.  The goal is to focus
    163162on the principles of inductive doubling support in a way
    164163that can accommodate a variety of realizations as other
    165164design constraints are brought to bear on the overall instruction set
    166165design.  First we introduce a simple model and notation for
    167 SIMD operations in general and then present the four key
     166SWAR operations in general and then present the four key
    168167features of an idealized architecture in support of parallel
    169168bit stream programming.
    170169
    171 The idealized architecture supports typical SIMD integer
     170
     171The idealized architecture supports typical SWAR integer
    172172operations common to existing commodity architectures such as SSE
    173 and Altivec.   The architecture is defined to support SIMD
     173and Altivec.   The architecture is defined to support SWAR
    174174operations on registers of size $N=2^K$ bits, for some integer $K$.
    175175Typical values of $K$ for commodity processors include $K=6$ for
     
    179179partitioned into $N/n$ fields of width $n=2^k$ bits for some values
    180180of $k \leq K$.   Typical values of $k$ used on commodity processors
    181 include $k = 3$ for SIMD operations on 8-bit fields (bytes),
     181include $k = 3$ for SWAR operations on 8-bit fields (bytes),
    182182$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
    183183on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit
     
    186186register $r$, using big-endian numbering.
    187187
    188 Throughout this paper, we focus on
    189 SWAR implementations using the following {\em
    190 three-register model}, although it is also possible to define
    191 IDISA extensions for a two-register model of the SSE instruction set.
    192 Let \verb:simd<n>: represent the class of SIMD operations defined
     188Throughout this paper, we focus on SWAR implementations using a {\em
     189three-register model} involving two input registers
     190and one output register, each of size $N=2^K$ bits.
     191Let \verb:simd<n>: represent the class of SWAR operations defined
    193192on fields of size $n$ using C++ template syntax.  Given a
    194 binary function $F_n$ on $n$-bit fields, we denote the SIMD
     193binary function $F_n$ on $n$-bit fields, we denote the SWAR
    195194version of this operation as \verb#simd<n>::F#.  Given two
    196 SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
     195registers \verb:a: and \verb:b: holding values $a$ and $b$,
    197196respectively, the operation \verb#r=simd<n>::F(a,b)# stores
    198197the value $r$ in the register \verb:r: as determined by
     
    211210\end{eqnarray}
    212211%\doublespace
     212
    213213The SSE and Altivec instruction sets support
    214 each of the addition and subtraction operations in SIMD form
     214each of the addition and subtraction operations in SWAR form
    215215for 8, 16 and 32-bit fields, while the SSE set also includes
    216216the 64-bit forms.  For example, \verb#simd<8>::add# in our
     
    261261we assume are also available.  For example,
    262262\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
    263 bitwise exclusive-or operation on SIMD registers.
     263bitwise exclusive-or operation.
    264264
    265265The second key facility of the inductive doubling architecture
    266266is the potential application of half-operand modifiers to
    267 the fields of either or both of the operands of a SIMD
     267the fields of either or both of the operands of a SWAR
    268268operation.  These modifiers select either the
    269269low $n/2$
     
    337337
    338338This completes the description of IDISA.  As described, many of the
    339 features are already available with the SIMD facilities of
     339features are already available with the SWAR facilities of
    340340existing commodity processors.  The extensions enumerated
    341341here are relatively straightforward.  The innovation
     
    467467as a built-in instruction
    468468on some processors.  For example, the SPU of the Cell Broadband Engine
    469 has a SIMD population count instruction \verb:si_cntb: for simultaneously
     469has a SWAR population count instruction \verb:si_cntb: for simultaneously
    470470determining the
    471471number of 1 bits within each byte of a 16-byte register.
     
    548548We then go on to show how the transposition problem
    549549can be solved using IDISA-A or IDISA-B
    550 with a mere 24 three-register SIMD operations.  We also show
     550with a mere 24 three-register SWAR operations.  We also show
    551551that this is optimal for any three-register instruction set model.
    552552
     
    563563outputs are each SWAR registers of size $N=2^K$ bits.
    564564The input consists of $N$ bytes of serial byte data,
    565 stored consecutively in eight SIMD registers each holding
     565stored consecutively in eight SWAR registers each holding
    566566$N/8$ bytes.  The output consists of eight parallel
    567567registers, one each for the eight individual bit positions
     
    583583takes advantage of SWAR bit gathering operations that exist
    584584on some architectures.  This operation gathers one bit per byte
    585 from a particular position within each byte of a SIMD register.
     585from a particular position within each byte of a register.
    586586For example, the {\tt pmovmskb} operation of the Intel
    587587SSE instruction set forms a 16-bit mask
    588588consisting of the high bit of each byte.  Similarly, the
    589589{\tt \verb:si_gbb:} operation of the synergistic processing units of the
    590 Cell Broadband Engine gathers together the low bit of each byte
    591 from the SIMD register. 
     590Cell Broadband Engine gathers together the low bit of each byte.
    592591% Figure \ref{gather} illustrates the
    593592% bit gathering process.
     
    763762Here we show that the inductive halving algorithm presented in
    764763the previous subsection is optimal in the following sense:
    765 no other algorithm on any 3-register SIMD architecture can use
     764no other algorithm on any 3-register SWAR architecture can use
    766765fewer than 24 operations to transform eight serial registers
    767766of byte stream data into eight parallel registers of bit stream data.
    768 By 3-register SIMD architecture, we refer to any architecture
    769 that uses SIMD instructions consistent with our overall model of
     767By 3-register SWAR architecture, we refer to any architecture
     768that uses SWAR instructions consistent with our overall model of
    770769binary operations using two input register operands to produce
    771770one output register value.
     
    810809
    811810Algorithms for performing the inverse transform mirror those
    812 of the forward transform, employing SIMD merge operations
     811of the forward transform, employing SWAR merge operations
    813812in place of pack operations.  The best algorithm known
    814 to us on the commodity SIMD architectures takes advantage
     813to us on the commodity SWAR architectures takes advantage
    815814of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
    816815operations that are available with each of the SSE and Altivec instruction
     
    986985The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
    987986induction versus $4k^2$ for the parallel-prefix method.  Using the computed
    988 deletion operation, only a single SIMD rotate operation per bit stream
     987deletion operation, only a single SWAR rotate operation per bit stream
    989988per level is needed, in comparison with 4 operations per level for parallel-prefix
    990989compress.
     
    11481147is a further application of parallel bit deletion.
    11491148
    1150 \subsection{Systematic Support for Horizontal SIMD Operations}
    1151 
    1152 In SIMD parlance, {\em horizontal} operations are
     1149\subsection{Systematic Support for Horizontal SWAR Operations}
     1150
     1151In SWAR parlance, {\em horizontal} operations are
    11531152operations which combine values from two or more fields
    11541153of the same register, in contrast to the normal
     
    11741173before or after the horizontal operation and what the
    11751174nature of the vertical combining operation is.
    1176 Within this space, commodity SIMD architectures tend
     1175Within this space, commodity SWAR architectures tend
    11771176to support only a very few combinations, without any
    11781177particular attempt at systematic support for horizontal
     
    12221221offsetting the circuit complexity of half-operand
    12231222modifiers with potential elimination of dedicated
    1224 logic for some {/ad hoc} horizontal SIMD operations.
     1223logic for some {/ad hoc} horizontal SWAR operations.
    12251224Even if legacy support for these operations is required,
    12261225it may be possible to provide that support through
     
    12691268inductive doubling, its inclusion with the IDISA
    12701269libraries has provided a suitable basis for portable
    1271 SIMD implementations of parallel bit stream algorithms.
     1270SWAR implementations of parallel bit stream algorithms.
    12721271Beyond this, the IDISA libraries have the additional
    12731272benefit of allowing the implementation
     
    12861285
    12871286Figure \ref{pipeline-model} shows a block diagram
    1288 for a pipelined SIMD processor implementing IDISA.
    1289 The SIMD Register File (SRF) provides a file of $R = 2^A$
     1287for a pipelined SWAR processor implementing IDISA.
     1288The SWAR Register File (SRF) provides a file of $R = 2^A$
    12901289registers each of width $N = 2^K$ bits. 
    12911290IDISA instructions identified by the Instruction Fetch
    1292 Unit (IFU) are forwarded for decoding to the SIMD
     1291Unit (IFU) are forwarded for decoding to the SWAR
    12931292Instruction Decode Unit (SIDU).  This unit decodes
    12941293the instruction to produce
    12951294signals identifying the source and destination
    12961295operand registers, the half-operand modifiers, the
    1297 field width specification and the SIMD operation
     1296field width specification and the SWAR operation
    12981297to be applied.
    12991298
    13001299The SIDU supplies the source register information and the half-operand
    1301 modifier information to the SIMD Operand Fetch Unit (SOFU).
     1300modifier information to the SWAR Operand Fetch Unit (SOFU).
    13021301For each source operand, the SIDU provides an $A$-bit register
    13031302address and two 1-bit signals $h$ and $l$ indicating the value
     
    13071306The SIDU also supplies decoded field width signals $w_k$
    13081307for each field width $2^k$ to both the SOFU and to the
    1309 SIMD Instruction Execute Unit (SIEU).  Only one of the
     1308SWAR Instruction Execute Unit (SIEU).  Only one of the
    13101309field width signals has the value 1.
    1311 The SIDU also supplies decoded SIMD opcode information to SIEU and
     1310The SIDU also supplies decoded SWAR opcode information to SIEU and
    13121311a decoded $A$-bit register address for the destination register to
    1313 the SIMD Result Write Back Unit (SRWBU).
     1312the SWAR Result Write Back Unit (SRWBU).
    13141313
    13151314The SOFU is the key component of the IDISA model that
     
    13221321logic as specified by the $h$, $l$, and field-width
    13231322signals.  The possibly modified operand values are then
    1324 provided to the SIEU for carrying out the SIMD operations.
     1323provided to the SIEU for carrying out the SWAR operations.
    13251324A detailed model of SOFU logic is described in the following
    13261325subsection.
     
    13281327The SIEU differs from similar execution units in
    13291328current commodity processors primarily by providing
    1330 SIMD operations at each field width
     1329SWAR operations at each field width
    13311330$n=2^k$ for $0 \leq k \leq K$.  This involves
    13321331additional circuitry for field widths not supported
     
    14591458IDISA sequences.  These include examples such as population
    14601459count, count leading and/or trailing zeroes and parity.
    1461 They also include specialized horizontal SIMD operations.
     1460They also include specialized horizontal SWAR operations.
    14621461Thus, design tradeoffs can be made with the potential of
    14631462reducing the chip area devoted to special purpose instructions
     
    14671466
    14681467In considering the architectural support for
    1469 SIMD text processing using the method of parallel bit streams,
     1468SWAR text processing using the method of parallel bit streams,
    14701469this paper has presented the principle of inductive doubling
    14711470and a realization of that principle in the form of
Note: See TracChangeset for help on using the changeset viewer.