Changeset 238 for docs/ASPLOS09/asplos094cameron.tex
 Timestamp:
 Dec 16, 2008, 7:15:17 AM (11 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r237 r238 150 150 as IDISA: inductive doubling instruction set architecture. 151 151 152 To assess the benefits of IDISA features in comparison to153 those found in existing SIMD support on commodity154 processors, we will focus on an idealized threeregister155 model of SIMD instruction sets and evaluation of156 parallel bit stream and other computational kernels with157 respect to these features. The threeregister model is158 based on the general approach to binary SIMD operations159 as ones that operate on the contents of two source operand160 registers to produce a value to be written back to a161 single destination operand register. This threeregister162 model is directly used by the Altivec SIMD instructions163 of the Power PC, for example. On the Intel SSE platform,164 the threeregister model is used as a programmer's165 interface for the Clanguage intrinsics, while the166 underlying instructions use a tworegister model with167 destructive updating. The computational kernels we study consist168 of 100\% branchfree code operating on registers, without169 memory access For such kernels, we use straightline170 instruction count as the performance metric of interest,171 assuming that pipelined processors are welldesigned172 to handle latencies.173 174 175 152 The remainder of this paper is organized as follows. 176 153 … … 182 159 is also made. 183 160 184 The third section provides a short first example of 161 The third section then discusses the evaluation methodology 162 for IDISA. Two reference architectures are briefly described as 163 well as criteria for evaluation IDISA against these architectures. 164 165 The fourth section provides a short first example of 185 166 the inductive doubling principle in action through 186 167 the case of population count. Although this operation … … 189 170 frequently enough in the general computing milieux to find 190 171 its way into some instruction set architectures, typically 191 at one particular field width. By way of comparison, the 192 inductive doubling architecture sacrifices some 193 performance at the chosen field width, while offering a more 194 general solution with frequently better performance at 195 other field widths. 196 197 The fourth section then moves on to consider the performancecritical 172 at one particular field width. 173 %By way of comparison, the 174 %inductive doubling architecture sacrifices some 175 %performance at the chosen field width, while offering a more 176 %general solution with frequently better performance at 177 %other field widths. 178 179 The fifth section then moves on to consider the performancecritical 198 180 and key task of conversion between serial byte streams and parallel 199 181 bit streams. A first algorithm that uses the existing SIMD … … 206 188 architecture is considerably simpler and easier to program. 207 189 208 The fifth section then briefly considers the inverse transposition190 The sixth section then briefly considers the inverse transposition 209 191 process, converting parallel bit stream data back into byte 210 192 streams. Again, an algorithm to carry out this task requires … … 212 194 form, but is reduced to a simpler 24 operations using IDISA. 213 195 214 The s ixth section then goes on to consider the problem of196 The seventh section then goes on to consider the problem of 215 197 parallel bit deletion. This operation is performancecritical 216 198 to any applications that require filtering or … … 224 206 alternative. 225 207 226 The seventh section considers the issue of 227 horizontal SIMD operations, that is, operations for combining 228 sets of adjacent fields within individual SIMD registers rather than 229 corresponding fields within sets of registers. While existing 230 SIMD instruction set architectures tend to only support a few 231 ad hoc horizontal combinations, IDISA is shown to provide a systematic 232 means for efficient horizontal combinations of any kind. 208 The eighth section then considers the potential role of 209 IDISA in supporting applications beyond parallel bit streams. 210 Additional examples from several domains are presented. 211 Perhaps most importantly, however, the role of IDISA 212 in supporting a fully systematic treatment of {\em horizontal} 213 SWAR operations for any application area is discussed. 233 214 234 215 An implementation model for IDISA is then considered 235 in section 8of the paper, focusing on a pipelined216 in section 9 of the paper, focusing on a pipelined 236 217 SIMD architecture featuring a modified operand fetch stage. 237 218 A gatecount analysis of one feasible implementation is … … 243 224 more general alternatives provided through IDISA. 244 225 245 The ninth section concludes the paper with a summary of results226 The tenth section concludes the paper with a summary of results 246 227 and discussion of areas for future work. 247 228 … … 428 409 on SSE. 429 410 430 This completes the description of the proposed inductive 431 doubling architecture. As described, many of the features 432 are already available with the SIMD facilities of 411 This completes the description of IDISA. As described, many of the 412 features are already available with the SIMD facilities of 433 413 existing commodity processors. The extensions enumerated 434 414 here are relatively straightforward. The innovation … … 438 418 can dramatically reduce instruction count in core parallel bit 439 419 stream algorithms, with a factor of 3 reduction being typical. 440 The next section goes on to illustrate a first simple example 441 for the task of population count. 420 421 \section{Evaluation Methodology} 422 423 IDISA represents a set of instruction set features that 424 could potentially be added to any SWAR processor. The goal 425 in this paper is to evaluate these features independent 426 of any artifacts that may be due to any particular realization, 427 while still considering realistic models based on existing 428 commodity instruction set architectures. For the purpose 429 of IDISA evaluation, then, we define two reference architectures. 430 For concreteness, IDISA and the two reference architectures will 431 each be considered as 128bit processors employing the 432 threeregister SWAR model defined in the previous 433 section. 434 435 Reference architecture A (RefA) consists of a limited register 436 processor providing a set of core binary operations defined 437 for 8, 16, 32 and 64 bit fields. The core binary operations 438 will be assumed to be those defined by the SSE instruction 439 set for 16bit fields. In addition, we assume that 440 shift immediate operations for each field width exist, 441 e.g., \verb#simd<8>::srli<1>(x)# for a right logical 442 shift of each 8bit field by 1. We also assume that 443 a constant load operation \verb#simd::const<n>(c)# 444 loads the constant value $c$ into each $n$ bit field. 445 446 Reference architecture B (RefB) consists of a registerrich 447 processor incorporating all the operations of reference 448 architecture A as well as the following additional facilities 449 inspired by the Altivec instruction set. 450 For each of the 8, 16, 32 and 64 bit widths, a binary rotate left 451 logical instruction \verb#simd<n>::rotl(a,b)# rotates each field 452 of $a$ by the rotation count in the corresponding field of $b$. 453 A threeregister \verb#simd<1>::if(a, b, c)# bitwise logical 454 operator implements the logic $a & b  ~a & c$, patterned 455 after the Altivec \verb:vec_sel: operation. Finally, 456 a \verb#simd<8>::permute(a, b, c)# selects an arbitrary 457 permutation of bytes from the concatenation of $a$ and $b$ based 458 on the set of indices in $c$. 459 460 Two versions of IDISA are assessed against these reference 461 architectures as follows. IDISAA has all the facilities 462 of RefA extended with halfoperand modifiers and all core 463 operations at field widths of 2, 4 and 128. IDISAB is 464 similarly defined and extended based on RefB. Algorithms 465 for both RefA and IDISAA are assessed assuming that 466 any required constants must be loaded as needed; this 467 reflects the limited register assumption. On the other, 468 assessment for both RefB and IDISAB will make the 469 assumption that sufficiently many registers exist that 470 constants can be kept preloaded. 471 472 In each case, the processors are assumed to be welldesigned 473 pipeline processors with a throughput of one SWAR instruction 474 each processor cycle for straightline code free of memory 475 access. Such processors are certainly feasible with a 476 suitable investment in circuitry, although practical designs 477 may make sacrifices to achieve close to optimal throughput 478 with less circuitry. However, the assumption makes for 479 straightforward performance evaluation based on instruction 480 count for straightline computational kernels. Furthermore, 481 the assumption also eliminates artifacts due to possibly different 482 latencies in reference and IDISA architectures. Because 483 the same assumption is made for reference and IDISA 484 architectures, determination of the additional circuit 485 complexity due to IDISA features is unaffected by the 486 assumption. 487 488 In the remainder of this paper, then, IDISAA and IDISAB 489 models are evaluated against their respective reference 490 architectures on straightline computational kernels 491 used in parallel bit stream processing and other applications. 492 As XML and other sequential text processing applications 493 tend to use memory in an efficient streaming model, the 494 applications tend to be computebound rather than IObound. 495 Thus, the focus on computational kernels addresses the 496 primary concern for performance improvement of these applications. 442 497 443 498 \section{Population Count} 444 499 445 \begin{figure} 500 \begin{figure}[h] 446 501 \begin{center}\small 447 502 \begin{verbatim} … … 467 522 \end{verbatim} 468 523 \end{center} 469 \caption{I deal SIMDPopulation Count}524 \caption{IDISA Population Count} 470 525 \label{inductivepopcount} 471 526 \end{figure}
Note: See TracChangeset
for help on using the changeset viewer.