%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
{School of Computing Science, Simon Fraser University}
{\tt \{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching. Direct architectural support for these algorithms in future SWAR
instruction sets could further increase performance as well as simplify the
programming task. A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique. These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement. The extensions are also shown to be
useful for SWAR programming in other application areas, including providing a
systematic treatment for horizontal operations. An implementation model for
these extensions involves relatively simple circuitry added to the operand fetch
components in a pipelined processor.
\end{abstract}

\category{C.1.2}{PROCESSOR ARCHITECTURES}{Multiple Data Stream Architectures (Multiprocessors)}[Single-instruction-stream, multiple-data-stream processors (SIMD)]

\terms Design, Performance

\keywords inductive doubling, parallel bit streams, SWAR

\sloppy

\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging. Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing as the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help. Indeed,
the study even speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle with traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{u8u16, PPoPP08}, XML parsing \cite{CASCON08, Herdy}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes). Loading bit stream data into 128-bit registers,
then, allows data from 128 consecutive code units to be represented and
processed at once. Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

In application to UTF-8 to UTF-16 transcoding, a 3X to 25X speed-up
is achieved using parallel bit stream techniques on SWAR-capable uniprocessors
employing the SSE or Altivec instruction sets \cite{PPoPP08}.
In the broader context of XML parsing, further applications of these
techniques demonstrate the utility of parallel bit stream methods
in delivering performance benefits through a significant portion of the
web technology stack. In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with a similarly
coded application using such well-known parsers as Expat and Xerces \cite{CASCON08}.
In an application involving transformation between different XML formats
(GML and SVG), an implementation using parallel bit stream technology
required a mere 15 cycles per byte, while a range of other technologies
required from 25 to 200 cycles per byte \cite{Herdy}.
Ongoing work is further applying the parallel bit stream methods to
parallel hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema \cite{CASCON08}.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SWAR instruction set
features based on the principle of systematic support for
inductive doubling algorithms. Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration. In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths. These transitions are quite frequent in
parallel bit stream programming as well as in other applications.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.
The second section introduces IDISA and the
SWAR notation used throughout this paper. The third
section moves on to discuss an evaluation methodology
for IDISA in comparison to two reference architectures
motivated by the SSE and Altivec instruction sets.
The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count. Sections 5 through 7
then address the application of IDISA to core
algorithms in text processing with parallel bit
streams.
The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Section 9 addresses IDISA implementation, while
Section 10 concludes the paper with a summary of
results and directions for further work.


\section{Inductive Doubling Architecture}

This section presents IDISA as an idealized model for a SWAR instruction set
architecture designed specifically to support inductive doubling
algorithms. The architecture is idealized
in the sense that we concentrate only on the features necessary
for our purpose, without enumerating the additional
operations that would be required for
SWAR applications in other domains. The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design. First we introduce a simple model and notation for
SWAR operations in general and then present the four key
features of IDISA.


IDISA supports typical SWAR integer operations using a {\em
three-register model} involving two input registers
and one output register. Each register is of size $N=2^K$ bits,
for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.
The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$. Typical values of $k$ used on commodity processors
include $k = 3$ for SWAR operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields. Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n - 1$ of
register $r$, using big-endian numbering.

Let \verb:simd<n>: represent the class of SWAR operations defined
on fields of size $n$ using C++ template syntax. Given a
binary function $F_n$ on $n$-bit fields, we denote the SWAR
version of this operation as \verb#simd<n>::F#. Given two input
registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the output register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\begin{eqnarray}
r_i &=& F_n(a_i, b_i)
\end{eqnarray}

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}

The Altivec instruction set includes each of these operations
for 8, 16 and 32-bit fields directly following the three-register
model. The SSE set uses a two-register model with the result
being copied back to one of the input operands. However, the
C language intrinsics commonly used to access these instructions
reflect the three-register model. The SSE set extends these
operations to include operations on 64-bit fields, but
constrains the shift instructions, requiring that all fields shift by the same amount.

Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values. The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers. Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is key. In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields. Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width. In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width. In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core. In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1. These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available. For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SWAR
operation. These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:''). When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified. The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions. Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
to the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.

The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations. Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.

The final facility of the inductive doubling architecture is
a set of merging operations that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers. The
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths. The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.

This completes the description of IDISA. As described, many of the
features are already available with the SWAR facilities of
existing commodity processors. The extensions enumerated
here are relatively straightforward. The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor. The goal
in this paper is to evaluate these features independent
of artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures. For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited-register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields. The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields. In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1. We also assume that
a constant load operation \verb#simd::constant<n>(c)#
loads the constant value $c$ into each $n$-bit field.
The pack and merge facilities of SSE will also be assumed.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)# rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-input \verb#simd<1>::if(a,b,c)# bitwise logical
operator implements the logic $r = a \wedge b \vee \neg a \wedge c$, patterned
after the Altivec \verb:vec_sel: operation. Finally,
a \verb#simd<8>::permute(a,b,c)# operation selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

Two versions of IDISA are assessed against these reference
architectures as follows. IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128. IDISA-B is
similarly defined and extended based on RefB. Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption. On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be
pipelined, with a throughput of one SWAR instruction
per processor cycle for straight-line code free of memory
access. This assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.
Furthermore, the assumption also eliminates artifacts due to
possibly different latencies in reference and IDISA architectures.
Because the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, the
applications tend to be compute-bound rather than IO-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity required to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section. That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field. It is important
enough for operations such as calculating Hamming distance to be included
as a built-in instruction
on some processors. For example, the SPU of the Cell Broadband Engine
has a SWAR population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields. Each implementation employs
five steps of inductive doubling to produce population counts
within 32-bit fields. The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

| 484 | With the substitution of longer mask constants replicated for four |
---|
| 485 | 32-bit fields, the implementation of Figure \ref{HD-pop} can be |
---|
[231] | 486 | directly adapted to SWAR processing using 128-bit registers. |
---|
| 487 | Each binary operator is replaced by a corresponding binary |
---|
[241] | 488 | SWAR operation. Without the IDISA features, a |
---|
[239] | 489 | straightforward RefA implementation of population count for |
---|
[230] | 490 | 32-bit fields thus employs 10 operations to load or generate |
---|
[227] | 491 | mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a |
---|
[231] | 492 | total of 30 operations to complete the task. Employing |
---|
| 493 | optimization identified by Warren, this can be reduced to |
---|
| 494 | 20 operations, 5 of which are required to generate mask constants. |
---|
[230] | 495 | At the cost of register pressure, it is possible that these constants |
---|
[239] | 496 | could be kept preloaded in long vector processing. In accord |
---|
| 497 | with our evaluation model, the RefB cost is thus 15 operations. |
---|
| 498 | As the IDISA implementation requires no constants at all, |
---|
| 499 | both the IDISA-A and IDISA-B cost is 5 operations. |
---|
| 500 | At our assumed one CPU cycle per instruction model, IDISA-A |
---|
| 501 | offers a 4X improvement over RefA, while IDISA-B offers a 3X |
---|
| 502 | improvement over its comparator. |

The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step. In general,
the mask constants shown in Figure \ref{HD-pop} recur; these
are termed ``magic masks'' by Knuth \cite{v4pf1a}.
If the algorithm employs a single operation at each step, then a total
of $3n$ operations are required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks. IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$. Thus a 3X to 4X improvement
obtains in these cases.

\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of IDISA: transposition of byte stream data to parallel bit stream
form. Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process. Before considering how
IDISA supports this
transposition process, however, we first consider
algorithms on existing architectures. Two algorithms
are presented; the best of these requires 72
SWAR operations under the RefB model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using IDISA-A or IDISA-B
with a mere 24 three-register SWAR operations. We also show
that this is optimal for any three-register instruction set model.

\begin{figure}
\begin{center}
\includegraphics[width=87mm, trim= 40 50 0 50]{S2P_IO.pdf}
\caption{Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}
\end{figure}
Figure \ref{s2p-spec} illustrates the input-output requirements of
the transposition problem. We assume that inputs and
outputs are each SWAR registers of size $N=2^K$ bits.
The input consists of $N$ bytes of serial byte data,
stored consecutively in eight SWAR registers each holding
$N/8$ bytes. The output consists of eight parallel
registers, one each for the eight individual bit positions
within a byte. Upon completion of the transposition process,
each output register is to hold the $N$ bits corresponding
to the selected bit position in the sequence of $N$ input
bytes.
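To make the requirement concrete, a naive scalar reference for the toy case $N=8$ (one byte per register) follows; bit positions are numbered from the most significant bit, following the parallel bit stream convention. The function name is ours:

```c
#include <stdint.h>

/* Naive reference for the transposition requirement with N = 8:
   input s[0..7] holds one byte per "register"; output bit[k]
   collects bit k (numbered from the most significant bit) of
   every input byte, byte j landing at output position j from
   the most significant end. */
void transpose_ref(const uint8_t s[8], uint8_t bit[8]) {
    for (int k = 0; k < 8; k++) {
        bit[k] = 0;
        for (int j = 0; j < 8; j++)
            bit[k] |= (uint8_t)(((s[j] >> (7 - k)) & 1) << (7 - j));
    }
}
```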

\subsection{Bit Gathering Algorithm}

% \begin{figure}[tbh]
% \begin{center}
% \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
% \caption{Serial to Parallel Transposition Using Bit-Gathering}
% \label{gather}
% \end{center}
% \end{figure}
One straightforward algorithm for implementing the transposition process
takes advantage of SWAR bit gathering operations that exist
on some architectures. This operation gathers one bit per byte
from a particular position within each byte of a register.
For example, the {\tt pmovmskb} operation of the Intel
SSE instruction set forms a 16-bit mask
consisting of the high bit of each byte. Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte.
% Figure \ref{gather} illustrates the
% bit gathering process.

Using bit gathering, each bit stream of output is assembled 16 positions
at a time. Bits from the input register must be shifted into
position, the gather operation performed and the result inserted
into position in the output register. For the 8 streams, this
requires at least 22 operations for 16 positions, or 176 operations
to complete the transposition task.
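A scalar model of such a gather, patterned after {\tt pmovmskb} (which maps the high bit of byte $i$ to mask bit $i$), may clarify the operation; the function name is ours:

```c
#include <stdint.h>

/* Scalar emulation of a pmovmskb-style bit gather: collect the
   high bit of each of 16 bytes into a 16-bit mask, with byte i
   contributing mask bit i, as in the SSE instruction. */
uint16_t movemask16(const uint8_t b[16]) {
    uint16_t m = 0;
    for (int i = 0; i < 16; i++)
        m |= (uint16_t)(((b[i] >> 7) & 1) << i);
    return m;
}
```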

\subsection{BytePack Algorithm}

A more efficient transposition algorithm on commodity
SWAR architectures involves three
stages of binary division transformation. This is similar
to the three stage bit matrix inversion described by
Warren \cite{HackersDelight}, although modified to use SWAR operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions. The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even-numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte. The stream of bits from
odd positions is similarly divided into streams for bits from each of the
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage are
divided into the desired individual bit streams in the third stage.

% \begin{figure}[tbh]
% \begin{center}\small
% \begin{verbatim}
% s0h = simd<16>::srli<8>(s0);
% s0l = simd_and(s0, simd<16>::constant(0x00FF));
% s1h = simd<16>::srli<8>(s1);
% s1l = simd_and(s1, simd<16>::constant(0x00FF));
% t0 = simd<16>::pack(s0h, s1h);
% t1 = simd<16>::pack(s0l, s1l);
% t0_l1 = simd<16>::slli<1>(t0);
% t1_r1 = simd<16>::srli<1>(t1);
% mask = simd<8>::constant(0xAA);
% p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask));
% p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask));
% \end{verbatim}
% \end{center}
% \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
% \label{s2pstep}
% \end{figure}
%

\begin{figure}
\begin{center}\small
\begin{verbatim}
even={0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd ={1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::constant(0xAA);
t0 = simd<8>::permute(s0, s1, even);
t1 = simd<8>::permute(s0, s1, odd);
p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{RefB Transposition Step in BytePack Stage 1}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking. In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.
Figure \ref{s2pstep} shows a stage 1 transposition step in a
RefB implementation. Using the permute facility, the even
and odd bytes, respectively, from two serial input registers
\verb:s0: and \verb:s1: are packed into temporary registers
\verb:t0: and \verb:t1:. The even and odd bits are then
separated into two parallel output registers \verb:p0: and \verb:p1:
by selecting alternating bits using a mask. This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and mask constants.
Overall, 6 operations per step are required, yielding a total
of 72 operations to transpose 128 bytes to parallel bit stream
form in the RefB implementation.
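The bit-selection portion of this step can be modeled on a single 16-bit unit in plain C. The expressions mirror the computation of \verb:p0: and \verb:p1: in Figure \ref{s2pstep}, with the replicated {\tt 0xAA} mask selecting alternating bits; the function names are ours:

```c
#include <stdint.h>

/* Model of the bit separation in a stage 1 step on a 16-bit unit.
   Given t0 holding packed even bytes and t1 packed odd bytes, the
   mask 0xAAAA selects the bits at even positions counting from the
   most significant bit (the parallel bit stream convention). */
uint16_t select_even_bits(uint16_t t0, uint16_t t1) {
    const uint16_t mask = 0xAAAA;
    /* p0 = simd_if(mask, t0, t1 >> 1) */
    return (uint16_t)((t0 & mask) | ((t1 >> 1) & (uint16_t)~mask));
}
uint16_t select_odd_bits(uint16_t t0, uint16_t t1) {
    const uint16_t mask = 0xAAAA;
    /* p1 = simd_if(mask, t0 << 1, t1) */
    return (uint16_t)(((t0 << 1) & mask) | (t1 & (uint16_t)~mask));
}
```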

In a RefA implementation, byte packing may also be achieved
by the \verb#simd<16>::pack# operation, with 4 additional operations to
prepare operands. Essentially, the RefB implementation
uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
each \verb#simd_if#.
Assuming that mask loads are only needed once per 128 bytes,
a total of 148 operations are required in the RefA implementation.

\subsection{Inductive Halving Algorithm}

Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the BytePack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bit-pairs in a second stage and finally
dividing the bit-pair streams into bit streams in the third stage.

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two IDISA-A operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte. As in the BytePack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs. The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble. Four applications of this step complete stage 2.

\begin{figure}
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Step in Inductive Halving Algorithm Stage 1}
\label{halvingstep}
\end{figure}

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair. Under either the IDISA-A or IDISA-B model,
the complete algorithm to transform eight serial
byte registers s0 through s7 into the eight parallel bit stream
registers bit0 through bit7 requires a mere 24 instructions per 128
input bytes.
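The complete three-stage algorithm can be modeled in scalar C for the toy case $N=8$, where each ``register'' is a single byte and a helper models the half-operand pack behavior. All names here are ours; the model performs exactly 24 pack operations, matching the count above:

```c
#include <stdint.h>

/* Model of simd<w>::pack<h,h> and <l,l> for one-byte "registers":
   concatenate the high (or low) w/2 bits of each w-bit field of a,
   then of b, from most significant field to least. */
static uint8_t pack_half(int w, uint8_t a, uint8_t b, int high) {
    uint8_t src[2] = { a, b }, r = 0;
    int half = w / 2, out = 0;
    for (int s = 0; s < 2; s++)
        for (int f = 8 - w; f >= 0; f -= w) {
            uint8_t fld = (uint8_t)((src[s] >> (high ? f + half : f)) & ((1u << half) - 1));
            out += half;
            r |= (uint8_t)(fld << (8 - out));
        }
    return r;
}
#define PH(w, a, b) pack_half(w, a, b, 1)
#define PL(w, a, b) pack_half(w, a, b, 0)

/* Three halving stages: bytes -> nybbles -> bit pairs -> bit streams.
   bit[k] collects bit k (from the most significant end) of each byte. */
void transpose8(const uint8_t s[8], uint8_t bit[8]) {
    uint8_t hn[4], ln[4], hh[2], hl[2], lh[2], ll[2];
    for (int i = 0; i < 4; i++) {                 /* stage 1 */
        hn[i] = PH(8, s[2*i], s[2*i+1]);
        ln[i] = PL(8, s[2*i], s[2*i+1]);
    }
    for (int i = 0; i < 2; i++) {                 /* stage 2 */
        hh[i] = PH(4, hn[2*i], hn[2*i+1]);
        hl[i] = PL(4, hn[2*i], hn[2*i+1]);
        lh[i] = PH(4, ln[2*i], ln[2*i+1]);
        ll[i] = PL(4, ln[2*i], ln[2*i+1]);
    }
    bit[0] = PH(2, hh[0], hh[1]);                 /* stage 3 */
    bit[1] = PL(2, hh[0], hh[1]);
    bit[2] = PH(2, hl[0], hl[1]);
    bit[3] = PL(2, hl[0], hl[1]);
    bit[4] = PH(2, lh[0], lh[1]);
    bit[5] = PL(2, lh[0], lh[1]);
    bit[6] = PH(2, ll[0], ll[1]);
    bit[7] = PL(2, ll[0], ll[1]);
}
```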

% \begin{figure}[tbh]
% \small
% \begin{verbatim}
% hnybble0 = simd<8>::pack<h,h>(s0, s1);
% lnybble0 = simd<8>::pack<l,l>(s0, s1);
% hnybble1 = simd<8>::pack<h,h>(s2, s3);
% lnybble1 = simd<8>::pack<l,l>(s2, s3);
% hnybble2 = simd<8>::pack<h,h>(s4, s5);
% lnybble2 = simd<8>::pack<l,l>(s4, s5);
% hnybble3 = simd<8>::pack<h,h>(s6, s7);
% lnybble3 = simd<8>::pack<l,l>(s6, s7);
% hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
% hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
% lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
% ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
% hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
% hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
% lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
% ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
% bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
% bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
% bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
% bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
% bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
% bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
% bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
% bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
% \end{verbatim}
% \caption{Complete Inductive Halving Algorithm}
% \label{halvingalgorithm}
% \end{figure}

\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SWAR architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SWAR architecture, we refer to any architecture
that uses SWAR instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition. Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address. However, each
operation can move results into at most one register, and can
thus assign at most one correct address bit for each of the
$N$ input bits it moves. As all $8N$ input bits
need to be moved to a register with a correct 3-bit address,
a minimum of 24 operations is required.

\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}. Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone would improve end-to-end performance in a
complete XML processing application by more than 4\%.
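The 4\% figure follows directly from the cited measurements: a 3X speed-up of the 1.1 cycle per byte transposition saves about 0.73 cycles per byte out of the 15 cycle per byte total, i.e.,

```latex
\[
\frac{1.1 - 1.1/3}{15} \;\approx\; \frac{0.73}{15} \;\approx\; 4.9\% .
\]
```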

\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions. In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed. In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit. Those two byte streams
are subsequently merged to form the final result.

Algorithms for performing the inverse transform mirror those
of the forward transform, employing SWAR merge operations
in place of pack operations. The best algorithm known
to us on the commodity SWAR architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets. To perform the full inverse transform of 8 parallel
registers of bit stream data into 8 serial registers of byte stream data,
a RefA implementation requires 120 operations, while a RefB
implementation reduces this to 72.

% \begin{figure}[tbh]
% \begin{center}\small
% \begin{verbatim}
% bit01_r0 = simd<1>::mergeh(bit0, bit1);
% bit01_r1 = simd<1>::mergel(bit0, bit1);
% bit23_r0 = simd<1>::mergeh(bit2, bit3);
% bit23_r1 = simd<1>::mergel(bit2, bit3);
% bit45_r0 = simd<1>::mergeh(bit4, bit5);
% bit45_r1 = simd<1>::mergel(bit4, bit5);
% bit67_r0 = simd<1>::mergeh(bit6, bit7);
% bit67_r1 = simd<1>::mergel(bit6, bit7);
% bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
% bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
% bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
% bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
% bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
% bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
% bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
% bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
% s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
% s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
% s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
% s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
% s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
% s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
% s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
% s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
% \end{verbatim}
% \end{center}
% \caption{Parallel to Serial Transposition by Inductive Doubling}
% \label{p2s-inductive}
% \end{figure}

An algorithm employing only 24 operations using IDISA-A/B is relatively
straightforward. In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations. For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bit-pair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bit-pair streams to produce streams
of nybbles for the high and low nybble of each byte. Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data. The full inductive doubling
algorithm for parallel to serial transposition thus
requires three stages of 8 instructions each. The algorithm is again
optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model.
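The bit-level merge underlying stage 1 can be modeled in scalar C on 8-bit values, with \verb:merge1h: and \verb:merge1l: playing the roles of \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#; the function names are ours:

```c
#include <stdint.h>

/* Interleave the low 4 bits of a and b, a's bits taking the higher
   position of each resulting pair. */
static uint8_t interleave4(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    for (int i = 3; i >= 0; i--)
        r = (uint8_t)((r << 2) | (((a >> i) & 1) << 1) | ((b >> i) & 1));
    return r;
}

/* Model of simd<1>::mergeh / mergel on one-byte "registers":
   mergeh interleaves the high halves of a and b, mergel the low. */
uint8_t merge1h(uint8_t a, uint8_t b) { return interleave4(a >> 4, b >> 4); }
uint8_t merge1l(uint8_t a, uint8_t b) { return interleave4(a & 0x0F, b & 0x0F); }
```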

\begin{figure*}
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010: \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F: \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1 \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.


\section{Parallel Bit Deletion}

Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form. It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:. Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.

Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}. This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process. However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream. Using the inductive doubling instruction set architecture,
it is possible to carry out bit deletion much more efficiently.
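For comparison, the parallel-prefix compress routine as given by Warren can be rendered in C for a 32-bit field. Note that it keeps the bits selected by 1s in its mask, so deletion under the convention above passes the complement of the deletion mask:

```c
#include <stdint.h>

/* Parallel-prefix compress (Warren, attributed to Steele): gather
   the bits of x selected by 1s in m to the low end, preserving
   their order. To delete under a deletion mask, pass m = ~delmask. */
uint32_t compress32(uint32_t x, uint32_t m) {
    uint32_t mk, mp, mv, t;
    x &= m;                 /* clear the bits that will be deleted */
    mk = ~m << 1;           /* count 0s to the right of each bit   */
    for (int i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);            /* parallel prefix xor */
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        mv = mp & m;                    /* bits to move this round */
        m = (m ^ mv) | (mv >> (1 << i));
        t = x & mv;
        x = (x ^ t) | (t >> (1 << i));  /* move selected bits right */
        mk &= ~mp;
    }
    return x;
}
```

For example, the 8-bit deletion mask \verb:10100010: applied to a stream corresponds to \verb:compress32(x, 0x5D): on the low byte.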

Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified. For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
care positions marked `\verb:x:'. Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}. For example, \verb:xxbd: and \verb:efhx: may be respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:. Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.

The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm. Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results. The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field. Similarly,
left-justification requires a left-shift by the population count in the
high half of the field.

The left and right shifts can be performed simultaneously using a rotate
left instruction. Right justification by shifting an $n$-bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions. Given a register value \verb:c8: preloaded with
the value 8 in each 8-bit field, the rotation
amounts for right justification are computed by the operation
\verb#rj=simd<8>::sub<x,l>(c8, cts_4)# producing the values shown in row 5,
except that don't care fields (which won't be subsequently used)
are marked \verb:XX:.

The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
producing the values shown in row 6, and are then combined with the right shift amounts
by the selection operation \verb#rot_8=simd_if(mask0xFF00, rj, lj)#
as shown in row 7. Using these computed values, the inductive step
is completed by application of the operation \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.

At each inductive doubling level, 4 operations are required to compute the
deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.
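The $8 \rightarrow 16$ step of Figure \ref{centraldelstep} can be modeled in C on a single 16-bit unit; the helper names are ours, and the variable-amount rotate plays the role of \verb#simd<8>::rotl#:

```c
#include <stdint.h>

/* cts_4 for one nybble: population count of a 4-bit value. */
static int popcount4(unsigned x) {
    x = (x & 5) + ((x >> 1) & 5);
    return (int)((x & 3) + ((x >> 2) & 3));
}

static uint8_t rotl8(uint8_t x, int n) {
    n &= 7;
    return n ? (uint8_t)((x << n) | (x >> (8 - n))) : x;
}

/* One 8 -> 16 step of deletion by central result induction on a
   single 16-bit unit. delmask has 1s at deleted positions; r8
   holds the two 8-bit central results. */
uint16_t central_step_8_16(uint16_t r8, uint16_t delmask) {
    /* right-justify the high 8-bit result: rotate left by
       8 - popcount(low nybble of its deletion mask) */
    int rj = 8 - popcount4((delmask >> 8) & 0x0F);
    /* left-justify the low 8-bit result: rotate left by
       popcount(high nybble of its deletion mask) */
    int lj = popcount4((delmask >> 4) & 0x0F);
    uint8_t hi = rotl8((uint8_t)(r8 >> 8), rj);
    uint8_t lo = rotl8((uint8_t)(r8 & 0xFF), lj);
    return (uint16_t)((hi << 8) | lo);
}
```

With the concrete bit values b=1, c=0, g=1, h=1, i=1, k=0, l=1 from the first two fields of the figure, the central results \verb:00bcgh00: and \verb:0ikl0000: become {\tt 0x2C} and {\tt 0x50}, and the step produces \verb:0000bcghikl00000: = {\tt 0x0BA0}.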

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture requires far fewer operations.
The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
induction versus $4k^2$ for the parallel-prefix method. Using the computed
deletion information, only a single SWAR rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.

\section{Beyond Parallel Bit Streams}

IDISA has a variety of applications in domains beyond
text processing with parallel bit streams. We present
a number of examples in this section, including,
most significantly, a full general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Parity}

% \begin{figure}[h]
% \begin{center}\small
% \begin{verbatim}
% y = x ^ (x >> 1);
% y = y ^ (y >> 2);
% y = y ^ (y >> 4);
% y = y ^ (y >> 8);
% y = y ^ (y >>16);
% y = y & 1;
% \end{verbatim}
% \end{center}
% \caption{Parity Reference Algorithm}
% \label{HD-parity}
% \end{figure}

---|
[255] | 1007 | \begin{figure}[tb] |
---|
[234] | 1008 | \begin{center}\small |
---|
| 1009 | \begin{verbatim} |
---|
[236] | 1010 | y = simd<2>::xor<h,l>(x, x); |
---|
| 1011 | y = simd<4>::xor<h,l>(y, y); |
---|
| 1012 | y = simd<8>::xor<h,l>(y, y); |
---|
[255] | 1013 | y = simd<16>::xor<h,l>(y, y); |
---|
| 1014 | y = simd<32>::xor<h,l>(y, y); |
---|
[234] | 1015 | \end{verbatim} |
---|
| 1016 | \end{center} |
---|
| 1017 | \caption{IDISA Parity Implementation} |
---|
| 1018 | \label{ID-parity} |
---|
| 1019 | \end{figure} |
---|

Parity has important applications for error-correcting
codes such as the various Hamming codes, which detect
and correct a number of bit errors that depends on the
number of parity bits added.
Figure \ref{ID-parity} shows an IDISA-A parity implementation
with only 5 operations required for 32-bit fields,
slightly more than a 2X improvement over the 11 operations
required in a RefB implementation following Warren
\cite{HackersDelight}. The improvement is less than the
3X seen in other cases because one of the operands need
not be modified before applying the exclusive-or operation.
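
The inductive doubling structure of Figure \ref{ID-parity} can be sketched in
software. Each step folds the high half of every field into the low half with
a shift and exclusive-or; for a single 32-bit value this reduces to Warren's
reference sequence, and the parity lands in the low bit.

```python
def parity32(x):
    # Fold halves together: after the step with shift 2**k, the low bit
    # of every 2**(k+1)-bit group holds the parity of that group.
    for shift in (1, 2, 4, 8, 16):
        x ^= x >> shift
    return x & 1
```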
---|

\subsection{Bit Reverse}

Bit reverse is an important operation needed in a number
of low-level codecs. Following Warren's inductive
doubling implementation using masks and shifts \cite{HackersDelight},
a RefA implementation on 32-bit fields requires 28
operations, while a straightforward IDISA-A implementation
using \verb#simd<n>::rotli# at each inductive doubling
level requires only 5 operations.
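
For reference, the mask-and-shift reversal underlying the RefA operation count
can be rendered in Python as follows; this is a sketch of Warren's technique,
where the IDISA version collapses each mask/shift/or line into a single rotate.

```python
def reverse32(x):
    # Swap adjacent bits, then bit pairs, nibbles, bytes, and halfwords.
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555)
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333)
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F)
    x = ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)
    return ((x << 16) | (x >> 16)) & 0xFFFFFFFF
```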
---|
\subsection{Packed DNA Representation}

DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T. Internally,
these sequences are frequently stored
as packed sequences of 2-bit values. The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
allow these packed representations to be quickly computed
from byte-oriented string values by two steps of inductive
halving. Similarly, conversion back to string form
can use two steps of inductive merging. Without direct
support for these pack and merge operations, SWAR
implementations of these conversions incur the cost
of explicit masking and shifting in combination with
the 16-bit to 8-bit packing and 8-bit to 16-bit
merging operations supported by existing SWAR facilities
on commodity processors.
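
As an illustration of the target representation, the following Python sketch
packs and unpacks 4 bases per byte. The particular 2-bit code assignment is an
assumption made for the example; a SWAR implementation would compute the same
result with two pack or merge steps rather than per-symbol shifts.

```python
CODES = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}  # assumed assignment
BASES = 'ACGT'

def pack_dna(seq):
    """Pack a nucleotide string into bytes, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= CODES[ch] << (6 - 2 * (i % 4))
    return bytes(out)

def unpack_dna(data, n):
    """Recover the first n bases from the packed form."""
    return ''.join(BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3]
                   for i in range(n))
```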
---|

\subsection{String/Decimal/Integer Conversion}

Just as DNA sequences represent an important use case for
SWAR operations on 2-bit fields, packed sequences of
decimal or hexadecimal digits represent a common use case
for 4-bit fields. These representations can be used
both as an intermediate form in numeric string to integer
conversion and as a direct representation for
packed binary coded decimal.
---|
\begin{figure}
\begin{center}\small
\begin{verbatim}
b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
b=(b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF)
b=(b & 0x0000FFFF) + 10000 * (b >> 16)
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}
---|

\begin{figure}
\begin{center}\small
\begin{verbatim}
t1=simd<8>::constant(10)
t2=simd<16>::constant(100)
t3=simd<32>::constant(10000)
b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d)
b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b)
b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b)
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer}
\label{ID-BCD2int}
\end{figure}
---|

Figure \ref{BCD2int} shows a three-step inductive
doubling implementation for conversion of 32-bit packed BCD
values to integer form. The 32-bit value consists
of 8 4-bit decimal digits. Pairs of digits are
first combined by multiplying the higher digit
of the pair by 10 and adding. Pairs of these
two-digit results are then further combined by
multiplying the value of the higher of the two-digit
results by 100 and adding. The final step is
to combine four-digit results by multiplying the
higher one by 10000 and adding. Overall, 20
operations are required both for this implementation
and for the corresponding RefA implementation
applied to sets of 32-bit fields. Under the RefB model, preloading
6 constants into registers for repeated use can reduce the
number of operations to 14 at the cost of register
pressure.
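
The widthwise combination described above can be checked directly in software.
The following Python sketch applies the three doubling steps to a single
32-bit value, operating on the running result \verb#b# at each step.

```python
def bcd_to_int(d):
    # d holds 8 packed BCD digits; combine digit pairs, then pairs of
    # pairs, doubling the combined span at each step.
    b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
    b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF)
    b = (b & 0x0000FFFF) + 10000 * (b >> 16)
    return b
```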
---|

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}. This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each of the addition and multiplication operations
modified at each level. Overall, the IDISA-A implementation
requires 9 operations, while the IDISA-B model requires
6 operations with 3 preloaded registers.
In either case, this represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.
---|


\subsection{Further Applications}

Further applications of IDISA can often be found
by searching for algorithms employing the magic masks
\verb:0x55555555:, \verb:0x33333333:, and so on.
Examples include the bit-slice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
and integer contraction and dilation for quadtrees and
octrees \cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
Pixel packing from 32-bit fields into a 5:5:5 representation
is a further application of parallel bit deletion.
---|

\subsection{Systematic Support for Horizontal SWAR Operations}

In SWAR parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers. Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec). Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing a horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs a horizontal add of sets of
four unsigned 8-bit fields within one register
and then combines the result vertically with the
corresponding 32-bit fields of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is applied, whether it is applied
before or after the horizontal operation, and what the
nature of the vertical combining operation is.
Within this space, commodity SWAR architectures tend
to support only a very few combinations, without any
particular attempt at systematic support for horizontal
operations in general.
---|

In contrast to this {\em ad hoc} support on commodity
processors, IDISA offers a completely systematic treatment
of horizontal operations without any special features beyond
the inductive doubling features already described.
In the simplest case, any vertical operation
\verb#simd<n>::F# on $n$-bit fields gives rise to
an immediate horizontal operation
\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
while \verb#simd<32>::min<h,l># produces the
minimum value of adjacent 16-bit fields.
Thus any binary horizontal operation can be implemented
in a single IDISA instruction making use of the \verb:<h,l>:
operand modifier combination.
---|

Horizontal combinations of four adjacent fields can also be
realized in a general way through two steps of inductive
doubling. For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation
across four 32-bit fields. This four-field combination
can easily be implemented with the following two operations.
%\begin{singlespace}
\begin{verbatim}
t = simd<64>::or<h,l>(x, x)
t = simd<128>::or<h,l>(t, t)
\end{verbatim}
%\end{singlespace}
---|
[231] | 1201 | In general, systematic support for horizontal |
---|
| 1202 | combinations of sets of $2^h$ adjacent fields may |
---|
| 1203 | be realized through $h$ inductive double steps |
---|
| 1204 | in a similar fashion. |
---|
[258] | 1205 | Thus, IDISA essentially offers systematic support |
---|
[231] | 1206 | for horizontal operations entirely through the |
---|
[237] | 1207 | use of \verb:<h,l>: half-operand modifier |
---|
[231] | 1208 | combinations. |
---|
[227] | 1209 | |
---|
Systematic support for general horizontal operations
under IDISA also creates an opportunity for a design tradeoff:
offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SWAR operations.
Even if legacy support for these operations is required,
it may be possible to provide that support through
software or firmware rather than a full hardware
implementation. Evaluation of these possibilities
in the context of particular architectures is a potential
area for further work.


---|
\section{Implementation}

IDISA may be implemented as a software
abstraction on top of existing SWAR architectures or
directly in hardware. In this section, we briefly
discuss implementation of IDISA libraries before
moving on to consider hardware design. Although
a full realization of IDISA in hardware is beyond our
current capabilities, our goal is to develop a sufficiently
detailed design to assess the costs of IDISA implementation
in terms of the incremental complexity over the RefA
and RefB architectures. The cost factors we consider, then,
are the implementation of the half-operand modifiers, and
the extension of core operations to the 2-bit, 4-bit and
128-bit field widths. In each case, we also discuss
design tradeoffs.
---|

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template
and macro libraries has been useful in developing
and assessing the correctness of many of the algorithms
presented here. Although these implementations do not
deliver the performance benefits associated with
direct hardware implementation of IDISA, they
have been quite useful in providing a practical means
for portable implementation of parallel bit stream
algorithms on multiple SWAR architectures. However,
one additional facility has also proven necessary for
portability of parallel bit stream algorithms across
big-endian and little-endian architectures: the
notion of shift-forward and shift-back operations.
In essence, shift forward means shift to the left
on little-endian systems and shift to the right on
big-endian systems, while shift back has the reverse
interpretation. Although this concept is unrelated to
inductive doubling, its inclusion with the IDISA
libraries has provided a suitable basis for portable
SWAR implementations of parallel bit stream algorithms.
Beyond this, the IDISA libraries have the additional
benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without need for programmer coding of
the underlying shift and mask operations.
---|

\subsection{IDISA Model}
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=85mm, trim= 40 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}
---|

Figure \ref{pipeline-model} shows a block diagram
for a pipelined SWAR processor implementing IDISA.
The SWAR Register File (SRF) provides a file of $R = 2^A$
registers each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch
Unit (IFU) are forwarded for decoding to the SWAR
Instruction Decode Unit (SIDU). This unit decodes
the instruction to produce
signals identifying the source and destination
operand registers, the half-operand modifiers, the
field width specification and the SWAR operation
to be applied.
---|

The SIDU supplies the source register information and the half-operand
modifier information to the SWAR Operand Fetch Unit (SOFU).
For each source operand, the SIDU provides an $A$-bit register
address and two 1-bit signals $h$ and $l$ indicating the value
of the decoded half-operand modifiers for this operand.
Only one of these values may be 1; both are 0 if
no modifier is specified.
The SIDU also supplies decoded field width signals $w_k$
for each field width $2^k$ to both the SOFU and to the
SWAR Instruction Execute Unit (SIEU). Only one of the
field width signals has the value 1.
The SIDU also supplies decoded SWAR opcode information to the SIEU and
a decoded $A$-bit register address for the destination register to
the SWAR Result Write Back Unit (SRWBU).
---|

The SOFU is the key component of the IDISA model that
differs from that found in a traditional SWAR
processor. For each of the two $A$-bit source
register addresses, the SOFU is first responsible for
fetching the raw operand values from the SRF.
Then, before supplying operand values to the
SIEU, the SOFU applies the half-operand modification
logic as specified by the $h$, $l$, and field-width
signals. The possibly modified operand values are then
provided to the SIEU for carrying out the SWAR operations.
A detailed model of the SOFU logic is described in the following
subsection.
---|

The SIEU differs from similar execution units in
current commodity processors primarily by providing
SWAR operations at each field width
$n=2^k$ for $0 \leq k \leq K$. This involves
additional circuitry for field widths not supported
in existing processors. In our evaluation model,
IDISA-A adds support for 2-bit, 4-bit and 128-bit
field widths in comparison with the RefA architecture,
while IDISA-B similarly extends RefB.
---|

When execution of the SWAR instruction is
completed, the result value is then provided
to the SRWBU to update the value stored in the
SRF at the address specified by the $A$-bit
destination operand.
---|

\subsection{Operand Fetch Unit Logic}

The SOFU is responsible for implementing the half-operand
modification logic for each of up to two input operands fetched
from the SRF. For each operand, this logic is implemented
using the decoded half-operand modifier signals $h$ and $l$,
the decoded field width signals $w_k$ and the 128-bit operand
value $r$ fetched from the SRF to produce a modified 128-bit operand
value $s$ following the requirements of equations (4), (5) and
(6) above. Those equations must be applied for each possible
modifier and each field width to determine the possible values $s[i]$
for each bit position $i$. For example, consider bit
position 41, whose binary 7-bit address is $0101001$.
Considering the address bits left to right, each 1 bit
corresponds to a field width for which this bit lies in the
low $n/2$ bits (widths 2, 16, 64), while each 0 bit corresponds to a field
width for which this bit lies in the high $n/2$ bits.
In response to the half-operand modifier signal $h$,
bit $s[41]$ may receive a value from the corresponding high bit in the field
of width 2, 16 or 64, namely $r[40]$,
$r[33]$ or $r[9]$. Otherwise, this bit receives the value $r[41]$,
either when no half-operand modifier is specified or when a low half-operand
modifier occurs in conjunction with a field width signal $w_2$, $w_{16}$ or $w_{64}$.
The overall logic for determining this bit value is thus given as follows.
\begin{eqnarray*}
s[41] & = & h \wedge (w_2 \wedge r[40] \vee w_{16} \wedge r[33] \vee w_{64} \wedge r[9]) \\
& & \vee \neg h \wedge (\neg l \vee w_2 \vee w_{16} \vee w_{64}) \wedge r[41]
\end{eqnarray*}
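
This logic can be cross-checked in software against a direct statement of the
half-operand modification semantics. The following Python sketch (with bit 0
taken as the most significant of the 128, as in the numbering above) compares
the combinational formula for $s[41]$ with a reference model for every field
width and modifier setting.

```python
def modified_bit(r, i, h, l, k):
    # Reference semantics for field width n = 2**k: a high (h) modifier
    # moves each field's high n/2 bits into its low half and zeroes the
    # high half; a low (l) modifier keeps the low half and zeroes the
    # high half; with no modifier the bit is unchanged.
    n = 1 << k
    bit = lambda j: (r >> (127 - j)) & 1  # bit 0 is most significant
    in_high_half = (i % n) < n // 2
    if h:
        return 0 if in_high_half else bit(i - n // 2)
    if l:
        return 0 if in_high_half else bit(i)
    return bit(i)

def s41_formula(r, h, l, k):
    # The combinational expression derived in the text for s[41],
    # written bitwise over 0/1 values.
    bit = lambda j: (r >> (127 - j)) & 1
    w2, w16, w64 = int(k == 1), int(k == 4), int(k == 6)
    return (h & (w2 & bit(40) | w16 & bit(33) | w64 & bit(9))
            | (1 - h) & ((1 - l) | w2 | w16 | w64) & bit(41))

for r in (0, (1 << 128) - 1,
          0x0123456789ABCDEF0123456789ABCDEF,
          0x55555555555555555555555555555555):
    for h, l in ((0, 0), (1, 0), (0, 1)):
        for k in range(1, 8):  # field widths 2 through 128
            assert modified_bit(r, 41, h, l, k) == s41_formula(r, h, l, k)
```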
---|

Similar logic is determined for each of the 128 bit positions.
For each of the 7 field widths, 64 bits are in the low $n/2$ bits,
resulting in 448 2-input and gates for the $w_k \wedge r[i]$ terms.
For 120 of the bit positions, or gates are needed to combine these
terms; $441 - 120 = 321$ 2-input or gates are required. Another
127 2-input and gates combine these values with the $h$ signal.
In the case of a low half-operand modifier, the or gates combining the $w_k$
signals can share circuitry. For each bit position $i=2^k+j$, one
additional or gate is required beyond that for position $j$.
Thus 127 2-input or gates are required. Another 256 2-input and gates
are required for combination with the $\neg h$ and $r[i]$ terms. The terms for
the low and high half-operand modifiers are then combined with an
additional 127 2-input or gates. Thus, the circuit complexity
for the combinational logic implementation of half-operand
modifiers within the SOFU is 1279 2-input gates per operand,
or 2558 gates in total.
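
The counting argument above can be verified mechanically: a 1 in the 7-bit
address of a bit position means the position lies in the low half of its field
for the corresponding width, so the number of $w_k \wedge r[i]$ terms at
position $i$ is the population count of $i$'s address.

```python
# Number of field widths for which each of the 128 bit positions lies
# in the low n/2 bits equals the popcount of its 7-bit address.
terms = [bin(i).count("1") for i in range(128)]

assert sum(terms) == 448                           # and gates for the w_k terms
assert sum(1 for t in terms if t >= 1) == 127      # positions with any term
assert sum(1 for t in terms if t >= 2) == 120      # positions needing or gates
assert sum(t for t in terms if t >= 2) == 441      # terms at those positions
assert sum(t - 1 for t in terms if t >= 1) == 321  # 2-input or gates: 441 - 120
```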
---|

The gate-level complexity of half-operand modifiers as described is nontrivial
but modest. However, one possible design tradeoff is to
differentiate the two operands, permitting a high half-operand
modifier to be used only with the first operand and a low modifier
to be used only with the second operand. This would exclude
\verb#<h,h># and \verb#<l,l># modifier combinations and also
certain combinations for noncommutative core operations.
The principal
consequence for the applications considered here would be with
respect to the pack operations in forward transposition, but it
may be possible to address this through SIEU circuitry.
If this approach were taken, the gate complexity of
half-operand modification would be reduced by slightly more than half.
---|
\subsection{2-Bit and 4-Bit Field Widths}

Beyond the half-operand modifiers, extension of core SWAR
operations to 2-bit and 4-bit field widths is critical to
inductive doubling support. The principal operations
that need to be supported in this way are addition, pack,
merge, and rotate.

Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
as a modification to that for 8-bit fields by incorporating logic to
disable carry propagation at the 16 mid-field boundaries. For 2-bit
fields, disabling carry propagation at 32 additional boundaries suffices,
although it may be simpler to directly implement the simple logic
of 2-bit adders.
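
In software, the same carry-blocking idea can be sketched by masking off the
top bit of each 4-bit field, adding, and restoring the top bits with
exclusive-or. This is a sketch of the technique, not of the hardware circuit.

```python
def simd4_add(a, b, width=128):
    # Add corresponding 4-bit fields with no carry across field
    # boundaries: the low 3 bits of each field add normally, and each
    # field's top bit is completed with xor, which produces no carry out.
    top = int("8" * (width // 4), 16)   # the top bit of every 4-bit field
    low = (a & ~top) + (b & ~top)
    return (low ^ (a & top) ^ (b & top)) & ((1 << width) - 1)
```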
---|

Pack and merge require bit selection logic for each field width.
A straightforward implementation model for each operation
uses 128 2-input and gates to select the desired bits from the
operand registers and another 128 2-input or gates to feed
these results into the destination register.

Rotation for 2-bit fields involves simple logic for selecting
between the 2 bits of each field of the operand being
rotated on the basis of the low bit of each field of the
rotation count. Rotation for 4-bit fields is more complex,
but can also be based on 1-of-4 selection circuitry
involving the low 2 bits of the rotation count fields.
---|

\subsection{128-Bit Field Widths}

For completeness, the IDISA model requires core operations
to be implemented at the full register width, as well
as at power-of-2 partitions. This may be problematic for
operations such as addition due to the inherent delays
in 128-bit carry propagation. However, the primary
role of 128-bit operations in inductive doubling
is to combine two 64-bit fields using \verb#<h,l>#
operand combinations. In view of this, it may be
reasonable to define hardware support for such
combinations to be based on 64-bit logic, with support
for 128-bit logic implemented through firmware or software.
---|

\subsection{Final Notes and Further Tradeoffs}

In order to present IDISA as a concept for design extension of
any SWAR architecture, our discussion of gate-level implementation
is necessarily abstract. Additional circuitry is sure to be
required, for example, in implementation of the SIDU. However,
in the context of the 128-bit reference architectures studied,
our analysis suggests a realistic IDISA implementation well
within a 10,000 gate budget.
---|

However, the additional circuitry required may be offset
by the elimination of special-purpose instructions found in
existing processors that could instead be implemented through efficient
IDISA sequences. These include examples such as population
count, count leading and/or trailing zeroes, and parity.
They also include specialized horizontal SWAR operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special-purpose instructions
in favor of more general IDISA features.
---|

\section{Conclusions}

In considering architectural support for
SWAR text processing using the method of parallel bit streams,
this paper has presented the principle of inductive doubling
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA offers support for SWAR operations at all power-of-2
field widths, including in particular the 2-bit and 4-bit widths,
as well as half-operand modifiers and pack and merge operations
to support efficient transitions between successive power-of-2
field widths. The principal innovation is the notion of
half-operand modifiers, which eliminate the cost associated
with the explicit mask and shift operations required for
such transitions.
---|

Several algorithms key to parallel bit stream methods
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on reference architectures. This
includes both a reference architecture modeled on
the SWAR capabilities of the SSE family as well as
an architecture incorporating the powerful permute
or shuffle capabilities found in Altivec or Cell BE processors.
In the case of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with a 3X
speedup over the best known versions on permute-capable
reference architectures, achieving the lowest total number
of operations of any possible 3-register SWAR architecture.
---|

Applications of IDISA in other areas have also been
examined. The support for 2-bit and 4-bit field widths
in SWAR processing is beneficial for packed DNA representations
and packed decimal representations respectively. Additional
inductive doubling examples have been presented and the phenomenon
of power-of-2 transitions discussed more broadly.
Most significantly, IDISA supports a fully general approach
to horizontal SWAR operations, offering a considerable
improvement over the {\em ad hoc} sets of special-purpose
horizontal operations found in existing SWAR instruction sets.
---|

An IDISA implementation model has been presented employing
a customized operand fetch unit to implement the half-operand
modifier logic. Gate-level implementation of this unit
and of operations at the 2-bit and 4-bit field widths has
been analyzed and found to be quite reasonable within
a 10,000 gate budget. Design tradeoffs to reduce the cost
have also been discussed, possibly even leading to a
net complexity reduction through elimination of instructions
that implement special-case versions of inductive doubling.
---|

Future research may consider the extension of inductive doubling
support in further ways. For example, it may be possible
to develop a pipelined architecture supporting two or three
steps of inductive doubling in a single operation.

---|
%\appendix
%\section{Appendix Title}
%
%This is the text of the appendix, if you need one.

\acks

This research was supported by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada.

%\bibliographystyle{plainnat}
\bibliographystyle{plain}

---|
[257] | 1526 | \begin{thebibliography}{10} |
---|
| 1527 | |
---|
| 1528 | \bibitem{Landscape} |
---|
| 1529 | Krste Asanovic, Ras Bodik, Bryan~Christopher Catanzaro, Joseph~James Gebis, |
---|
| 1530 | Parry Husbands, Kurt Keutzer, David~A. Patterson, William~Lester Plishker, |
---|
| 1531 | John Shalf, Samuel~Webb Williams, and Katherine~A. Yelick. |
---|
| 1532 | \newblock The landscape of parallel computing research: A view from berkeley. |
---|
| 1533 | \newblock Technical Report UCB/EECS-2006-183, EECS Department, University of |
---|
| 1534 | California, Berkeley, Dec 2006. |
---|
| 1535 | |
---|
\bibitem{u8u16}
Robert~D. Cameron.
\newblock {\tt u8u16} -- a high-speed {UTF-8 to UTF-16} transcoder using
parallel bit streams.
\newblock Technical Report TR 2007-18, Simon Fraser University, Burnaby, BC,
Canada, 2007.

\bibitem{PPoPP08}
Robert~D. Cameron.
\newblock A case study in {SIMD} text processing with parallel bit streams.
\newblock In {\em {ACM Symposium on Principles and Practice of Parallel
Programming (PPoPP)}}, {Salt Lake City, Utah}, 2008.

---|
\bibitem{CASCON08}
Robert~D. Cameron, Kenneth~S. Herdy, and Dan Lin.
\newblock High performance {XML} parsing using parallel bit stream technology.
\newblock In {\em {CASCON '08: Proceedings of the 2008 conference of the Centre
for Advanced Studies on Collaborative Research}}, {Toronto, Ontario,
Canada}, 2008.

\bibitem{Green}
James~R. Green, Hanan Mahmoud, and Michel Dumontier.
\newblock Towards real time protein identification using the {Cell BE}.
\newblock In {\em Workshop on Cell BE and Heterogeneous Multicore Systems:
Architectures and Applications at CASCON '08}, {Toronto, Ontario, Canada},
2008.

---|
\bibitem{Herdy}
Kenneth~S. Herdy, David~S. Burggraf, and Robert~D. Cameron.
\newblock High performance {GML to SVG} transformation for the visual
presentation of geographic data in web-based mapping systems.
\newblock In {\em {Proceedings of SVG Open 2008}}, {Nuremberg, Germany}, 2008.

---|
\bibitem{v4pf1a}
Donald~E. Knuth.
\newblock {The Art of Computer Programming, Volume 4, Pre-Fascicle 1A: Bitwise
Tricks and Techniques}.
\newblock Draft of 22 December 2008, Stanford University.

\bibitem{Raman08}
R.~Raman and D.~S. Wise.
\newblock Converting to and from dilated integers.
\newblock {\em {IEEE Transactions on Computers}}, 57(4):567--573, 2008.

---|
\bibitem{DBLP:conf/cans/RebeiroSD06}
Chester Rebeiro, A.~David Selvakumar, and A.~S.~L. Devi.
\newblock Bitslice implementation of {AES}.
\newblock In David Pointcheval, Yi~Mu, and Kefei Chen, editors, {\em CANS},
volume 4301 of {\em Lecture Notes in Computer Science}, pages 203--212.
Springer, 2006.

---|
\bibitem{Stocco95}
Leo Stocco and G\"{u}nther Schrack.
\newblock Integer dilation and contraction for quadtrees and octrees.
\newblock In {\em {Proceedings of the IEEE Pacific Rim Conference on
Communications, Computers, and Signal Processing}}, pages 426--428,
{Victoria, B.C.}, 1995.

\bibitem{HackersDelight}
Henry~S. Warren.
\newblock {\em Hacker's Delight}.
\newblock Addison-Wesley, 2002.

\end{thebibliography}


\end{document}
---|