[227] | 1 | %----------------------------------------------------------------------------- |
---|
| 2 | % |
---|
| 3 | % asplos094-cameron.tex |
---|
| 4 | % Robert D. Cameron and Dan Lin |
---|
| 5 | % |
---|
| 6 | % Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos |
---|
| 7 | % |
---|
| 8 | %----------------------------------------------------------------------------- |
---|
| 9 | \input epsf |
---|
| 10 | |
---|
| 11 | %\documentclass[preprint,natbib,10pt]{sigplanconf} |
---|
| 12 | %\documentclass[natbib,10pt]{sigplanconf} |
---|
| 13 | \documentclass[10pt]{sigplanconf} |
---|
| 14 | |
---|
| 15 | \usepackage{amsmath} |
---|
| 16 | \usepackage{graphicx} |
---|
| 17 | |
---|
| 18 | \begin{document} |
---|
| 19 | |
---|
| 20 | \conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.} |
---|
| 21 | \CopyrightYear{2009} |
---|
| 22 | \copyrightdata{978-1-60558-215-3/09/03} |
---|
| 23 | |
---|
| 24 | |
---|
| 25 | \titlebanner{banner above paper title} % These are ignored unless |
---|
| 26 | \preprintfooter{short description of paper} % 'preprint' option specified. |
---|
| 27 | |
---|
| 28 | \title{Architectural Support for SWAR Text Processing with Parallel Bit |
---|
| 29 | Streams: The Inductive Doubling Principle} |
---|
| 30 | %\subtitle{Subtitle Text, if any} |
---|
| 31 | |
---|
| 32 | \authorinfo{Robert D. Cameron \and Dan Lin} |
---|
| 33 | {School of Computing Science, Simon Fraser University} |
---|
| 34 | {\{cameron, lindanl\}@cs.sfu.ca} |
---|
| 35 | |
---|
| 36 | |
---|
| 37 | \maketitle |
---|
| 38 | |
---|
| 39 | \begin{abstract} |
---|
| 40 | Parallel bit stream algorithms exploit the SWAR (SIMD within a |
---|
| 41 | register) capabilities of commodity |
---|
| 42 | processors in high-performance text processing applications such as |
---|
| 43 | UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression |
---|
| 44 | matching. Direct architectural support for these algorithms in future SIMD |
---|
| 45 | instruction sets could further increase performance as well as simplifying the |
---|
| 46 | programming task. A set of simple SWAR instruction set extensions are proposed |
---|
| 47 | for this purpose based on the principle of systematic support for inductive |
---|
| 48 | doubling as an algorithmic technique. These extensions are shown to |
---|
| 49 | significantly reduce instruction count in core parallel bit stream algorithms, |
---|
| 50 | often providing a 3X or better improvement. The extensions are also shown to be useful |
---|
| 51 | for SIMD programming in other application areas, including providing a |
---|
| 52 | systematic treatment for horizontal operations. An implementation model for |
---|
| 53 | these extensions |
---|
| 54 | involves relatively simple circuitry added to the operand fetch components |
---|
| 55 | in a pipelined processor. |
---|
| 56 | \end{abstract} |
---|
| 57 | |
---|
| 58 | \category{CR-number}{subcategory}{third-level} |
---|
| 59 | |
---|
| 60 | \terms |
---|
| 61 | term1, term2 |
---|
| 62 | |
---|
| 63 | \keywords |
---|
| 64 | keyword1, keyword2 |
---|
| 65 | |
---|
| 66 | |
---|
| 67 | |
---|
| 68 | \section{Introduction} |
---|
| 69 | |
---|
| 70 | In the landscape of parallel computing research, finding ways to |
---|
| 71 | exploit intrachip (multicore) and intraregister (SWAR) parallelism |
---|
| 72 | for text processing and other non-numeric applications is particularly |
---|
| 73 | challenging. Indeed, in documenting this landscape, a widely cited Berkeley |
---|
| 74 | study \cite{Landscape} identifies the finite-state machine algorithms associated |
---|
| 75 | with text processing to be the hardest of the thirteen ``dwarves'' |
---|
| 76 | to parallelize, concluding that nothing seems to help. Indeed, |
---|
| 77 | the study even speculates that applications in this area may simply be |
---|
| 78 | ``embarrasingly sequential,'' easy to tackle for traditional |
---|
| 79 | sequential processing approaches suitable for uniprocessors, |
---|
| 80 | but perhaps fundamentally unsuited to parallel methods. |
---|
| 81 | |
---|
| 82 | One approach that shows some promise, however, is the |
---|
| 83 | method of parallel bit streams, recently applied to |
---|
| 84 | UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08} |
---|
| 85 | and amino acid sequencing\cite{Green}. |
---|
| 86 | In this method, byte-oriented character data is first transposed to eight |
---|
| 87 | parallel bit streams, one for each bit position within the character code |
---|
| 88 | units (bytes). Loading bit stream data into 128-bit SIMD registers, |
---|
| 89 | then, allows data from 128 consecutive code units to be represented and |
---|
| 90 | processed at once. Bitwise logic and shift operations, bit scans, |
---|
| 91 | population counts and other bit-based operations are then used to carry |
---|
| 92 | out the work. |
---|
| 93 | |
---|
| 94 | The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X |
---|
| 95 | speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors |
---|
| 96 | employing the SSE or Altivec instruction sets. A full implementation |
---|
| 97 | for each of these platforms is documented using literate programming techniques |
---|
| 98 | and available as an open source code base \cite{u8u16}. |
---|
| 99 | Although this transcoding task represents but a single text processing kernel, |
---|
| 100 | it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more |
---|
| 101 | of processing time\cite{NicolaJohn03, Perkins05, Psaila06}. The example is also interesting in that |
---|
| 102 | it illustrates a text processing task that can largely |
---|
| 103 | be carried out using SIMD operations even though |
---|
| 104 | UTF-8 character sequences are variable length. Indeed, one weakness of the |
---|
| 105 | actual implementation is that it reverts to byte-at-a-time processing |
---|
| 106 | at block boundaries, employing a block shortening strategy to |
---|
| 107 | reduce block lengths to as low as 125 bytes from 128. |
---|
| 108 | With more capable SIMD instruction set architectures providing |
---|
| 109 | better facilities for multiregister shifts and larger register |
---|
| 110 | files, a solution employing SIMD techniques for virtually all |
---|
| 111 | of the main work would maintain better data alignment, avoid |
---|
| 112 | problems of register pressure and be easier to parallelize across |
---|
| 113 | multiple cores. It should also naturally scale to 256-bit SIMD technology |
---|
| 114 | such as the expected AVX technology of Intel. |
---|
| 115 | |
---|
| 116 | The report addressing the broader problem of XML parsing is perhaps more |
---|
| 117 | interesting, demonstrating the utility of parallel bit stream techniques |
---|
| 118 | in delivering performance benefits through a significant portion of the |
---|
| 119 | web technology stack. In an XML statistics gathering application, |
---|
| 120 | including the implementation of XML well-formedness checking, an |
---|
| 121 | overall 3X to 10X performance improvement is achieved in using |
---|
| 122 | the parallel bit stream methods in comparison with a similarly |
---|
| 123 | coded application using such well known parsers as Expat and Xerces. |
---|
| 124 | Although still a work in progress, the parser has functional |
---|
| 125 | coverage of XML and related specifications comparable to, but somewhat |
---|
| 126 | beyond Expat. The study also identifies promising further |
---|
| 127 | work in extending the parallel bit stream methods to parallel |
---|
| 128 | hash value computation and parallel regular expression matching |
---|
| 129 | for the purpose of validating XML datatype declarations in |
---|
| 130 | accord with XML Schema. |
---|
| 131 | |
---|
| 132 | Given these promising initial results in the application of |
---|
| 133 | parallel bit stream methods, what role might architectural support play in |
---|
| 134 | further enhancing this route to parallelization of text processing? |
---|
| 135 | This paper addresses this question through presentation and |
---|
| 136 | analysis of a constructive proposal: a set of SIMD instruction set |
---|
| 137 | features based on the principle of systematic support for |
---|
| 138 | inductive doubling algorithms. Inductive doubling refers |
---|
| 139 | to a general property of certain kinds of algorithm that |
---|
| 140 | systematically double the values of field widths or other |
---|
| 141 | data attributes with each iteration. In essence, the goal |
---|
| 142 | of the proposed features is to support such algorithms |
---|
| 143 | with specific facilities to transition between successive power-of-2 field |
---|
| 144 | widths. These transitions are quite frequent in several critical |
---|
| 145 | algorithms for parallel bit streams. These transitions also |
---|
| 146 | occur in other applications as well. In related work, |
---|
| 147 | efficient automatic interleaving based on power-of-2 strides has been |
---|
| 148 | reported quite useful for a number of SIMD kernels \cite{Nuzman}. |
---|
| 149 | The specific features presented herein will be referred to |
---|
| 150 | as IDISA: inductive doubling instruction set architecture. |
---|
| 151 | |
---|
| 152 | To assess the benefits of IDISA features in comparison to |
---|
| 153 | those found in existing SIMD support on commodity |
---|
| 154 | processors, we will focus on an idealized three-register |
---|
| 155 | model of SIMD instruction sets and evaluation of |
---|
| 156 | parallel bit stream and other computational kernels with |
---|
| 157 | respect to these features. The three-register model is |
---|
| 158 | based on the general approach to binary SIMD operations |
---|
| 159 | as ones that operate on the contents of two source operand |
---|
| 160 | registers to produce a value to be written back to a |
---|
| 161 | single destination operand register. This three-register |
---|
| 162 | model is directly used by the Altivec SIMD instructions |
---|
| 163 | of the Power PC, for example. On the Intel SSE platform, |
---|
| 164 | the three-register model is used as a programmer's |
---|
| 165 | interface for the C-language intrinsics, while the |
---|
| 166 | underlying instructions use a two-register model with |
---|
| 167 | destructive updating. The computational kernels we study consist |
---|
| 168 | of 100\% branch-free code operating on registers, without |
---|
| 169 | memory access For such kernels, we use straight-line |
---|
| 170 | instruction count as the performance metric of interest, |
---|
| 171 | assuming that pipelined processors are well-designed |
---|
| 172 | to handle latencies. |
---|
| 173 | |
---|
| 174 | |
---|
| 175 | The remainder of this paper is organized as follows. |
---|
| 176 | |
---|
| 177 | The second section of this paper introduces IDISA and the |
---|
| 178 | SIMD notation used throughout this paper. A brief comparison of |
---|
| 179 | IDISA features with existing SIMD instruction |
---|
| 180 | sets of commodity processors such as the Intel SSE |
---|
| 181 | instruction set and the Power PC Altivec instruction set |
---|
| 182 | is also made. |
---|
| 183 | |
---|
| 184 | The third section provides a short first example of |
---|
| 185 | the inductive doubling principle in action through |
---|
| 186 | the case of population count. Although this operation |
---|
| 187 | is not a strong determinant of performance for parallel bit |
---|
| 188 | stream applications, it is nevertheless an operation needed |
---|
| 189 | frequently enough in the general computing milieux to find |
---|
| 190 | its way into some instruction set architectures, typically |
---|
| 191 | at one particular field width. By way of comparison, the |
---|
| 192 | inductive doubling architecture sacrifices some |
---|
| 193 | performance at the chosen field width, while offering a more |
---|
| 194 | general solution with frequently better performance at |
---|
| 195 | other field widths. |
---|
| 196 | |
---|
| 197 | The fourth section then moves on to consider the performance-critical |
---|
| 198 | and key task of conversion between serial byte streams and parallel |
---|
| 199 | bit streams. A first algorithm that uses the existing SIMD |
---|
| 200 | operations common to SSE and Altivec is shown, requiring 72 |
---|
| 201 | operations to transform 128 bytes of data using the three-register |
---|
| 202 | instruction form. We then move on to consider how the task may |
---|
| 203 | be simplified using IDISA to |
---|
| 204 | require a mere 24 operations. As well as providing a 3X speed-up, |
---|
| 205 | it is also argued that the version using the inductive doubling |
---|
| 206 | architecture is considerably simpler and easier to program. |
---|
| 207 | |
---|
| 208 | The fifth section then briefly considers the inverse transposition |
---|
| 209 | process, converting parallel bit stream data back into byte |
---|
| 210 | streams. Again, an algorithm to carry out this task requires |
---|
| 211 | 72 straight-line SIMD operations in the Altivec three-register |
---|
| 212 | form, but is reduced to a simpler 24 operations using IDISA. |
---|
| 213 | |
---|
| 214 | The sixth section then goes on to consider the problem of |
---|
| 215 | parallel bit deletion. This operation is performance-critical |
---|
| 216 | to any applications that require filtering or |
---|
| 217 | editing operations on strings using the parallel bit stream |
---|
| 218 | algorithms. For example, it is fundamental to the |
---|
| 219 | high-speed UTF-8 to UTF-16 transcoding algorithm that is |
---|
| 220 | often a critical component in XML parsing. In this |
---|
| 221 | section, an inductive doubling algorithm based on the |
---|
| 222 | concept of a central deletion result is described and |
---|
| 223 | shown to have much better performance than a parallel-prefix |
---|
| 224 | alternative. |
---|
| 225 | |
---|
| 226 | The seventh section considers the issue of |
---|
| 227 | horizontal SIMD operations, that is, operations for combining |
---|
| 228 | sets of adjacent fields within individual SIMD registers rather than |
---|
| 229 | corresponding fields within sets of registers. While existing |
---|
| 230 | SIMD instruction set architectures tend to only support a few |
---|
| 231 | ad hoc horizontal combinations, IDISA is shown to provide a systematic |
---|
| 232 | means for efficient horizontal combinations of any kind. |
---|
| 233 | |
---|
| 234 | An implementation model for IDISA is then considered |
---|
| 235 | in section 8 of the paper, focusing on a pipelined |
---|
| 236 | SIMD architecture featuring a modified operand fetch stage. |
---|
| 237 | A gate-count analysis of one feasible implementation is |
---|
| 238 | provided as well as a discussion of the implementation |
---|
| 239 | of required extensions to handle 2-bit and 4-bit fields not |
---|
| 240 | commonly supported on existing SIMD architectures. Design |
---|
| 241 | tradeoffs are also considered focusing the potential removal of |
---|
| 242 | various {\em ad hoc} instructions on existing processors in favor of |
---|
| 243 | more general alternatives provided through IDISA. |
---|
| 244 | |
---|
| 245 | The ninth section concludes the paper with a summary of results |
---|
| 246 | and discussion of areas for future work. |
---|
| 247 | |
---|
| 248 | |
---|
| 249 | \section{Inductive Doubling Architecture} |
---|
| 250 | |
---|
| 251 | This section presents an idealized model for a single-instruction |
---|
| 252 | multiple-data (SIMD) instruction set architecture designed |
---|
| 253 | specifically to support inductive doubling algorithms in the |
---|
| 254 | domain of parallel bit stream programming. The architecture is idealized |
---|
| 255 | in the sense that we concentrate on only the necessary features |
---|
| 256 | for our purpose, without enumerating the additional |
---|
| 257 | operations that would be required for |
---|
| 258 | SIMD applications in other domains. The goal is to focus |
---|
| 259 | on the principles of inductive doubling support in a way |
---|
| 260 | that can accommodate a variety of realizations as other |
---|
| 261 | design constraints are brought to bear on the overall instruction set |
---|
| 262 | design. First we introduce a simple model and notation for |
---|
| 263 | SIMD operations in general and then present the four key |
---|
| 264 | features of an idealized architecture in support of parallel |
---|
| 265 | bit stream programming. |
---|
| 266 | |
---|
| 267 | The idealized architecture supports typical SIMD integer |
---|
| 268 | operations common to existing commodity architectures such as SSE |
---|
| 269 | and Altivec. The architecture is defined to support SIMD |
---|
| 270 | operations on registers of size $N=2^K$ bits, for some integer $K$. |
---|
| 271 | Typical values of $K$ for commodity processors include $K=6$ for |
---|
| 272 | the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for |
---|
| 273 | the 128-bit registers of SSE and Altivec technology and $K=8$ for |
---|
| 274 | the upcoming Intel AVX technology. The registers may be |
---|
| 275 | partitioned into $N/n$ fields of width $n=2^k$ bits for some values |
---|
| 276 | of $k \leq K$. Typical values of $k$ used on commodity processors |
---|
| 277 | include $k = 3$ for SIMD operations on 8-bit fields (bytes), |
---|
| 278 | $k = 4$ for operations on 16-bit fields and $k = 5$ for operations |
---|
| 279 | on 32-bit fields. Whenever a register $r$ is partitioned into $n$-bit |
---|
| 280 | fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$. |
---|
| 281 | Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of |
---|
| 282 | register $r$, using big-endian numbering. |
---|
| 283 | |
---|
| 284 | Let \verb:simd<n>: represent the class of SIMD operations defined |
---|
| 285 | on fields of size $n$ using C++ template syntax. Given a |
---|
| 286 | binary function $F_n$ on $n$-bit fields, we denote the SIMD |
---|
| 287 | version of this operation as \verb#simd<n>::F#. Given two |
---|
| 288 | SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$, |
---|
| 289 | respectively, the operation \verb#r=simd<n>::F(a,b)# stores |
---|
| 290 | the value $r$ in the register \verb:r: as determined by |
---|
| 291 | the simultaneous calculation of individual field values in |
---|
| 292 | accord with the following equation. |
---|
| 293 | \[ r_i = F_n(a_i, b_i) \] |
---|
| 294 | |
---|
| 295 | For example, addition(\verb:add:), subtraction (\verb:sub:) and shift left |
---|
| 296 | logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned |
---|
| 297 | integers as follows. |
---|
| 298 | %\singlespace |
---|
| 299 | \begin{eqnarray} |
---|
| 300 | \mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\ |
---|
| 301 | \mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\ |
---|
| 302 | \mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n |
---|
| 303 | \end{eqnarray} |
---|
| 304 | %\doublespace |
---|
| 305 | The SSE and Altivec instruction sets support |
---|
| 306 | each of the addition and subtraction operations in SIMD form |
---|
| 307 | for 8, 16 and 32-bit fields, while the SSE set also includes |
---|
| 308 | the 64-bit forms. For example, \verb#simd<8>::add# in our |
---|
| 309 | nomenclature is provided by the operation \verb:paddb: |
---|
| 310 | on SSE and the operation \verb:vaddubm: on Altivec. |
---|
| 311 | The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and |
---|
| 312 | \verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are |
---|
| 313 | more constrained, requiring that all field shifts by the same amount. |
---|
| 314 | |
---|
| 315 | Given these definitions and notation, we now present |
---|
| 316 | the four key elements of an inductive doubling architecture. |
---|
| 317 | The first is a definition of a core set of binary functions |
---|
| 318 | on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$. |
---|
| 319 | The second is a set of {\em half-operand modifiers} that allow |
---|
| 320 | the inductive processing of fields of size $2n$ in terms of |
---|
| 321 | combinations of $n$-bit values selected from the fields. |
---|
| 322 | The third is the definition of packing operations that compress |
---|
| 323 | two consecutive registers of $n$-bit values into a single |
---|
| 324 | register of $n/2$-bit values. The fourth is the definition |
---|
| 325 | of merging operations that produce a set of $2n$ bit fields |
---|
| 326 | by concatenating corresponding $n$-bit fields from two |
---|
| 327 | parallel registers. Each of these features is described below. |
---|
| 328 | |
---|
| 329 | For the purpose of direct and efficient support for inductive |
---|
| 330 | doubling algorithms on bit streams, the provision of |
---|
| 331 | a core set of operations at field widths of 2 and 4 as |
---|
| 332 | well as the more traditional field witdhs of 8, 16 and 32 |
---|
| 333 | is critical for elegant and efficient expression of the |
---|
| 334 | algorithms. In essence, inductive doubling algorithms |
---|
| 335 | work by establishing some base property at either single |
---|
| 336 | or 2-bit fields. Each iteration of the algorithm then |
---|
| 337 | goes on to establish the property for the power-of-2 |
---|
| 338 | field width. In order for this inductive step to be |
---|
| 339 | most conveniently and efficiently expressed, the |
---|
| 340 | core operations needed for the step should be available |
---|
| 341 | at each field width. In the case of work with parallel |
---|
| 342 | bit streams, the operations \verb:add:, \verb:sub:, |
---|
| 343 | \verb:sll:, \verb:srl: (shift right logical), and \verb:rotl: |
---|
| 344 | (rotate left) comprise the core. In other domains, |
---|
| 345 | additional operations may be usefully included in the |
---|
| 346 | core depending on the work that needs to be performed |
---|
| 347 | at each inductive doubling level. |
---|
| 348 | |
---|
| 349 | Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$ |
---|
| 350 | also includes fields of width 1. These are included for |
---|
| 351 | logical consistency, but are easily implemented by mapping |
---|
| 352 | directly to appropriate bitwise logic operations, which |
---|
| 353 | we assume are also available. For example, |
---|
| 354 | \verb#simd<1>::add# is equivalent to \verb:simd_xor:, the |
---|
| 355 | bitwise exclusive-or operation on SIMD registers. |
---|
| 356 | |
---|
| 357 | The second key facility of the inductive doubling architecture |
---|
| 358 | is the potential application of half-operand modifiers to |
---|
| 359 | the fields of either or both of the operands of a SIMD |
---|
| 360 | operation. These modifiers select either the |
---|
| 361 | low $n/2$ |
---|
| 362 | bits of each $n$-bit field (modifier ``\verb:l:'') or the |
---|
| 363 | high $n/2$ bits (modifier ``\verb:h:''). When required, |
---|
| 364 | the modifier ``\verb:x:'' means that the full $n$ bits |
---|
| 365 | should be used, unmodified. The semantics of these |
---|
| 366 | modifiers are given by the following equations. |
---|
| 367 | %\singlespace |
---|
| 368 | \begin{eqnarray} |
---|
| 369 | l(r_n) & = & r_n \bmod 2^{n/2} \\ |
---|
| 370 | h(r_n) & = & r_n / 2^{n/2} \\ |
---|
| 371 | x(r_n) & = & r_n |
---|
| 372 | \end{eqnarray} |
---|
| 373 | %\doublespace |
---|
| 374 | In our notation, the half-operand modifiers are |
---|
| 375 | specified as optional template (compile-time) parameters |
---|
| 376 | for each of the binary functions. Thus, |
---|
| 377 | \verb#simd<4>::add<l,h>(a,b)# is an operation which adds |
---|
| 378 | the 2-bit quantity found in the low 2-bits of each 4-bit field |
---|
| 379 | of its first operand (\verb:a:) |
---|
| 380 | together with the corresponding 2-bit quantity found in the |
---|
| 381 | high 2-bits of its second operand (\verb:b:). |
---|
| 382 | In general, the purpose of the half-operand modifiers |
---|
| 383 | in support of inductive doubling is to allow the processing |
---|
| 384 | of $n$-bit fields to easily expressed in terms of |
---|
| 385 | combination of the results determined by processing |
---|
| 386 | $n/2$ bit fields. |
---|
| 387 | |
---|
| 388 | The third facility of the inductive doubling architecture |
---|
| 389 | is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$. |
---|
| 390 | The field values of \verb#r=simd<n>::pack(a,b)# are |
---|
| 391 | defined by the following equations. |
---|
| 392 | %\singlespace |
---|
| 393 | \begin{eqnarray} |
---|
| 394 | r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\ |
---|
| 395 | r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n |
---|
| 396 | \end{eqnarray} |
---|
| 397 | %\doublespace |
---|
| 398 | Here conv is a function which performs conversion of an $n$-bit |
---|
| 399 | value to an $n/2$ bit value by signed saturation (although |
---|
| 400 | conversion by unsigned saturation would also suit our purpose). |
---|
| 401 | |
---|
| 402 | Half-operand modifiers may also be used with the pack |
---|
| 403 | operations. Thus packing with conversion by masking off all |
---|
| 404 | but the low $n/2$ bits of each field may be |
---|
| 405 | be performed using the operation \verb#simd<n>::pack<l,l># |
---|
| 406 | |
---|
| 407 | |
---|
| 408 | The final facility of the inductive doubling architecture is |
---|
| 409 | a set of merging operations \verb#simd<n>::mergeh# and |
---|
| 410 | \verb#simd<n>::mergel# that produce $2n$-bit fields |
---|
| 411 | by concatenating corresponding $n$-bit fields from the |
---|
| 412 | operand registers. The respective |
---|
| 413 | operations \verb#r=simd<n>::mergeh(a,b)# and |
---|
| 414 | \verb#s=simd<n>::mergel(a,b)# |
---|
| 415 | are defined by the following equations. |
---|
| 416 | %\singlespace |
---|
| 417 | \begin{eqnarray} |
---|
| 418 | r_{2n}[i] & = & a[i] \times 2^n + b[i] \\ |
---|
| 419 | s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)] |
---|
| 420 | \end{eqnarray} |
---|
| 421 | %\doublespace |
---|
| 422 | Both SSE and Altivec provide versions of pack and merge operations |
---|
| 423 | for certain field widths. The pack operations are provided |
---|
| 424 | with operands having 16-bit or 32-bit fields on each platform, although |
---|
| 425 | with some variation in how conversion is carried out. |
---|
| 426 | The merge operations are provided at 8-bit, 16-bit and 32-bit |
---|
| 427 | field widths on both architectures and also at the 64-bit level |
---|
| 428 | on SSE. |
---|
| 429 | |
---|
| 430 | This completes the description of the proposed inductive |
---|
| 431 | doubling architecture. As described, many of the features |
---|
| 432 | are already available with the SIMD facilities of |
---|
| 433 | existing commodity processors. The extensions enumerated |
---|
| 434 | here are relatively straightforward. The innovation |
---|
| 435 | is to specifically tackle the design of facilities to |
---|
| 436 | offer systematic support for transitions between power-of-2 field widths. |
---|
| 437 | As we shall show in the remainder of this paper, these facilities |
---|
| 438 | can dramatically reduce instruction count in core parallel bit |
---|
| 439 | stream algorithms, with a factor of 3 reduction being typical. |
---|
| 440 | The next section goes on to illustrate a first simple example |
---|
| 441 | for the task of population count. |
---|
| 442 | |
---|
| 443 | \section{Population Count} |
---|
| 444 | |
---|
| 445 | \begin{figure} |
---|
| 446 | \begin{center}\small |
---|
| 447 | \begin{verbatim} |
---|
| 448 | c = (x & 0x55555555) + ((x >> 1) & 0x55555555); |
---|
| 449 | c = (c & 0x33333333) + ((c >> 2) & 0x33333333); |
---|
| 450 | c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F); |
---|
| 451 | c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF); |
---|
| 452 | c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF); |
---|
| 453 | \end{verbatim} |
---|
| 454 | \end{center} |
---|
| 455 | \caption{Population Count Reference Algorithm} |
---|
| 456 | \label{HD-pop} |
---|
| 457 | \end{figure} |
---|
| 458 | |
---|
| 459 | \begin{figure} |
---|
| 460 | \begin{center}\small |
---|
| 461 | \begin{verbatim} |
---|
| 462 | c = simd<2>::add<l,h>(x, x); |
---|
| 463 | c = simd<4>::add<l,h>(c, c); |
---|
| 464 | c = simd<8>::add<l,h>(c, c); |
---|
| 465 | c = simd<16>::add<l,h>(c, c); |
---|
| 466 | c = simd<32>::add<l,h>(c, c); |
---|
| 467 | \end{verbatim} |
---|
| 468 | \end{center} |
---|
| 469 | \caption{Ideal SIMD Population Count} |
---|
| 470 | \label{inductivepopcount} |
---|
| 471 | \end{figure} |
---|
| 472 | |
---|
| 473 | As an initial example to illustrate the principle of inductive doubling |
---|
| 474 | in practice, consider the problem of {\em population count}: determining |
---|
| 475 | the number of one bits within a particular bit field. It is important |
---|
| 476 | enough for such operations as |
---|
| 477 | calculating Hamming distance to be included as a built-in instruction |
---|
| 478 | on some processors. For example, the SPU of the Cell Broadband Engine |
---|
| 479 | has a SIMD population count instruction \verb:si_cntb: for simultaneously |
---|
| 480 | determining the |
---|
| 481 | number of 1 bits within each byte of a 16-byte register. |
---|
| 482 | In text processing with parallel bit streams, population count has direct |
---|
| 483 | application to keeping track of line numbers for error reporting, for example. |
---|
| 484 | Given a bit block identifying the positions of newline characters within |
---|
| 485 | a block of characters being processed, the population count of the |
---|
| 486 | bit block can be used to efficiently and conveniently be used to update |
---|
| 487 | the line number upon completion of block processing. |
---|
| 488 | |
---|
| 489 | Figure \ref{HD-pop} presents a traditional divide-and-conquer |
---|
| 490 | implementation for a 32-bit integer {\tt x} adapted from |
---|
| 491 | Warren \cite{HackersDelight}, while |
---|
[230] | 492 | Figure \ref{inductivepopcount} shows the corresponding IDISA |
---|
| 493 | implementation for a vector of 32-bit fields. Each implementation employs |
---|
[227] | 494 | five steps of inductive doubling to produce population counts |
---|
| 495 | within 32 bit fields. The traditional implementation employs |
---|
| 496 | explicit masking and shifting operations, while these |
---|
| 497 | operations are implicit within the semantics of the inductive |
---|
[230] | 498 | doubling instructions shown in Figure \ref{inductivepopcount}. |
---|
[227] | 499 | In each implementation, the first step determines the |
---|
| 500 | the population counts within 2-bit fields |
---|
| 501 | by adding the high bit of each such field to the low bit |
---|
| 502 | to produce a set of 2-bit counts in {\tt c}. |
---|
| 503 | In the second step, the counts within 4-bit fields of {\tt c} are determined |
---|
| 504 | by adding the counts of the corresponding high and low 2-bit subfields. |
---|
| 505 | Continuing in this fashion, |
---|
| 506 | the final population counts within 32-bit fields are determined in five steps. |
---|
| 507 | |
---|
| 508 | With the substitution of longer mask constants replicated for four |
---|
| 509 | 32-bit fields, the implementation of Figure \ref{HD-pop} can be |
---|
[230] | 510 | directly adapted to SWAR processing using 128-bit registers. |
---|
| 511 | Each binary operator is replaced by a corresponding binary |
---|
| 512 | SIMD operation. Without the IDISA features, a |
---|
| 513 | straightforward SWAR implementation of population count for |
---|
| 514 | 32-bit fields thus employs 10 operations to load or generate |
---|
[227] | 515 | mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a |
---|
[230] | 516 | total of 30 operations to complete the task. Employing |
---|
| 517 | optimization identified by Warren, this can be reduced to |
---|
| 518 | 20 operations, 5 of which are required to generate mask constants. |
---|
| 519 | At the cost of register pressure, it is possible that these constants |
---|
| 520 | could be kept preloaded in long vector processing. |
---|
| 521 | In any case, the IDISA implementation requires only 5 operations |
---|
| 522 | to carry out the work requiring 15 to 20 operations with the |
---|
| 523 | optimized version using standard SWAR operations. IDISA thus offers |
---|
[227] | 524 | offers a 3X to 4X improvement in instruction count as well as |
---|
| 525 | a reduction in register pressure. |
---|
| 526 | |
---|
[230] | 527 | The pattern illustrated by population count is typical. An |
---|
| 528 | inductive doubling algorithm of $n$ steps typically applies |
---|
| 529 | two mask or shift operations at each step for each of the |
---|
| 530 | two operands being combined in the step. If the combination |
---|
| 531 | can be implemented in a single SWAR operation, then a total |
---|
| 532 | of $3n$ operations are the minimum required for the SWAR |
---|
| 533 | algorithm with IDISA features. An IDISA implementation |
---|
| 534 | typically eliminates the explicit mask and shift operations |
---|
| 535 | through appropriate half-operand modifiers, reducing the |
---|
| 536 | total instruction count to $n$. Thus a 3X improvement |
---|
| 537 | is typical. |
---|
| 538 | |
---|
[227] | 539 | \section{Transposition to Parallel Bit Streams} |
---|
| 540 | |
---|
| 541 | In this section, we consider the first major |
---|
| 542 | application of the inductive doubling architecture: |
---|
| 543 | transposition of byte stream data to parallel bit stream |
---|
| 544 | form. Of course, this operation is critical to the |
---|
| 545 | method of parallel bit streams and all applications |
---|
| 546 | of the method can benefit from a highly efficient |
---|
| 547 | transposition process. Before considering how |
---|
| 548 | the inductive doubling architecture supports this |
---|
| 549 | transposition process, however, we first consider |
---|
| 550 | algorithms on existing architectures. Two algorithms |
---|
| 551 | are presented; the best of these requires 72 |
---|
| 552 | SIMD operations in the three-register model to perform |
---|
| 553 | transposition of eight serial registers of byte stream data into |
---|
| 554 | eight parallel registers of bit stream data. |
---|
| 555 | |
---|
| 556 | We then go on to show how the transposition problem |
---|
| 557 | can be solved using the inductive doubling architecture |
---|
| 558 | with a mere 24 three-register SIMD operations. We also prove |
---|
| 559 | that this is optimal for any three-register instruction set model. |
---|
| 560 | |
---|
| 561 | |
---|
| 562 | \begin{figure}[tbh] |
---|
| 563 | \begin{center} |
---|
| 564 | \includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf} |
---|
| 565 | \caption{Input/Output Model for Serial to Parallel Transposition} |
---|
| 566 | \label{s2p-spec} |
---|
| 567 | \end{center} |
---|
| 568 | |
---|
| 569 | \end{figure} |
---|
| 570 | Figure \ref{s2p-spec} illustrates the input-output requirements of |
---|
| 571 | the transposition problem. We assume that inputs and |
---|
| 572 | outputs are each SIMD registers of size $N=2^K$ bits. |
---|
| 573 | The input consists of $N$ bytes of serial byte data, |
---|
| 574 | stored consecutively in eight SIMD registers each holding |
---|
| 575 | $N/8$ bytes. The output consists of eight parallel |
---|
| 576 | registers, one each for the eight individual bit positions |
---|
| 577 | within a byte. Upon completion of the transposition process, |
---|
| 578 | each output register is to hold the $N$ bits corresponding |
---|
| 579 | to the selected bit position in the sequence of $N$ input |
---|
| 580 | bytes. |
---|
| 581 | |
---|
| 582 | \subsection{Bit Gathering Algorithm} |
---|
| 583 | |
---|
| 584 | \begin{figure}[tbh] |
---|
| 585 | \begin{center} |
---|
[229] | 586 | \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf} |
---|
[227] | 587 | \caption{Serial to Parallel Transposition Using Bit-Gathering} |
---|
| 588 | \label{gather} |
---|
| 589 | \end{center} |
---|
| 590 | \end{figure} |
---|
| 591 | One straightforward algorithm for implementing the transposition process |
---|
| 592 | takes advantage of SIMD bit gathering operations that exist |
---|
| 593 | on some architectures. This operation gathers one bit per byte |
---|
| 594 | from a particular position within each byte of a SIMD register. |
---|
| 595 | For example, the {\tt pmovmskb} operation of the Intel MMX and |
---|
| 596 | SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask |
---|
| 597 | consisting of the high bit of each byte. Similarly, the |
---|
| 598 | {\tt \verb:si_gbb:} operation of the synergistic processing units of the |
---|
| 599 | Cell Broadband Engine gathers together the low bit of each byte |
---|
| 600 | from the SIMD register. Figure \ref{gather} illustrates the |
---|
| 601 | bit gathering process. |
---|
| 602 | |
---|
| 603 | For each bit stream of output, the bit gather algorithm requires |
---|
| 604 | one gather operation for each of the 8 input registers, |
---|
| 605 | giving 64 bit gather operations in all. In addition, for seven |
---|
| 606 | of the eight bit positions, it is necessary to shift the bits |
---|
| 607 | of each input register into the conventional position for |
---|
| 608 | gathering. A total of 56 shift operations are required for this |
---|
| 609 | task. Finally, the result of each bit gather operation must |
---|
| 610 | be properly inserted into the output stream. If the first |
---|
| 611 | gather operation for a stream can be used to also initialize |
---|
| 612 | the output register, there will then need to be 7 insert |
---|
| 613 | operations for the results of the remaining gather operations |
---|
| 614 | for this stream, making 56 insert operations in all. |
---|
| 615 | The insert step may be more complex than a single operation |
---|
| 616 | in some cases, but we use one operation per insert as a lower bound. |
---|
| 617 | Thus, the bit gather algorithm requires |
---|
| 618 | at least 176 operations to perform the transposition task. |
---|
| 619 | |
---|
| 620 | \subsection{BytePack Algorithm} |
---|
| 621 | |
---|
| 622 | A much more efficient transposition algorithm on commodity |
---|
| 623 | SIMD architectures (SSE and Altivec) involves three |
---|
| 624 | stages of binary division transformation. This is similar |
---|
| 625 | to the three stage bit matrix inversion described by |
---|
| 626 | Warren \cite{HackersDelight}, although modified to use SIMD operations. |
---|
| 627 | In each stage, input streams are divided into two half-length output streams. |
---|
| 628 | The first stage separates the bits at even numbered positions from those |
---|
| 629 | at odd number positions. The two output streams from the first |
---|
| 630 | stage are then further divided in the second stage. |
---|
| 631 | The stream comprising even numbered bits from the original byte stream |
---|
| 632 | divides into one stream consisting of bits from positions 0 and 4 of each |
---|
| 633 | byte in the original stream and a second stream consisting of bits |
---|
| 634 | from positions 2 and 6 of each original byte. The stream of bits from |
---|
| 635 | odd positions is similarly divided into streams for bits from Each of the |
---|
| 636 | positions 1 and 5 and bits from positions 2 and 6. |
---|
| 637 | Finally, each of the four streams resulting from the second stage are |
---|
| 638 | divided into the desired individual bit streams in the third stage. |
---|
| 639 | |
---|
| 640 | \begin{figure}[tbh] |
---|
| 641 | \begin{center} |
---|
| 642 | \begin{verbatim} |
---|
| 643 | t0 = simd<16>::pack<h,h>(s0, s1); |
---|
| 644 | t1 = simd<16>::pack<l,l>(s0, s1); |
---|
| 645 | p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1)); |
---|
| 646 | p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1); |
---|
| 647 | \end{verbatim} |
---|
| 648 | \end{center} |
---|
| 649 | \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm} |
---|
| 650 | \label{s2pstep} |
---|
| 651 | \end{figure} |
---|
| 652 | |
---|
| 653 | The binary division transformations are accomplished in each stage |
---|
| 654 | using byte packing, shifting and masking. In each stage, a |
---|
| 655 | transposition step combines each pair of serial input registers to |
---|
| 656 | produce a pair of parallel output registers. In essence, |
---|
| 657 | doublebytes from the input registers are packed into bytes |
---|
| 658 | in the output registers, with the bits from even positions stored |
---|
| 659 | in the bytes of one output stream and the bits from odd positions |
---|
| 660 | stored in the bytes of the second output stream. |
---|
| 661 | Figure \ref{s2pstep} shows a step in stage 1 of the algorithm |
---|
| 662 | producing two parallel registers \verb:p0: and \verb:p1: from |
---|
| 663 | two serial registers \verb:s0: and \verb:s1:. This step is applied |
---|
| 664 | four times in stage 1; stages 2 and 3 also consist of four applications |
---|
| 665 | of a similar step with different shift and masking constants. |
---|
| 666 | |
---|
| 667 | Although we have used the idealized SIMD notation here, each of the |
---|
| 668 | operations maps to a single operation in the Altivec set and a small number |
---|
| 669 | of operations in the SSE set. Using the Altivec set, there are |
---|
| 670 | 6 operations per step for a total of 24 operations per stage. |
---|
| 671 | The three stages combined required 72 operations to transpose 128 bytes |
---|
| 672 | to parallel bit stream form. This is the best algorithm known to |
---|
| 673 | us for existing SIMD architectures. |
---|
| 674 | |
---|
| 675 | \subsection{Inductive Halving Algorithm} |
---|
| 676 | |
---|
| 677 | Using the inductive doubling architecture, it is possible to design |
---|
| 678 | a transposition algorithm that is both easier to understand and requires |
---|
| 679 | many fewer operations than the the bytepack algorithm described above. |
---|
| 680 | We call it the inductive halving algorithm for serial to parallel |
---|
| 681 | transposition, because it proceeds by reducing byte streams to |
---|
| 682 | two sets of nybble streams in a first stage, dividing the nybble |
---|
| 683 | streams into streams of bitpairs in a second stage and finally |
---|
| 684 | dividing the bitpair streams into bit streams in the third stage. |
---|
| 685 | |
---|
| 686 | |
---|
| 687 | \begin{figure}[tbh] |
---|
| 688 | \begin{verbatim} |
---|
| 689 | p0 = simd<8>::pack<h,h>(s0, s1); |
---|
| 690 | p1 = simd<8>::pack<l,l>(s0, s1); |
---|
| 691 | \end{verbatim} |
---|
| 692 | \caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm} |
---|
| 693 | \label{halvingstep} |
---|
| 694 | \end{figure} |
---|
| 695 | |
---|
| 696 | Figure \ref{halvingstep} shows one step in stage 1 of the inductive |
---|
| 697 | halving algorithm, comprising just two SIMD operations. |
---|
| 698 | The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte |
---|
| 699 | from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts |
---|
| 700 | the low nybble of each byte. As in the bytepack algorithm, this step is |
---|
| 701 | applied 4 times in stage 1, for a total of 8 operations. |
---|
| 702 | |
---|
| 703 | Stage 2 of the inductive halving algorithm reduces nybble streams |
---|
| 704 | to streams of bit pairs. The basic step in this algorithm consists |
---|
| 705 | of one \verb#simd<4>::pack<h,h># operation to extract the high pair |
---|
| 706 | of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the |
---|
| 707 | low pair of each nybble. Four applications of this step complete stage 2. |
---|
| 708 | |
---|
| 709 | Stage 3 similarly uses four applications of a step that uses a |
---|
| 710 | \verb#simd<2>::pack<h,h># operation to extract the high bit of |
---|
| 711 | each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of |
---|
| 712 | each pair. The complete algorithm to transform eight serial |
---|
| 713 | byte registers s0 through s7 into the eight parallel bit stream |
---|
| 714 | registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}. |
---|
| 715 | |
---|
| 716 | \begin{figure}[tbh] |
---|
| 717 | \begin{verbatim} |
---|
| 718 | hi_nybble0 = simd<8>::pack<h,h>(s0, s1); |
---|
| 719 | lo_nybble0 = simd<8>::pack<l,l>(s0, s1); |
---|
| 720 | hi_nybble1 = simd<8>::pack<h,h>(s2, s3); |
---|
| 721 | lo_nybble1 = simd<8>::pack<l,l>(s2, s3); |
---|
| 722 | hi_nybble2 = simd<8>::pack<h,h>(s4, s5); |
---|
| 723 | lo_nybble2 = simd<8>::pack<l,l>(s4, s5); |
---|
| 724 | hi_nybble3 = simd<8>::pack<h,h>(s6, s7); |
---|
| 725 | lo_nybble3 = simd<8>::pack<l,l>(s6, s7); |
---|
| 726 | hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1); |
---|
| 727 | hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1); |
---|
| 728 | lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1); |
---|
| 729 | ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1); |
---|
| 730 | hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3); |
---|
| 731 | hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3); |
---|
| 732 | lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3); |
---|
| 733 | ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3); |
---|
| 734 | bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1); |
---|
| 735 | bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1); |
---|
| 736 | bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1); |
---|
| 737 | bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1); |
---|
| 738 | bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1); |
---|
| 739 | bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1); |
---|
| 740 | bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1); |
---|
| 741 | bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1); |
---|
| 742 | \end{verbatim} |
---|
| 743 | \caption{Complete Inductive Halving Algorithm} |
---|
| 744 | \label{halvingalgorithm} |
---|
| 745 | \end{figure} |
---|
| 746 | |
---|
| 747 | \subsection{Optimality of the Inductive Halving Algorithm} |
---|
| 748 | |
---|
| 749 | Here we show that the inductive halving algorithm presented in |
---|
| 750 | the previous subsection is optimal in the following sense: |
---|
| 751 | no other algorithm on any 3-register SIMD architecture can use |
---|
| 752 | fewer than 24 operations to transform eight serial registers |
---|
| 753 | of byte stream data into eight parallel registers of bit stream data. |
---|
| 754 | By 3-register SIMD architecture, we refer to any architecture |
---|
| 755 | that uses SIMD instructions consistent with our overall model of |
---|
| 756 | binary operations using two input register operands to produce |
---|
| 757 | one output register value. |
---|
| 758 | |
---|
| 759 | Observe that the $N$ data bits from each input register must be |
---|
| 760 | distributed $N/8$ each to the 8 output registers by virtue of |
---|
| 761 | the problem definition. Each output register can effectively |
---|
| 762 | be given a 3-bit address; the partitioning problem can be viewed |
---|
| 763 | as moving data to the correct address. However, each |
---|
| 764 | operation can move results into at most one register. |
---|
| 765 | At most this can result in the assignment of one correct address |
---|
| 766 | bit for each of the $N$ input bits. As all $8N$ input bits |
---|
| 767 | need to be moved to a register with a correct 3-bit address, |
---|
| 768 | a minimum of 24 operations is required. |
---|
| 769 | |
---|
| 770 | \section{Parallel to Serial Conversion} |
---|
| 771 | |
---|
| 772 | Parallel bit stream applications may apply string editing |
---|
| 773 | operations in bit space to substitute, delete or insert |
---|
| 774 | parallel sets of bits at particular positions. In such cases, |
---|
| 775 | the inverse transform that converts a set of parallel bit |
---|
| 776 | streams back into byte space is needed. In the example of |
---|
| 777 | UTF-8 to UTF-16 transcoding, the inverse transform is |
---|
| 778 | actually used twice for each application of the forward |
---|
| 779 | transform, to separately compute the high and low byte |
---|
| 780 | streams of each UTF-16 code unit. Those two byte streams |
---|
| 781 | are subsequentely merged to form the final result. |
---|
| 782 | |
---|
| 783 | Algorithms for performing the inverse transform mirror those |
---|
| 784 | of the forward transform, employing SIMD merge operations |
---|
| 785 | in place of pack operations. The best algorithm known |
---|
| 786 | to us on the commodity SIMD architectures takes advantage |
---|
| 787 | of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel# |
---|
| 788 | operations that are available with each of the SSE and Altivec instruction |
---|
| 789 | sets. These algorithms take 72 operations to perform the |
---|
| 790 | inverse transposition of 8 parallel registers of bit stream |
---|
| 791 | data into 8 serial registers of byte stream data. |
---|
| 792 | |
---|
| 793 | \begin{figure}[tbh] |
---|
| 794 | \begin{center} |
---|
| 795 | \begin{verbatim} |
---|
| 796 | bit01_r0 = simd<1>::mergeh(bit0, bit1); |
---|
| 797 | bit01_r1 = simd<1>::mergel(bit0, bit1); |
---|
| 798 | bit23_r0 = simd<1>::mergeh(bit2, bit3); |
---|
| 799 | bit23_r1 = simd<1>::mergel(bit2, bit3); |
---|
| 800 | bit45_r0 = simd<1>::mergeh(bit4, bit5); |
---|
| 801 | bit45_r1 = simd<1>::mergel(bit4, bit5); |
---|
| 802 | bit67_r0 = simd<1>::mergeh(bit6, bit7); |
---|
| 803 | bit67_r1 = simd<1>::mergel(bit6, bit7); |
---|
| 804 | bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0); |
---|
| 805 | bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0); |
---|
| 806 | bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1); |
---|
| 807 | bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1); |
---|
| 808 | bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0); |
---|
| 809 | bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0); |
---|
| 810 | bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1); |
---|
| 811 | bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1); |
---|
| 812 | s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0); |
---|
| 813 | s1 = simd<4>::mergel(bit0123_r0, bit4567_r0); |
---|
| 814 | s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1); |
---|
| 815 | s3 = simd<4>::mergel(bit0123_r1, bit4567_r1); |
---|
| 816 | s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2); |
---|
| 817 | s5 = simd<4>::mergel(bit0123_r2, bit4567_r2); |
---|
| 818 | s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3); |
---|
| 819 | s7 = simd<4>::mergel(bit0123_r3, bit4567_r3); |
---|
| 820 | \end{verbatim} |
---|
| 821 | \end{center} |
---|
| 822 | \label{p2s-inductive} |
---|
| 823 | \caption{Parallel to Serial Transposition by Inductive Doubling} |
---|
| 824 | \end{figure} |
---|
| 825 | |
---|
| 826 | An algorithm employing only 24 operations using the |
---|
| 827 | inductive doubling instruction set architecture is relatively |
---|
| 828 | straightforward.. In stage 1, parallel registers for individual bit streams |
---|
| 829 | are first merged with bit-level interleaving |
---|
| 830 | using \verb#simd<1>::mergeh# and \verb#simd<8>::mergel# |
---|
| 831 | operations. For each of the four pairs of consecutive |
---|
| 832 | even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7), |
---|
| 833 | two consecutive registers of bitpair data are produced. |
---|
| 834 | In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel# |
---|
| 835 | are then applied to merge to bitpair streams to produce streams |
---|
| 836 | of nybbles for the high and low nybble of each byte. Finally, |
---|
| 837 | the nybble streams are merged in stage 3 to produce the |
---|
| 838 | desired byte stream data. The full inductive doubling |
---|
| 839 | algorithm for parallel to serial transposition is shown in Figure |
---|
| 840 | \ref{p2s-inductive}. |
---|
| 841 | |
---|
| 842 | This algorithm is again optimal, requiring the fewest operations |
---|
| 843 | of any possible algorithm using any 3-register instruction set |
---|
| 844 | model. The proof directly follows that for the serial to parallel |
---|
| 845 | transposition. |
---|
| 846 | |
---|
| 847 | The existence of high-performance algorithms for transformation of |
---|
| 848 | character data between byte stream and parallel bit stream form |
---|
| 849 | in both directions makes it possible to consider applying these |
---|
| 850 | transformations multiple times during text processing applications. |
---|
| 851 | Just as the time domain and frequency domain each have their |
---|
| 852 | use in signal processing applications, the byte stream form and |
---|
| 853 | parallel bit stream form can then each be used at will in |
---|
| 854 | character stream applications. |
---|
| 855 | |
---|
| 856 | |
---|
| 857 | |
---|
| 858 | |
---|
| 859 | |
---|
| 860 | \section{Parallel Bit Deletion} |
---|
| 861 | |
---|
| 862 | Parallel bit deletion is an important operation that allows string |
---|
| 863 | editing operations to be carried out while in parallel bit stream |
---|
| 864 | form. It is also fundamental to UTF-8 to UTF-16 transcoding |
---|
| 865 | using parallel bit streams, allowing the excess code unit |
---|
| 866 | positions for UTF-8 two-, three- and four-byte sequences to |
---|
| 867 | be deleted once the sixteen parallel bit streams of UTF-16 have |
---|
| 868 | been computed \cite{PPoPP08}. |
---|
| 869 | |
---|
| 870 | Parallel bit deletion is specified using a deletion mask. |
---|
| 871 | A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits |
---|
| 872 | to be deleted and 0s at positions identifying bits to be retained. |
---|
| 873 | For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel |
---|
| 874 | bit streams \verb:abcdefgh: and \verb:ABCDEFGH:. Parallel deletion of elements from both bit streams in |
---|
| 875 | accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:. |
---|
| 876 | |
---|
| 877 | Bit deletion may be performed using |
---|
| 878 | the parallel-prefix compress algorithm documented by |
---|
| 879 | Warren and attributed to Steele \cite{HackersDelight}. This algorithm uses |
---|
| 880 | only logic and shifts with a constant parameter to carry |
---|
| 881 | out the deletion process. However, it requires $k^2$ |
---|
| 882 | preprocessing steps for a final field width parameter |
---|
| 883 | of size $2^k$, as well as 4 operations per deletion step |
---|
| 884 | per stream. Using the inductive doubling instruction set architecture |
---|
| 885 | it is possible to carry out bit deletion much more efficiently. |
---|
| 886 | |
---|
| 887 | Deletion within fixed size fields or registers may produce results that are either |
---|
| 888 | left justified or right-justified. For example, a five-element stream \verb:bdefh: within an |
---|
| 889 | eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't |
---|
| 890 | care positions marked `\verb:x:'. Concatenating an adjacent right-justified result with a |
---|
| 891 | left-justified result produces an important intermediate form known as a |
---|
| 892 | {\em central deletion result}. For example, \verb:xxbd: and \verb:efhx: may be respective |
---|
| 893 | right-justified and left-justified results from the application of the |
---|
| 894 | 4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element |
---|
| 895 | stream segments \verb:abcd: and \verb:efgh:. Concatenation of \verb:xxbd: and \verb:efhx: produces |
---|
| 896 | the central result \verb:xxbdefhx:, which may easily be converted to a either a |
---|
| 897 | left or a right justified 8-element result by an appropriate shift operation. |
---|
| 898 | |
---|
| 899 | \begin{figure} |
---|
| 900 | \begin{center} |
---|
| 901 | \begin{tabular}{|c||c|c|c|c|c|c|c|c|} |
---|
| 902 | \hline |
---|
| 903 | \verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010: \\ \hline |
---|
| 904 | \verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F: \\ \hline |
---|
| 905 | \verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline |
---|
| 906 | \verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1 \\ \hline |
---|
| 907 | \verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline |
---|
| 908 | \verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline |
---|
| 909 | \verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline |
---|
| 910 | \verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline |
---|
| 911 | \end{tabular} |
---|
| 912 | \end{center} |
---|
| 913 | \label{centraldelstep} |
---|
| 914 | \caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction} |
---|
| 915 | \end{figure} |
---|
| 916 | |
---|
| 917 | The observation about how two $n$-bit central deletion results can |
---|
| 918 | combine to yield a $2n$ central deletion result provides the basis |
---|
| 919 | for an inductive doubling algorithm. Figure \ref{centraldelstep} |
---|
| 920 | illustrates the inductive process for the transition from 8-bit central |
---|
| 921 | deletion results to 16-bit central deletion results. The top row shows |
---|
| 922 | the original deletion mask, while the second row shows the original |
---|
| 923 | bit stream to which deletions are to be applied, with deleted bits zeroed out. |
---|
| 924 | The third row shows the central result for each 8-bit field as the |
---|
| 925 | result of the previous inductive step. |
---|
| 926 | |
---|
| 927 | To perform the $8 \rightarrow 16$ central deletion step, we first form |
---|
| 928 | the population counts of 4-bit fields of the original deletion mask as |
---|
| 929 | shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying |
---|
| 930 | an 8-bit central result, we perform a right shift by the population count |
---|
| 931 | of the low half of the field. Similarly, |
---|
| 932 | left-justification requires a left-shift by the population count in the |
---|
| 933 | high half of the field. |
---|
| 934 | |
---|
| 935 | The left and right shifts can be performed simultaneously using a rotate |
---|
| 936 | left instruction. Right justification by shifting an $n$ bit field |
---|
| 937 | $i$ positions to the right is equivalent to a left rotate of $n-i$ |
---|
| 938 | positions. These rotation amounts are computed by the operation \newline |
---|
| 939 | \verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5, |
---|
| 940 | except that don't care fields (which won't be subsequently used) |
---|
| 941 | are marked \verb:XX:. |
---|
| 942 | |
---|
| 943 | The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)# |
---|
| 944 | as shown in row 6, and are combined with the right shift amounts |
---|
| 945 | by the selection operation \newline \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)# |
---|
| 946 | as shown in row 7. Using these computed values, the inductive step |
---|
| 947 | is completed by application of the operation \newline \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)# |
---|
| 948 | as shown in row 8. |
---|
| 949 | |
---|
| 950 | At each inductive doubling level, it requires 4 operations to compute the |
---|
| 951 | required deletion infomation and one operation per bit stream to perform deletion. |
---|
| 952 | Note that, if deletion is to be applied to a set of eight parallel bit streams, |
---|
| 953 | the computed deletion information is used for each stream without recomputation, |
---|
| 954 | thus requiring 12 operations per inductive level. |
---|
| 955 | |
---|
| 956 | In comparison to the parallel-prefix compress method, the method of central |
---|
| 957 | deletion results using the inductive doubling architecture has far fewer operations. |
---|
| 958 | The total preprocessing cost is $4k$ for $k$ steps of deletion by central result |
---|
| 959 | induction versus $4k^2$ for the parallel-prefix method. Using the computed |
---|
| 960 | deletion operation, only a single SIMD rotate operation per bit stream |
---|
| 961 | per level is needed, in comparison with 4 operations per level for parallel-prefix |
---|
| 962 | compress. |
---|
| 963 | |
---|
| 964 | |
---|
| 965 | \section{Systematic Support for Horizontal SIMD Operations} |
---|
| 966 | |
---|
| 967 | In SIMD parlance, {\em horizontal} operations are |
---|
| 968 | operations which combine values from two or more fields |
---|
| 969 | of the same register, in contrast to the normal |
---|
| 970 | {\em vertical} operations which combine corresponding |
---|
| 971 | fields of different registers. Horizontal operations |
---|
| 972 | can be found that combine two (e.g., haddpd on SSE3), |
---|
| 973 | four (e.g, \verb:si_orx: on SPU), eight (e.g, psadbw on SSE) |
---|
| 974 | or sixteen values (e.g., vcmpequb on Altivec). Some |
---|
| 975 | horizontal operations have a vertical component as well. |
---|
| 976 | For example, psadbw first forms the absolute value of |
---|
| 977 | the difference of eight corresponding byte fields before |
---|
| 978 | performing horizontal add of the eight values, while |
---|
| 979 | vsum4ubs on Altivec performs horizontal add of sets of |
---|
| 980 | four unsigned 8-bit fields within one register |
---|
| 981 | and then combines the result horizontally with |
---|
| 982 | corresponding 32-bit fields of a second registers. |
---|
| 983 | |
---|
| 984 | The space of potential horizontal operations thus has |
---|
| 985 | many dimensions, including not only the particular |
---|
| 986 | combining operation and the operand field width, but |
---|
| 987 | also the number of fields being combined, whether a |
---|
| 988 | vertical combination is applied and whether it is applied |
---|
| 989 | before or after the horizontal operation and what the |
---|
| 990 | nature of the vertical combining operation is. |
---|
| 991 | Within this space, commodity SIMD architectures tend |
---|
| 992 | to support only a very few combinations, without any |
---|
| 993 | particular attempt at systematic support for horizontal |
---|
| 994 | operations in general. |
---|
| 995 | |
---|
| 996 | By making use of \verb:<l,h>: half-operand modifier |
---|
| 997 | combinations, the inductive doubling architecture |
---|
| 998 | offers systematic support for horizontal operations |
---|
| 999 | on pairs of adjacent fields. |
---|
| 1000 | For example, \verb#simd<16>::add<l,h># adds values |
---|
| 1001 | in adjacent 8 bit fields to produce 16 bit results, |
---|
| 1002 | while \verb#simd<32>::min<l,h># can produce the |
---|
| 1003 | minimum value of adjacent 16-bit fields. In general, |
---|
| 1004 | \newline \verb#simd<n>::F<l,h># denotes the horizontal |
---|
| 1005 | binary combination of adjacent fields for any |
---|
| 1006 | operator $F$ and field width $n$. |
---|
| 1007 | |
---|
| 1008 | Horizontal combinations of larger numbers of fields |
---|
| 1009 | makes use of the inductive doubling property. |
---|
| 1010 | For example, consider the or-across operation \verb:si_orx: |
---|
| 1011 | of the SPU, that performs a logical or operation |
---|
| 1012 | on four 32-bit fields. This four field combination |
---|
| 1013 | involves two steps in the inductive doubling approach. |
---|
| 1014 | %\begin{singlespace} |
---|
| 1015 | \begin{verbatim} |
---|
| 1016 | t = simd<64>::or<l,h>(x, x) |
---|
| 1017 | t = simd<128>::or<l,h>(t, t) |
---|
| 1018 | \end{verbatim} |
---|
| 1019 | %\end{singlespace} |
---|
| 1020 | This example is also interesting in showing a potential |
---|
| 1021 | value for supporting bitwise logical operations at |
---|
| 1022 | different field widths, i.e., specifically for use with |
---|
| 1023 | half-operand modifiers. |
---|
| 1024 | |
---|
| 1025 | Similarly, to combine any eight fields simply requires |
---|
| 1026 | three inductive doubling steps using the desired |
---|
| 1027 | operator at successive power-of-two field widths, while |
---|
| 1028 | combining sixteen fields requires four such operations. |
---|
| 1029 | In this way, the inductive doubling architecture provides |
---|
| 1030 | systematic support for horizontal operations well beyond |
---|
| 1031 | the existing facilities of commodity architectures, |
---|
| 1032 | although lacking some of the special features found in |
---|
| 1033 | some cases. |
---|
| 1034 | |
---|
| 1035 | |
---|
| 1036 | \section{Implementation} |
---|
| 1037 | |
---|
[228] | 1038 | We have carried implementation work for IDISA in three |
---|
| 1039 | ways. First, we have constructed libraries that |
---|
| 1040 | implement the IDISA instructions by template and/or macro |
---|
| 1041 | expansion for each of MMX, SSE, Altivec, and SPU platforms. |
---|
| 1042 | Second, we have developed a model implementation |
---|
| 1043 | involving a modified operand fetch component |
---|
| 1044 | of a pipelined SIMD processor. Third, we have written |
---|
| 1045 | and evaluated Verilog HDL description of this model |
---|
| 1046 | implementation. |
---|
[227] | 1047 | |
---|
[228] | 1048 | \subsection{IDISA Libraries} |
---|
[227] | 1049 | |
---|
[228] | 1050 | Implementation of IDISA instructions using template |
---|
| 1051 | and macro libraries has been useful in developing |
---|
| 1052 | and assessing the correctness of many of the algorithms |
---|
| 1053 | presented here. Although these implementations do not |
---|
| 1054 | deliver the performance benefits associated with |
---|
| 1055 | direct hardware implementation of IDISA, they |
---|
| 1056 | have been quite useful in providing a practical means |
---|
| 1057 | for portable implementation of parallel bit stream |
---|
| 1058 | algorithms on multiple SWAR architectures. However, |
---|
| 1059 | one additional facility has also proven necessary for |
---|
| 1060 | portability of parallel bit stream algorithms across |
---|
| 1061 | big-endian and little-endian architectures: the |
---|
| 1062 | notion of shift-forward and shift-back operations. |
---|
| 1063 | In essence, shift forward means shift to the left |
---|
| 1064 | on little-endian systems and shift to the right on |
---|
| 1065 | big-endian systems, while shift back has the reverse |
---|
| 1066 | interpretation. Although this concept is unrelated to |
---|
| 1067 | inductive doubling, its inclusion with the IDISA |
---|
| 1068 | libraries has provided a suitable basis for portable |
---|
| 1069 | SIMD implementations of parallel bit stream algorithms. |
---|
| 1070 | Beyond this, the IDISA libraries have the additional |
---|
| 1071 | benefit of allowing the implementation |
---|
| 1072 | of inductive doubling algorithms at a higher level |
---|
| 1073 | abstraction, without need for programmer coding of |
---|
| 1074 | the underlying shift and mask operations. |
---|
[227] | 1075 | |
---|
[228] | 1076 | \subsection{IDISA Model} |
---|
| 1077 | Figure \ref{pipeline-model} shows a model architecture |
---|
| 1078 | for a pipelined SIMD processor implementing IDISA. |
---|
| 1079 | The SIMD Register File (SRF) provides a file of $R = 2^A$ |
---|
| 1080 | registers each of width $N = 2^K$ bits. |
---|
| 1081 | IDISA instructions identified by the Instruction Fetch |
---|
| 1082 | Unit (IFU) are forwarded for decoding to the SIMD |
---|
| 1083 | Instruction Decode Unit (SIDU). This unit decodes |
---|
| 1084 | the instruction to produce |
---|
| 1085 | signals identifying the source and destination |
---|
| 1086 | operand registers, the half-operand modifiers, the |
---|
| 1087 | field width specification and the SIMD operation |
---|
| 1088 | to be applied. |
---|
[227] | 1089 | |
---|
[228] | 1090 | The SIDU supplies the source register information and the half-operand |
---|
| 1091 | modifier information to the SIMD Operand Fetch Unit (SOFU). |
---|
| 1092 | For each source operand, the SIDU provides an $A$-bit register |
---|
| 1093 | address and two 1-bit signals $h$ and $l$ indicating the value |
---|
| 1094 | of the decoded half-operand modifiers for this operand. |
---|
| 1095 | Only one of these values may be 1; both are 0 if |
---|
| 1096 | no modifier is specified. |
---|
| 1097 | In addition, the SIDU supplies decoded field width information |
---|
| 1098 | to both the SOFU and to the SIMD Instruction Execute Unit (SIEU). |
---|
| 1099 | The SIDU also supplies decoded SIMD opcode information to SIEU and |
---|
| 1100 | a decoded $A$-bit register address for the destination register to |
---|
| 1101 | the SIMD Result Write Back Unit (SRWBU). |
---|
[227] | 1102 | |
---|
[228] | 1103 | The SOFU is the key component of the IDISA model that |
---|
| 1104 | differs from that found in a traditional SWAR |
---|
| 1105 | processor. For each of the two $A$-bit source |
---|
| 1106 | register addresses, SOFU is first responsible for |
---|
| 1107 | fetching the raw operand values from the SRF. |
---|
| 1108 | Then, before supplying operand values to the |
---|
| 1109 | SIEU, the SOFU applies the half-operand modification |
---|
| 1110 | logic as specified by the $h$, $l$, and field-width |
---|
| 1111 | signals. The possibly modified operand values are then |
---|
| 1112 | provided to the SIEU for carrying out the SIMD operations. |
---|
| 1113 | A detailed model of SOFU logic is described in the following |
---|
| 1114 | subsection. |
---|
| 1115 | |
---|
| 1116 | The SIEU differs from similar execution units in |
---|
| 1117 | current commodity processors primarily by providing |
---|
| 1118 | SIMD operations at each field width |
---|
| 1119 | $n=2^k$ for $0 \leq k \leq K$. This involves |
---|
| 1120 | additional circuitry for field widths not supported |
---|
| 1121 | in existing processors. For inductive doubling |
---|
| 1122 | algorithms in support of parallel bit streams, |
---|
| 1123 | the principal need is for additional circuitry to |
---|
| 1124 | support 2-bit and 4-bit field widths. This circuity |
---|
| 1125 | is generally less complicated than that for larger |
---|
| 1126 | fields. Support for circuitry at these width |
---|
| 1127 | has other applications as well. For example, |
---|
| 1128 | DNA sequences are frequently represented using |
---|
| 1129 | packed sequences of 2-bit codes for the four possible |
---|
| 1130 | nucleotides\cite{}, while the need for accurate financial |
---|
| 1131 | calculation has seen a resurgence of the 4-bit |
---|
| 1132 | packed BCD format for decimal floating point \cite{}. |
---|
| 1133 | |
---|
| 1134 | When execution of the SWAR instruction is |
---|
| 1135 | completed, the result value is then provided |
---|
| 1136 | to the SRWBU to update the value stored in the |
---|
| 1137 | SRF at the address specified by the $A$-bit |
---|
| 1138 | destination operand. |
---|
| 1139 | |
---|
| 1140 | \subsection{Operand Fetch Unit Logic} |
---|
| 1141 | |
---|
| 1142 | |
---|
| 1143 | |
---|
| 1144 | |
---|
[227] | 1145 | \section{Conclusions} |
---|
| 1146 | |
---|
| 1147 | This paper has considered the issue of architectural support for |
---|
| 1148 | SIMD text processing using the method of parallel bit streams and has |
---|
| 1149 | argued that this architectural support can best be provided |
---|
| 1150 | through a SIMD instruction set architecture that implements |
---|
| 1151 | features for direct support of inductive doubling algorithms. |
---|
| 1152 | Four key features of the inductive doubling architecture have |
---|
| 1153 | been identified include support for operations at all |
---|
| 1154 | power-of-2 field widths, half-operand modifiers and |
---|
| 1155 | pack and merge operations. The principle innovation is the |
---|
| 1156 | notion of half-operand modifiers to support efficient |
---|
| 1157 | transition between successive power-of-two field widths. |
---|
| 1158 | |
---|
| 1159 | Several algorithms key to parallel bit stream methods |
---|
| 1160 | have been examined and shown to benefit from dramatic |
---|
| 1161 | reductions in instruction count compared to the best |
---|
| 1162 | known algorithms on existing architecture. In the case |
---|
| 1163 | of transposition algorithms to and from parallel bit stream |
---|
| 1164 | form, the architecture has been shown to make possible |
---|
| 1165 | straightforward inductive doubling algorithms with the |
---|
| 1166 | lowest total number of operations that can be achieved by any |
---|
| 1167 | possible 3-register SIMD architecture. |
---|
| 1168 | |
---|
| 1169 | The inductive doubling architecture also has considerable |
---|
| 1170 | benefits beyond its role in supporting SIMD programming |
---|
| 1171 | with parallel bit streams. Notable among these is |
---|
| 1172 | that the architecture provides a framework for systematic |
---|
| 1173 | support of horizontal SIMD operations. |
---|
| 1174 | |
---|
| 1175 | |
---|
| 1176 | |
---|
| 1177 | %\appendix |
---|
| 1178 | %\section{Appendix Title} |
---|
| 1179 | % |
---|
| 1180 | %This is the text of the appendix, if you need one. |
---|
| 1181 | |
---|
| 1182 | \acks |
---|
| 1183 | |
---|
| 1184 | This research was supported by a Discovery Grant from the |
---|
| 1185 | Natural Sciences and Engineering Research Council of Canada. |
---|
| 1186 | |
---|
| 1187 | %\bibliographystyle{plainnat} |
---|
| 1188 | \bibliographystyle{plain} |
---|
| 1189 | \bibliography{asplos094-cameron} |
---|
| 1190 | %\begin{thebibliography}{} |
---|
| 1191 | |
---|
| 1192 | %\bibitem{smith02} |
---|
| 1193 | %Smith, P. Q. reference text |
---|
| 1194 | |
---|
| 1195 | %\end{thebibliography} |
---|
| 1196 | |
---|
| 1197 | |
---|
| 1198 | \end{document} |
---|