%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
{School of Computing Science, Simon Fraser University}
{\{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching. Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task. A set of simple SWAR instruction set extensions is proposed
for this purpose, based on the principle of systematic support for inductive
doubling as an algorithmic technique. These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement. The extensions are also shown to be useful
for SIMD programming in other application areas, including providing a
systematic treatment for horizontal operations. An implementation model for
these extensions
involves relatively simple circuitry added to the operand fetch components
in a pipelined processor.
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2



\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging. Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing to be the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help. The study even
speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle with traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes). Loading bit stream data into 128-bit SIMD registers,
then, allows data from 128 consecutive code units to be represented and
processed at once. Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.
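The transposition just described can be sketched in scalar C. The helper below is an illustrative function of our own, not part of the cited implementations, and it is shown for an 8-byte block rather than the 128-byte SIMD blocks used in practice; it simply exhibits the data layout, in which bit $k$ of each input byte is gathered into bit stream $k$.

```c
#include <stdint.h>

/* Transpose 8 bytes into 8 one-byte bit streams: bit k of input
   byte i becomes bit i of stream k.  Big-endian bit numbering is
   used, so bit 0 is the high-order bit, matching the paper's
   field-numbering convention.  Illustrative data layout only; real
   implementations use SIMD pack/merge networks on 128-byte blocks. */
void transpose8(const uint8_t bytes[8], uint8_t streams[8]) {
    for (int k = 0; k < 8; k++) {
        uint8_t s = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t bit = (bytes[i] >> (7 - k)) & 1;
            s |= (uint8_t)(bit << (7 - i));
        }
        streams[k] = s;
    }
}
```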

The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
employing the SSE or Altivec instruction sets. A full implementation
for each of these platforms is documented using literate programming techniques
and available as an open source code base \cite{u8u16}.
Although this transcoding task represents but a single text processing kernel,
it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
of processing time \cite{NicolaJohn03, Perkins05, Psaila06}. The example is also interesting in that
it illustrates a text processing task that can largely
be carried out using SIMD operations even though
UTF-8 character sequences are of variable length. Indeed, one weakness of the
actual implementation is that it reverts to byte-at-a-time processing
at block boundaries, employing a block shortening strategy to
reduce block lengths from 128 bytes to as low as 125.
With more capable SIMD instruction set architectures providing
better facilities for multiregister shifts and larger register
files, a solution employing SIMD techniques for virtually all
of the main work would maintain better data alignment, avoid
problems of register pressure and be easier to parallelize across
multiple cores. It should also scale naturally to 256-bit SIMD technology
such as Intel's expected AVX technology.

The report addressing the broader problem of XML parsing is perhaps more
interesting, demonstrating the utility of parallel bit stream techniques
in delivering performance benefits through a significant portion of the
web technology stack. In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved in using
the parallel bit stream methods in comparison with a similarly
coded application using such well-known parsers as Expat and Xerces.
Although still a work in progress, the parser has functional
coverage of XML and related specifications comparable to, but somewhat
beyond, that of Expat. The study also identifies promising further
work in extending the parallel bit stream methods to parallel
hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SIMD instruction set
features based on the principle of systematic support for
inductive doubling algorithms. Inductive doubling refers
to a general property of certain kinds of algorithm that
systematically double the values of field widths or other
data attributes with each iteration. In essence, the goal
of the proposed features is to support such algorithms
with specific facilities for transitions between successive power-of-2 field
widths. These transitions are quite frequent in several critical
algorithms for parallel bit streams, and they occur in other
applications as well. In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported to be quite useful for a number of SIMD kernels \cite{Nuzman}.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.

The second section introduces IDISA and the
SIMD notation used throughout the paper. A brief comparison of
IDISA features with the existing SIMD instruction
sets of commodity processors, such as the Intel SSE
instruction set and the PowerPC Altivec instruction set,
is also made.

The third section then discusses the evaluation methodology
for IDISA. Two reference architectures are briefly described as
well as criteria for evaluating IDISA against these architectures.

The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count. Although this operation
is not a strong determinant of performance for parallel bit
stream applications, it is nevertheless an operation needed
frequently enough in the general computing milieu to find
its way into some instruction set architectures, typically
at one particular field width.
%By way of comparison, the
%inductive doubling architecture sacrifices some
%performance at the chosen field width, while offering a more
%general solution with frequently better performance at
%other field widths.

The fifth section then moves on to consider the performance-critical
task of conversion between serial byte streams and parallel
bit streams. A first algorithm that uses the existing SIMD
operations common to SSE and Altivec is shown, requiring 72
operations to transform 128 bytes of data using the three-register
instruction form. We then consider how the task may
be simplified using IDISA to
require a mere 24 operations. As well as providing a 3X speed-up,
it is also argued that the version using the inductive doubling
architecture is considerably simpler and easier to program.

The sixth section then briefly considers the inverse transposition
process, converting parallel bit stream data back into byte
streams. Again, an algorithm to carry out this task requires
72 straight-line SIMD operations in the Altivec three-register
form, but is reduced to a simpler 24 operations using IDISA.

The seventh section then goes on to consider the problem of
parallel bit deletion. This operation is performance-critical
to any applications that require filtering or
editing operations on strings using the parallel bit stream
algorithms. For example, it is fundamental to the
high-speed UTF-8 to UTF-16 transcoding algorithm that is
often a critical component in XML parsing. In this
section, an inductive doubling algorithm based on the
concept of a central deletion result is described and
shown to have much better performance than a parallel-prefix
alternative.

The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Additional examples from several domains are presented.
Perhaps most importantly, however, the role of IDISA
in supporting a fully systematic treatment of {\em horizontal}
SWAR operations for any application area is discussed.

An implementation model for IDISA is then considered
in section 9 of the paper, focusing on a pipelined
SIMD architecture featuring a modified operand fetch stage.
A gate-count analysis of one feasible implementation is
provided as well as a discussion of the implementation
of required extensions to handle 2-bit and 4-bit fields not
commonly supported on existing SIMD architectures. Design
tradeoffs are also considered, focusing on the potential removal of
various {\em ad hoc} instructions on existing processors in favor of
more general alternatives provided through IDISA.

The tenth section concludes the paper with a summary of results
and discussion of areas for future work.


\section{Inductive Doubling Architecture}

This section presents an idealized model for a single-instruction
multiple-data (SIMD) instruction set architecture designed
specifically to support inductive doubling algorithms in the
domain of parallel bit stream programming. The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SIMD applications in other domains. The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design. First we introduce a simple model and notation for
SIMD operations in general and then present the four key
features of an idealized architecture in support of parallel
bit stream programming.

The idealized architecture supports typical SIMD integer
operations common to existing commodity architectures such as SSE
and Altivec. The architecture is defined to support SIMD
operations on registers of size $N=2^K$ bits, for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology. The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$. Typical values of $k$ used on commodity processors
include $k = 3$ for SIMD operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields. Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
register $r$, using big-endian numbering.

Let \verb:simd<n>: represent the class of SIMD operations defined
on fields of size $n$ using C++ template syntax. Given a
binary function $F_n$ on $n$-bit fields, we denote the SIMD
version of this operation as \verb#simd<n>::F#. Given two
SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\[ r_n[i] = F_n(a_n[i], b_n[i]) \]

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
%\doublespace
The SSE and Altivec instruction sets support
each of the addition and subtraction operations in SIMD form
for 8, 16 and 32-bit fields, while the SSE set also includes
the 64-bit forms. For example, \verb#simd<8>::add# in our
nomenclature is provided by the operation \verb:paddb:
on SSE and the operation \verb:vaddubm: on Altivec.
The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields be shifted by the same amount.

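These field-wise semantics can be sketched in scalar C; the following \verb:simd_add_n: function is our own illustrative helper, emulating \verb#simd<n>::add# on a single 32-bit ``register,'' and is not an instruction of any of the architectures discussed.

```c
#include <stdint.h>

/* Illustrative emulation of simd<n>::add on one 32-bit register:
   corresponding n-bit fields of a and b are added independently,
   mod 2^n, with no carries crossing field boundaries.  n must be
   a power of 2 dividing 32. */
uint32_t simd_add_n(uint32_t a, uint32_t b, int n) {
    uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1);
    uint32_t r = 0;
    for (int i = 0; i < 32; i += n) {
        uint32_t fa = (a >> i) & mask;
        uint32_t fb = (b >> i) & mask;
        r |= ((fa + fb) & mask) << i;
    }
    return r;
}
```

Note that \verb:simd_add_n(0xFF, 0x01, 8): yields 0: the low byte wraps mod $2^8$ and no carry propagates into the next field, in accord with the definition of $\mbox{add}_n$ above.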
Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values. The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers. Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms on bit streams, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms. In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields. Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width. In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width. In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core. In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1. These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available. For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation on SIMD registers.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SIMD
operation. These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:''). When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified. The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions. Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
to the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.

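The modifier semantics can be sketched in C for a single field value; the helper names below (\verb:half_l:, \verb:half_h:, \verb:add4_h_l:) are our own illustrations of the equations above, not IDISA instructions.

```c
#include <stdint.h>

/* Half-operand modifiers applied to one n-bit field value f
   (n even): l selects the low n/2 bits, h the high n/2 bits,
   following l(r_n) = r_n mod 2^(n/2) and h(r_n) = r_n / 2^(n/2). */
uint32_t half_l(uint32_t f, int n) { return f & ((1u << (n / 2)) - 1); }
uint32_t half_h(uint32_t f, int n) { return f >> (n / 2); }

/* simd<4>::add<h,l>(a,b) restricted to one 4-bit field: add the
   high 2 bits of a's field to the low 2 bits of b's field, mod 2^4. */
uint32_t add4_h_l(uint32_t afield, uint32_t bfield) {
    return (half_h(afield, 4) + half_l(bfield, 4)) & 0xF;
}
```

Applied to every field of a register with both operands equal, this is exactly the kind of step used in the inductive doubling algorithms later in the paper.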
The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations. Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may be
performed using the operation \verb#simd<n>::pack<l,l>#.

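A scalar sketch of the pack semantics, using a 32-bit ``register'' with two 16-bit fields, may clarify the equations; the helper names are our own, and unsigned saturation is used here (the text notes that either signed or unsigned saturation suits the purpose).

```c
#include <stdint.h>

/* conv: convert a value to `half` bits by unsigned saturation --
   values too large to fit clamp to the maximum representable. */
uint32_t conv_sat(uint32_t v, int half) {
    uint32_t max = (1u << half) - 1;
    return v > max ? max : v;
}

/* simd<16>::pack(a,b) on 32-bit registers: the two 16-bit fields
   of a, then the two 16-bit fields of b, each converted to 8 bits,
   packed into one 32-bit result in big-endian field order. */
uint32_t pack16(uint32_t a, uint32_t b) {
    uint32_t f0 = conv_sat(a >> 16, 8), f1 = conv_sat(a & 0xFFFF, 8);
    uint32_t f2 = conv_sat(b >> 16, 8), f3 = conv_sat(b & 0xFFFF, 8);
    return (f0 << 24) | (f1 << 16) | (f2 << 8) | f3;
}
```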
The final facility of the inductive doubling architecture is
a set of merging operations \verb#simd<n>::mergeh# and
\verb#simd<n>::mergel# that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers. The respective
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths. The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.

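The merge equations can likewise be sketched on a 32-bit ``register'' holding four 8-bit fields; \verb:merge8h: and \verb:merge8l: below are our own illustrative stand-ins for \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#, not actual instructions.

```c
#include <stdint.h>

/* simd<8>::mergeh / mergel on 32-bit registers holding four 8-bit
   fields each, in big-endian field order: mergeh interleaves the
   high two fields of a and b, mergel the low two, so that each
   16-bit result field is r_16[i] = a_8[i] * 2^8 + b_8[i]. */
uint32_t merge8h(uint32_t a, uint32_t b) {
    uint32_t a0 = (a >> 24) & 0xFF, a1 = (a >> 16) & 0xFF;
    uint32_t b0 = (b >> 24) & 0xFF, b1 = (b >> 16) & 0xFF;
    return (a0 << 24) | (b0 << 16) | (a1 << 8) | b1;
}
uint32_t merge8l(uint32_t a, uint32_t b) {
    uint32_t a2 = (a >> 8) & 0xFF, a3 = a & 0xFF;
    uint32_t b2 = (b >> 8) & 0xFF, b3 = b & 0xFF;
    return (a2 << 24) | (b2 << 16) | (a3 << 8) | b3;
}
```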
This completes the description of IDISA. As described, many of the
features are already available with the SIMD facilities of
existing commodity processors. The extensions enumerated
here are relatively straightforward. The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor. The goal
in this paper is to evaluate these features independent
of artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures. For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields. The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields. In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1. We also assume that
a constant load operation \verb#simd::const<n>(c)#
loads the constant value $c$ into each $n$-bit field.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)# rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
operator implements the logic $a \wedge b \vee \neg a \wedge c$, patterned
after the Altivec \verb:vec_sel: operation. Finally,
a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

Two versions of IDISA are assessed against these reference
architectures as follows. IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128. IDISA-B is
similarly defined and extended based on RefB. Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption. On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be
pipelined processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access. Such processors are certainly feasible with a
suitable investment in circuitry, although practical designs
may make sacrifices to achieve close to optimal throughput
with less circuitry. However, the assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels. Furthermore,
the assumption also eliminates artifacts due to possibly different
latencies in reference and IDISA architectures. Because
the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, the
applications tend to be compute-bound rather than I/O-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity required to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section. That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}[h]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field. The operation is
important enough, for applications such as calculating Hamming distance,
to be included as a built-in instruction
on some processors. For example, the SPU of the Cell Broadband Engine
has a SIMD population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields. Each implementation employs
five steps of inductive doubling to produce population counts
within 32 bit fields. The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

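The five masked-add steps can be checked directly. The short Python sketch below (an illustration, not part of the reference implementation) runs the divide-and-conquer steps of Figure \ref{HD-pop} and compares the result against a direct bit count:

```python
def popcount32(x):
    # Five steps of inductive doubling: at each step, counts within
    # 2k-bit fields are formed by adding the counts held in the high
    # and low k-bit halves of each field.
    c = (x & 0x55555555) + ((x >> 1) & 0x55555555)
    c = (c & 0x33333333) + ((c >> 2) & 0x33333333)
    c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F)
    c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF)
    c = (c & 0x0000FFFF) + ((c >> 16) & 0x0000FFFF)
    return c

for x in (0, 1, 0xFFFFFFFF, 0xDEADBEEF, 0x12345678):
    assert popcount32(x) == bin(x).count("1")
```

Each \verb#simd<2k>::add<h,l># line of Figure \ref{inductivepopcount} performs exactly one of these steps, with the masking and shifting implicit in the half-operand modifiers.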
---|
With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SWAR operation. Without the IDISA features, a
straightforward RefA implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task. Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, these constants
could be kept preloaded in long vector processing. In accord
with our evaluation model, the RefB cost is thus 15 operations.
As the IDISA implementation requires no constants at all,
both the IDISA-A and IDISA-B cost is 5 operations.
Under our assumed one-CPU-cycle-per-instruction model, IDISA-A
offers a 4X improvement over RefA, while IDISA-B offers a 3X
improvement over its comparator.

---|
The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step. If there is
one such operation for each operand, and the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations are required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks. IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$. Thus a 3X to 4X improvement
obtains in these cases.

---|
\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of IDISA: transposition of byte stream data to parallel bit stream
form. This operation is critical to the
method of parallel bit streams; all applications
of the method can benefit from a highly efficient
transposition process. Before considering how
IDISA supports this
transposition process, however, we first consider
algorithms on existing architectures. Two algorithms
are presented; the better of these requires 72
SWAR operations under the RefB model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using IDISA-A or IDISA-B
with a mere 24 three-register SIMD operations. We also show
that this is optimal for any three-register instruction set model.

---|
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
\caption{Input/Output Model for Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}
\end{figure}
Figure \ref{s2p-spec} illustrates the input-output requirements of
the transposition problem. We assume that inputs and
outputs are each SWAR registers of size $N=2^K$ bits.
The input consists of $N$ bytes of serial byte data,
stored consecutively in eight SIMD registers each holding
$N/8$ bytes. The output consists of eight parallel
registers, one each for the eight individual bit positions
within a byte. Upon completion of the transposition process,
each output register is to hold the $N$ bits corresponding
to the selected bit position in the sequence of $N$ input
bytes.

---|
\subsection{Bit Gathering Algorithm}

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
\caption{Serial to Parallel Transposition Using Bit-Gathering}
\label{gather}
\end{center}
\end{figure}
One straightforward algorithm for implementing the transposition process
takes advantage of SWAR bit gathering operations that exist
on some architectures. This operation gathers one bit per byte
from a particular position within each byte of a SIMD register.
For example, the {\tt pmovmskb} operation of the Intel MMX and
SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
consisting of the high bit of each byte. Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte
from the SIMD register. Figure \ref{gather} illustrates the
bit gathering process.

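The effect of a bit gather can be modelled in a few lines of Python (a sketch for illustration; real implementations operate on SIMD registers, and bit ordering conventions differ across instruction sets — the ordering below places the first byte's bit in the high position for readability):

```python
def gather_high_bits(reg):
    """Gather the high bit of each byte of reg (a list of byte values),
    forming a mask with one bit per byte."""
    mask = 0
    for byte in reg:
        mask = (mask << 1) | (byte >> 7)
    return mask

# The high bit of 0x80 and 0xFF is 1; of 0x7F and 0x00 it is 0.
assert gather_high_bits([0x80, 0x7F, 0xFF, 0x00]) == 0b1010
```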
---|
For each bit stream of output, the bit gather algorithm requires
one gather operation for each of the 8 input registers,
giving 64 bit gather operations in all. In addition, for seven
of the eight bit positions, it is necessary to shift the bits
of each input register into the conventional position for
gathering. A total of 56 shift operations are required for this
task. Finally, the result of each bit gather operation must
be properly inserted into the output stream. If the first
gather operation for a stream can be used to also initialize
the output register, there will then need to be 7 insert
operations for the results of the remaining gather operations
for this stream, making 56 insert operations in all.
The insert step may be more complex than a single operation
in some cases, but we use one operation per insert as a lower bound.
Thus, the bit gather algorithm requires
at least 176 operations to perform the transposition task.

---|
\subsection{BytePack Algorithm}

A much more efficient transposition algorithm on commodity
SWAR architectures involves three
stages of binary division transformation. This is similar
to the three stage bit matrix inversion described by
Warren \cite{HackersDelight}, although modified to use SWAR operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions. The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even-numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte. The stream of bits from
odd positions is similarly divided into streams for bits from each of the
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage is
divided into the desired individual bit streams in the third stage.

---|
---|
\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::const(0xAA);
t0 = simd<8>::permute(s0, s1, even);
t1 = simd<8>::permute(s0, s1, odd);
p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{RefB Transposition Step in BytePack Stage 1}
\label{s2pstep}
\end{figure}

---|
The binary division transformations are accomplished in each stage
using byte packing, shifting and masking. In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.
Figure \ref{s2pstep} shows a stage 1 transposition step in a
RefB implementation. Using the permute facility, the even
and odd bytes, respectively, from two serial input registers
\verb:s0: and \verb:s1: are packed into temporary registers
\verb:t0: and \verb:t1:. The even and odd bits are then
separated into two parallel output registers \verb:p0: and \verb:p1:
by selecting alternating bits using a mask. This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and mask constants.
Overall, 6 operations per step are required, yielding a total
of 72 operations to transpose 128 bytes to parallel bit stream
form in the RefB implementation.

---|
In a RefA implementation, byte packing may also be achieved
by \verb#simd<16>::pack#, with 4 additional operations to
prepare operands. Essentially, the RefB implementation
uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
each \verb#simd_if#.
Assuming that mask loads are needed only once per 128 bytes,
a total of 148 operations are required in the RefA implementation.

---|
\subsection{Inductive Halving Algorithm}

Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the bytepack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bitpairs in a second stage and finally
dividing the bitpair streams into bit streams in the third stage.

---|
\begin{figure}[tbh]
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two IDISA-A operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte. As in the bytepack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

---|
Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs. The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble. Four applications of this step complete stage 2.

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair. The complete algorithm to transform eight serial
byte registers s0 through s7 into the eight parallel bit stream
registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
Under either the IDISA-A or IDISA-B model, a mere 24 operations per 128
input bytes are required.

---|
\begin{figure}[tbh]
\small
\begin{verbatim}
hnybble0 = simd<8>::pack<h,h>(s0, s1);
lnybble0 = simd<8>::pack<l,l>(s0, s1);
hnybble1 = simd<8>::pack<h,h>(s2, s3);
lnybble1 = simd<8>::pack<l,l>(s2, s3);
hnybble2 = simd<8>::pack<h,h>(s4, s5);
lnybble2 = simd<8>::pack<l,l>(s4, s5);
hnybble3 = simd<8>::pack<h,h>(s6, s7);
lnybble3 = simd<8>::pack<l,l>(s6, s7);
hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
\end{verbatim}
\caption{Complete Inductive Halving Algorithm}
\label{halvingalgorithm}
\end{figure}

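The algorithm of Figure \ref{halvingalgorithm} is easy to model and verify. In the Python sketch below (a behavioural model of the IDISA pack semantics, not an implementation), a register is a list of $w$-bit fields, and the \verb#pack# operations take the high or low half of every field of both operands:

```python
import random

def pack_hh(a, b, w):
    """simd<w>::pack<h,h>: high w/2 bits of each field of a, then of b."""
    return [x >> (w // 2) for x in a] + [x >> (w // 2) for x in b]

def pack_ll(a, b, w):
    """simd<w>::pack<l,l>: low w/2 bits of each field of a, then of b."""
    lo = (1 << (w // 2)) - 1
    return [x & lo for x in a] + [x & lo for x in b]

def transpose(s):
    """Inductive halving: 8 registers of 16 bytes -> 8 registers of 128 bits."""
    hn = [pack_hh(s[2*i], s[2*i+1], 8) for i in range(4)]    # stage 1: nybbles
    ln = [pack_ll(s[2*i], s[2*i+1], 8) for i in range(4)]
    hh = [pack_hh(hn[2*i], hn[2*i+1], 4) for i in range(2)]  # stage 2: bit pairs
    hl = [pack_ll(hn[2*i], hn[2*i+1], 4) for i in range(2)]
    lh = [pack_hh(ln[2*i], ln[2*i+1], 4) for i in range(2)]
    ll = [pack_ll(ln[2*i], ln[2*i+1], 4) for i in range(2)]
    return [pack_hh(hh[0], hh[1], 2), pack_ll(hh[0], hh[1], 2),  # stage 3: bits
            pack_hh(hl[0], hl[1], 2), pack_ll(hl[0], hl[1], 2),
            pack_hh(lh[0], lh[1], 2), pack_ll(lh[0], lh[1], 2),
            pack_hh(ll[0], ll[1], 2), pack_ll(ll[0], ll[1], 2)]

s = [[random.randrange(256) for _ in range(16)] for _ in range(8)]
bits = transpose(s)
stream = [byte for reg in s for byte in reg]   # the 128 input bytes in order
for j in range(8):                             # bit j (0 = high bit) of each byte
    assert bits[j] == [(byte >> (7 - j)) & 1 for byte in stream]
```

The final assertion confirms that output register $j$ holds exactly bit position $j$ of each of the 128 input bytes, in order.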
---|
\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SIMD architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SIMD architecture, we refer to any architecture
that uses SIMD instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition. Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address. However, each
operation can move results into at most one register.
At most this can result in the assignment of one correct address
bit for each of the $N$ input bits. As all $8N$ input bits
need to be moved to a register with a correct 3-bit address,
a minimum of 24 operations is required.

---|
\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}. Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone would improve end-to-end performance in a
complete XML processing application by more than 4\%.

---|

\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions. In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed. In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit. Those two byte streams
are subsequently merged to form the final result.

---|
Algorithms for performing the inverse transform mirror those
of the forward transform, employing SIMD merge operations
in place of pack operations. The best algorithm known
to us on the commodity SIMD architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets. These algorithms take 72 operations to perform the
inverse transposition of 8 parallel registers of bit stream
data into 8 serial registers of byte stream data.

---|
\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
bit01_r0 = simd<1>::mergeh(bit0, bit1);
bit01_r1 = simd<1>::mergel(bit0, bit1);
bit23_r0 = simd<1>::mergeh(bit2, bit3);
bit23_r1 = simd<1>::mergel(bit2, bit3);
bit45_r0 = simd<1>::mergeh(bit4, bit5);
bit45_r1 = simd<1>::mergel(bit4, bit5);
bit67_r0 = simd<1>::mergeh(bit6, bit7);
bit67_r1 = simd<1>::mergel(bit6, bit7);
bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
\end{verbatim}
\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
\end{figure}

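The merge-based inverse can likewise be modelled and checked. In the Python sketch below (again a behavioural model, not an implementation), a register is a list of 128 bits and \verb#mergeh#/\verb#mergel# interleave $w$-bit fields from the high or low halves of their operands:

```python
def mergeh(a, b, w):
    """simd<w>::mergeh: interleave w-bit fields from the high halves of a and b."""
    out = []
    for i in range(0, len(a) // 2, w):
        out += a[i:i+w] + b[i:i+w]
    return out

def mergel(a, b, w):
    """simd<w>::mergel: interleave w-bit fields from the low halves of a and b."""
    out = []
    for i in range(len(a) // 2, len(a), w):
        out += a[i:i+w] + b[i:i+w]
    return out

def inverse_transpose(bit):
    """8 parallel bit registers (bit 0 = high bit of each byte) -> 8 serial
    byte registers, each modelled as a list of 128 bits."""
    b01 = [mergeh(bit[0], bit[1], 1), mergel(bit[0], bit[1], 1)]   # stage 1
    b23 = [mergeh(bit[2], bit[3], 1), mergel(bit[2], bit[3], 1)]
    b45 = [mergeh(bit[4], bit[5], 1), mergel(bit[4], bit[5], 1)]
    b67 = [mergeh(bit[6], bit[7], 1), mergel(bit[6], bit[7], 1)]
    b0123 = [mergeh(b01[0], b23[0], 2), mergel(b01[0], b23[0], 2), # stage 2
             mergeh(b01[1], b23[1], 2), mergel(b01[1], b23[1], 2)]
    b4567 = [mergeh(b45[0], b67[0], 2), mergel(b45[0], b67[0], 2),
             mergeh(b45[1], b67[1], 2), mergel(b45[1], b67[1], 2)]
    s = []
    for r in range(4):                                             # stage 3
        s.append(mergeh(b0123[r], b4567[r], 4))
        s.append(mergel(b0123[r], b4567[r], 4))
    return s

# Round trip: transposing 128 bytes to bit streams and back recovers the bytes.
data = list(range(128))
bit = [[(x >> (7 - j)) & 1 for x in data] for j in range(8)]
flat = [v for reg in inverse_transpose(bit) for v in reg]
out = [int("".join(map(str, flat[8*i:8*i+8])), 2) for i in range(128)]
assert out == data
```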
---|
An algorithm employing only 24 operations using the
inductive doubling instruction set architecture is relatively
straightforward. In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations. For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bitpair streams to produce streams
of nybbles for the high and low nybble of each byte. Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data. The full inductive doubling
algorithm for parallel to serial transposition is shown in Figure
\ref{p2s-inductive}.

---|
This algorithm is again optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model. The proof directly follows that for the serial to parallel
transposition.

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.

---|

\section{Parallel Bit Deletion}

---|
\begin{figure*}[tbh]
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010: \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F: \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1 \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

---|
Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form. It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

---|
Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:. Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.

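The specification can be stated in a one-line Python sketch, using strings for streams (for illustration only; the point of the algorithms below is to achieve this effect with SIMD operations):

```python
def parallel_delete(stream, delmask):
    """Keep the elements of stream at positions where delmask is '0'."""
    return "".join(c for c, m in zip(stream, delmask) if m == "0")

# The worked example: the same mask applied to two parallel streams.
assert parallel_delete("abcdefgh", "10100010") == "bdefh"
assert parallel_delete("ABCDEFGH", "10100010") == "BDEFH"
```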
---|
Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}. This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process. However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream. Using the inductive doubling instruction set architecture
it is possible to carry out bit deletion much more efficiently.

---|
Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified. For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't-care
positions marked `\verb:x:'. Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}. For example, \verb:xxbd: and \verb:efhx: may be respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:. Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.

---|

The observation that two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm. Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results. The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

---|
To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field. Similarly,
left-justification requires a left shift by the population count in the
high half of the field.

---|
The left and right shifts can be performed simultaneously using a rotate
left instruction. Right justification by shifting an $n$-bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions. These rotation amounts are computed by the operation
\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
except that don't-care fields (which won't be subsequently used)
are marked \verb:XX:.

---|
The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
as shown in row 6, and are combined with the right shift amounts
by the selection operation \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
as shown in row 7. Using these computed values, the inductive step
is completed by application of the operation \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.

At each inductive doubling level, four operations are required to compute the
deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture requires far fewer
operations. The total preprocessing cost is $4k$ for $k$ steps of deletion by
central result induction versus $4k^2$ for the parallel-prefix method. Using the
computed deletion information, only a single SIMD rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.


\section{Beyond Parallel Bit Streams}

It can be argued that text processing with parallel bit
streams by itself is not sufficiently important to justify
the circuit complexity of IDISA. In this section, then,
we move on to consider further applications that may
benefit from IDISA features. These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
Further applications for narrow field widths are discussed
as well, such as packed DNA representations using 2-bit
field widths and packed decimal representations using 4-bit
field widths. Most significantly, however, we show that IDISA offers
a fully general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Inductive Doubling with Logic Operations}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >>16);
y = y & 1;
\end{verbatim}
\end{center}
\caption{Parity Reference Algorithm}
\label{HD-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = x - 1;
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
x = x + 1;
\end{verbatim}
\end{center}
\caption{Power-of-2 Ceiling Reference Algorithm}
\label{HD-clp2}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x,h>(x, x);
x = simd<4>::or<x,h>(x, x);
x = simd<8>::or<x,h>(x, x);
x = simd<16>::or<x,h>(x, x);
x = simd<32>::or<x,h>(x, x);
x = simd<32>::add(x, simd<32>::const(1));
\end{verbatim}
\end{center}
\caption{IDISA Power-of-2 Ceiling Implementation}
\label{ID-clp2}
\end{figure}

Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
code for two further inductive doubling examples using
logical operators as the combining feature. In the
first of these, the ``exclusive or'' operation is applied
to all bits in a 32-bit value to determine parity.
Parity has important applications in error-correcting
codes such as the various Hamming codes, which detect
and correct a number of bit errors dependent on the
number of parity bits added. Warren's version uses
11 operations for parity of a single 32-bit value;
a straightforward SWAR adaptation also uses 11 operations
for the parity of a set of 32-bit fields.

Figure \ref{ID-parity} shows the IDISA parity implementation,
which requires only 5 operations. This represents slightly
more than a 2X improvement. The improvement is less than the
3X seen in other cases because one of the operands need
not be modified before applying the exclusive-or operation.

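For reference, Warren's parity reduction of Figure \ref{HD-parity} can be expressed directly as a C function on a single 32-bit value:

```c
#include <assert.h>
#include <stdint.h>

/* Warren's inductive-doubling parity reduction: fold the xor of
   halves down from 16-bit to 1-bit distance, then keep the low bit. */
static uint32_t parity32(uint32_t x) {
    uint32_t y = x ^ (x >> 1);
    y = y ^ (y >> 2);
    y = y ^ (y >> 4);
    y = y ^ (y >> 8);
    y = y ^ (y >> 16);
    return y & 1;
}
```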
Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
code for the least power-of-2 ceiling of a 32-bit value. The
technique is to employ inductive doubling to fill in one bits at
all bit positions after the first one bit in the original value, thus
producing a value of the form $2^n-1$. This is then incremented
to determine the power-of-2 ceiling. This function and
its straightforward adaptation for SWAR application on a
set of 32-bit fields require 12 operations. The
corresponding IDISA implementation of Figure \ref{ID-clp2}
requires 7 operations, just under a 2X improvement.

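Warren's power-of-2 ceiling code likewise translates directly into a C function for a single 32-bit value:

```c
#include <assert.h>
#include <stdint.h>

/* Warren's least power-of-2 ceiling: fill in ones below the leading
   one bit to reach 2^n - 1, then increment. */
static uint32_t clp2(uint32_t x) {
    x = x - 1;
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x + 1;
}
```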
\subsection{Packed DNA Representation}

DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T. Internally,
these sequences are frequently stored
as packed sequences of 2-bit values. The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
allow these packed representations to be quickly computed
from byte-oriented string values by two steps of inductive
halving. Similarly, conversion back to string form
can use two steps of inductive merging. Without direct
support for these pack and merge operations, the SWAR
implementations of these conversions incur the cost
of explicit masking and shifting in combination with
the 16-bit to 8-bit packing and 8-bit to 16-bit
merging operations supported by existing SWAR facilities
on commodity processors.

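To make the byte-oriented starting point concrete, the following scalar C sketch packs four nucleotide characters into one byte of 2-bit codes. The encoding (A=0, C=1, G=2, T=3) and the helper names are illustrative assumptions on our part; the mask-and-shift work shown here is exactly the kind of work the pack operations replace.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2-bit nucleotide encoding: A=0, C=1, G=2, T=3. */
static unsigned code2(char c) {
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;  /* 'T' */
    }
}

/* Pack four nucleotide characters into one byte, first character in
   the high-order position.  Illustrative scalar sketch only. */
static uint8_t pack4(const char *s) {
    return (uint8_t)((code2(s[0]) << 6) | (code2(s[1]) << 4) |
                     (code2(s[2]) << 2) | code2(s[3]));
}
```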
\subsection{String/Decimal/Integer Conversion}
\begin{figure}
\begin{center}\small
\begin{verbatim}
b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b = (b & 0x0000FFFF) + 10000 * (b >> 16);
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c10 = simd<16>::const(10);
c100 = simd<16>::const(100);
c10000 = simd<32>::const(10000);
b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer Implementation}
\label{ID-BCD2int}
\end{figure}

Just as DNA sequences represent an important use case for
SWAR operations on 2-bit fields, packed sequences of
decimal or hexadecimal digits represent a common use case
for 4-bit fields. These representations can be used
both as an intermediate form in numeric string to integer
conversion and as a direct representation for
packed binary coded decimal.

Figure \ref{BCD2int} shows a three-step inductive
doubling implementation for conversion of 32-bit packed BCD
values to integer form. The 32-bit value consists
of eight 4-bit decimal digits. Pairs of digits are
first combined by multiplying the higher digit
of the pair by 10 and adding. Pairs of these
two-digit results are then further combined by
multiplying the value of the higher of the two-digit
results by 100 and adding. The final step is
to combine four-digit results by multiplying the
higher one by 10000 and adding. Overall, 20
operations are required for this implementation
as well as for the corresponding SWAR implementation
on sets of 32-bit fields. Preloading of 6 constants
into registers for repeated use can reduce the number of
operations to 14 at the cost of significant register
pressure.

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}. This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each of the addition and multiplication operations
modified at each level. Overall, this implementation
requires 9 operations, or 6 operations with 3
preloaded constants. This represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.


\subsection{Systematic Support for Horizontal SIMD Operations}

In SIMD parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers. Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec). Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing a horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs a horizontal add of sets of
four unsigned 8-bit fields within one register
and then combines the result with the
corresponding 32-bit fields of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is applied, whether it is applied
before or after the horizontal operation, and the
nature of the vertical combining operation.
Within this space, commodity SIMD architectures tend
to support only a very few combinations, without any
particular attempt at systematic support for horizontal
operations in general.

In contrast to this {\em ad hoc} support on commodity
processors, IDISA offers a completely systematic treatment
of horizontal operations without any special features beyond
the inductive doubling features already described.
In the simplest case, any vertical operation
\verb#simd<n>::F# on $n$-bit fields gives rise to
an immediate horizontal operation
\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
while \verb#simd<32>::min<h,l># can produce the
minimum value of adjacent 16-bit fields.
Thus any binary horizontal operation can be implemented
in a single IDISA instruction making use of the \verb:<h,l>:
operand modifier combination.

Horizontal combinations of four adjacent fields can also be
realized in a general way through two steps of inductive
doubling. For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation
on four 32-bit fields. This four-field combination
can easily be implemented with the following two operations.
%\begin{singlespace}
\begin{verbatim}
t = simd<64>::or<h,l>(x, x)
t = simd<128>::or<h,l>(t, t)
\end{verbatim}
%\end{singlespace}

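The effect of these two steps can be sketched in scalar C, treating the four 32-bit fields as an array. The layout and function name are illustrative assumptions, not SPU code; the point is that the or-across reduces by pairwise combination in two doubling steps.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar emulation of an or-across of four 32-bit fields in the
   spirit of the SPU's si_orx, by two inductive-doubling steps. */
static uint32_t orx4(const uint32_t x[4]) {
    uint32_t t0 = x[0] | x[1];   /* first doubling step: adjacent pairs */
    uint32_t t1 = x[2] | x[3];
    return t0 | t1;              /* second doubling step: combine halves */
}
```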
In general, systematic support for horizontal
combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
in a similar fashion.
Thus, IDISA essentially offers systematic support
for horizontal operations entirely through the
use of \verb:<h,l>: half-operand modifier
combinations.

Systematic support for general horizontal operations
under IDISA also creates an opportunity for a design tradeoff:
offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SIMD operations.
Even if legacy support for these operations is required,
it may be possible to provide that support through
software or firmware rather than a full hardware
implementation. Evaluation of these possibilities
in the context of particular architectures is a potential
area for further work.


\section{Implementation}

We have carried out implementation work for IDISA in three
ways. First, we have constructed libraries that
implement the IDISA instructions by template and/or macro
expansion for each of the MMX, SSE, Altivec, and SPU platforms.
Second, we have developed a model implementation
involving a modified operand fetch component
of a pipelined SIMD processor. Third, we have written
and evaluated a Verilog HDL description of this model
implementation.

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template
and macro libraries has been useful in developing
and assessing the correctness of many of the algorithms
presented here. Although these implementations do not
deliver the performance benefits associated with
direct hardware implementation of IDISA, they
have been quite useful in providing a practical means
for portable implementation of parallel bit stream
algorithms on multiple SWAR architectures. However,
one additional facility has also proven necessary for
portability of parallel bit stream algorithms across
big-endian and little-endian architectures: the
notion of shift-forward and shift-back operations.
In essence, shift forward means shift to the left
on little-endian systems and shift to the right on
big-endian systems, while shift back has the reverse
interpretation. Although this concept is unrelated to
inductive doubling, its inclusion with the IDISA
libraries has provided a suitable basis for portable
SIMD implementations of parallel bit stream algorithms.
Beyond this, the IDISA libraries have the additional
benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without need for programmer coding of
the underlying shift and mask operations.

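As an illustration of what such a library must do without hardware support, a 2-bit field addition can be emulated on a 32-bit word using the usual shift-and-mask carry trick. This sketch is our own illustration of the technique, not code from the IDISA libraries:

```c
#include <assert.h>
#include <stdint.h>

/* Emulate a simd<2>::add on a 32-bit word: xor gives the carry-free
   sum, the carry out of each field's low bit is propagated into the
   high bit, and carries across field boundaries are discarded, so
   each 2-bit field holds (a + b) mod 4.  Illustrative only. */
static uint32_t simd2_add(uint32_t a, uint32_t b) {
    uint32_t low_carry = (a & b & 0x55555555u) << 1;
    return a ^ b ^ low_carry;
}
```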
\subsection{IDISA Model}
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}

Figure \ref{pipeline-model} shows a block diagram
of a pipelined SIMD processor implementing IDISA.
The SIMD Register File (SRF) provides a file of $R = 2^A$
registers, each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch
Unit (IFU) are forwarded for decoding to the SIMD
Instruction Decode Unit (SIDU). This unit decodes
the instruction to produce
signals identifying the source and destination
operand registers, the half-operand modifiers, the
field width specification and the SIMD operation
to be applied.

The SIDU supplies the source register information and the half-operand
modifier information to the SIMD Operand Fetch Unit (SOFU).
For each source operand, the SIDU provides an $A$-bit register
address and two 1-bit signals $h$ and $l$ indicating the value
of the decoded half-operand modifiers for this operand.
Only one of these values may be 1; both are 0 if
no modifier is specified.
In addition, the SIDU supplies decoded field width information
to both the SOFU and the SIMD Instruction Execute Unit (SIEU).
The SIDU also supplies decoded SIMD opcode information to the SIEU and
a decoded $A$-bit register address for the destination register to
the SIMD Result Write Back Unit (SRWBU).

The SOFU is the key component of the IDISA model that
differs from that found in a traditional SWAR
processor. For each of the two $A$-bit source
register addresses, the SOFU is first responsible for
fetching the raw operand values from the SRF.
Then, before supplying operand values to the
SIEU, the SOFU applies the half-operand modification
logic as specified by the $h$, $l$, and field-width
signals. The possibly modified operand values are then
provided to the SIEU for carrying out the SIMD operations.
A detailed model of SOFU logic is described in the following
subsection.

The SIEU differs from similar execution units in
current commodity processors primarily by providing
SIMD operations at each field width
$n=2^k$ for $0 \leq k \leq K$. This involves
additional circuitry for field widths not supported
in existing processors. For inductive doubling
algorithms in support of parallel bit streams,
the principal need is for additional circuitry to
support 2-bit and 4-bit field widths. This circuitry
is generally less complicated than that for larger
fields. Support for these field widths
has other applications as well. For example,
DNA sequences are frequently represented using
packed sequences of 2-bit codes for the four possible
nucleotides\cite{}, while the need for accurate financial
calculation has seen a resurgence of the 4-bit
packed BCD format for decimal floating point \cite{}.

When execution of the SWAR instruction is
completed, the result value is then provided
to the SRWBU to update the value stored in the
SRF at the address specified by the $A$-bit
destination operand.

\subsection{Operand Fetch Unit Logic}

Discussion of gate-level implementation.


\section{Conclusions}

In considering architectural support for
SIMD text processing using the method of parallel bit streams,
this paper has presented the principle of inductive doubling
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA features support for SWAR operations at all power-of-2
field widths, including in particular 2-bit and 4-bit widths,
as well as half-operand modifiers and pack and merge operations
to support efficient transition between successive power-of-2
field widths. The principal innovation is the notion of
half-operand modifiers that eliminate the cost associated
with the explicit mask and shift operations required for
such transitions.

Several algorithms key to parallel bit stream methods
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on existing architectures. In the case
of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with the
lowest total number of operations that can be achieved by
any possible 3-register SWAR architecture.

Applications of IDISA in other areas have also been
examined. The support for 2-bit and 4-bit field widths
in SWAR processing is beneficial for packed DNA representations
and packed decimal representations, respectively. Additional
inductive doubling examples have been presented and the phenomenon
of power-of-2 transitions discussed more broadly.
Most significantly, IDISA supports a fully general approach
to horizontal SWAR operations, offering a considerable
improvement over the {\em ad hoc} sets of special-purpose
horizontal operations found in existing SWAR instruction sets.

An IDISA implementation model has been presented employing
a customized operand fetch unit to implement the half-operand
modifier logic. Gate-level implementation of this unit
has been analyzed and shown to be quite reasonable.

Many special-purpose operations that are found in existing
processors could instead be implemented through efficient
IDISA sequences. These include examples such as population
count, count leading and/or trailing zeroes, and parity.
They also include specialized horizontal SIMD operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special-purpose instructions
in favor of more general IDISA features.

Other tradeoffs may be possible in IDISA implementation itself.
Full support of IDISA features up to the largest field widths
is not necessary in many cases. For example, a given 128-bit
SIMD facility may support IDISA features only up to 32-bit
or 64-bit fields, similar to the current Altivec and SSE
architectures, respectively. It may also be possible to
reduce the complexity of operand fetch circuitry by a factor
of two by dedicating one operand to a possible high half-operand
modifier and the other to a possible low half-operand modifier.

Future research may consider the extension of inductive doubling
support in further ways. For example, it may be possible
to develop a pipelined architecture supporting two or three
steps of inductive doubling in a single operation.

%\appendix
%\section{Appendix Title}
%
%This is the text of the appendix, if you need one.

\acks

This research was supported by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada.

%\bibliographystyle{plainnat}
\bibliographystyle{plain}
\bibliography{asplos094-cameron}
%\begin{thebibliography}{}

%\bibitem{smith02}
%Smith, P. Q. reference text

%\end{thebibliography}


\end{document}