%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
           {School of Computing Science, Simon Fraser University}
           {\{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching.  Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task.  A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique.  These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement.  The extensions are also shown to be useful
for SIMD programming in other application areas, including providing a
systematic treatment for horizontal operations.  An implementation model for
these extensions
involves relatively simple circuitry added to the operand fetch components
in a pipelined processor.
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2


\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing to be the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  In fact,
the study even speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle for traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes).  Loading bit stream data into 128-bit SIMD registers,
then, allows data from 128 consecutive code units to be represented and
processed at once.  Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
speed-up using parallel bit stream techniques on SIMD-capable uniprocessors
employing the SSE or Altivec instruction sets.  A full implementation
for each of these platforms is documented using literate programming techniques
and available as an open source code base \cite{u8u16}.
Although this transcoding task represents but a single text processing kernel,
it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
of processing time \cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
it illustrates a text processing task that can largely
be carried out using SIMD operations even though
UTF-8 character sequences are variable length.  Indeed, one weakness of the
actual implementation is that it reverts to byte-at-a-time processing
at block boundaries, employing a block shortening strategy to
reduce block lengths from 128 bytes to as low as 125.
With more capable SIMD instruction set architectures providing
better facilities for multiregister shifts and larger register
files, a solution employing SIMD techniques for virtually all
of the main work would maintain better data alignment, avoid
problems of register pressure and be easier to parallelize across
multiple cores.  It should also naturally scale to 256-bit SIMD technology
such as the expected AVX technology of Intel.

The report addressing the broader problem of XML parsing is perhaps more
interesting, demonstrating the utility of parallel bit stream techniques
in delivering performance benefits through a significant portion of the
web technology stack.  In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with a similarly
coded application using such well-known parsers as Expat and Xerces.
Although still a work in progress, the parser has functional
coverage of XML and related specifications comparable to, but somewhat
beyond, that of Expat.   The study also identifies promising further
work in extending the parallel bit stream methods to parallel
hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SIMD instruction set
features based on the principle of systematic support for
inductive doubling algorithms.  Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration.  In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths.   These transitions are quite frequent in several critical
algorithms for parallel bit streams, and they also
occur in other applications.  In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported quite useful for a number of SIMD kernels \cite{Nuzman}.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.

The second section of this paper introduces IDISA and the
SIMD notation used throughout this paper.  A brief comparison of
IDISA features with existing SIMD instruction
sets of commodity processors such as the Intel SSE
instruction set and the Power PC Altivec instruction set
is also made.

The third section then discusses the evaluation methodology
for IDISA.  Two reference architectures are briefly described as
well as criteria for evaluating IDISA against these architectures.

The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count.   Although this operation
is not a strong determinant of performance for parallel bit
stream applications, it is nevertheless an operation needed
frequently enough in the general computing milieu to find
its way into some instruction set architectures, typically
at one particular field width.
%By way of comparison, the
%inductive doubling architecture sacrifices some
%performance at the chosen field width, while offering a more
%general solution with frequently better performance at
%other field widths.

The fifth section then moves on to consider the performance-critical
and key task of conversion between serial byte streams and parallel
bit streams.  A first algorithm that uses the existing SIMD
operations common to SSE and Altivec is shown, requiring 72
operations to transform 128 bytes of data using the three-register
instruction form.  We then move on to consider how the task may
be simplified using IDISA to
require a mere 24 operations.  As well as providing a 3X speed-up,
it is also argued that the version using the inductive doubling
architecture is considerably simpler and easier to program.

The sixth section then briefly considers the inverse transposition
process, converting parallel bit stream data back into byte
streams.  Again, an algorithm to carry out this task requires
72 straight-line SIMD operations in the Altivec three-register
form, but is reduced to just 24 operations using IDISA.

The seventh section then goes on to consider the problem of
parallel bit deletion.  This operation is performance-critical
to any applications that require filtering or
editing operations on strings using the parallel bit stream
algorithms.   For example, it is fundamental to the
high-speed UTF-8 to UTF-16 transcoding algorithm that is
often a critical component in XML parsing.  In this
section, an inductive doubling algorithm based on the
concept of a central deletion result is described and
shown to have much better performance than a parallel-prefix
alternative.

The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Additional examples from several domains are presented.
Perhaps most importantly, however, the role of IDISA
in supporting a fully systematic treatment of {\em horizontal}
SWAR operations for any application area is discussed.

An implementation model for IDISA is then considered
in section 9 of the paper, focusing on a pipelined
SIMD architecture featuring a modified operand fetch stage.
A gate-count analysis of one feasible implementation is
provided as well as a discussion of the implementation
of required extensions to handle 2-bit and 4-bit fields not
commonly supported on existing SIMD architectures.   Design
tradeoffs are also considered, focusing on the potential removal of
various {\em ad hoc} instructions on existing processors in favor of
more general alternatives provided through IDISA.

The tenth section concludes the paper with a summary of results
and discussion of areas for future work.


\section{Inductive Doubling Architecture}

This section presents an idealized model for a single-instruction
multiple-data (SIMD) instruction set architecture designed
specifically to support inductive doubling algorithms in the
domain of parallel bit stream programming.   The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SIMD applications in other domains.  The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design.  First we introduce a simple model and notation for
SIMD operations in general and then present the four key
features of an idealized architecture in support of parallel
bit stream programming.

The idealized architecture supports typical SIMD integer
operations common to existing commodity architectures such as SSE
and Altivec.   The architecture is defined to support SIMD
operations on registers of size $N=2^K$ bits, for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.   The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$.   Typical values of $k$ used on commodity processors
include $k = 3$ for SIMD operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n - 1$ of
register $r$, using big-endian numbering.

Let \verb:simd<n>: represent the class of SIMD operations defined
on fields of size $n$ using C++ template syntax.  Given a
binary function $F_n$ on $n$-bit fields, we denote the SIMD
version of this operation as \verb#simd<n>::F#.  Given two
SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\[ r_i = F_n(a_i, b_i) \]

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
%\doublespace
The SSE and Altivec instruction sets support
each of the addition and subtraction operations in SIMD form
for 8, 16 and 32-bit fields, while the SSE set also includes
the 64-bit forms.  For example, \verb#simd<8>::add# in our
nomenclature is provided by the operation \verb:paddb:
on SSE and the operation \verb:vaddubm: on Altivec.
The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields shift by the same amount.

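As a concrete illustration of this notation, the following C sketch (a
scalar simulation for exposition only, assuming a 64-bit register model
with $n < 64$) computes \verb#r=simd<n>::add(a,b)# field by field.
\begin{verbatim}
#include <stdint.h>

/* Scalar model of r = simd<n>::add(a, b):
   (a_i + b_i) mod 2^n in each n-bit field. */
uint64_t simd_add(uint64_t a, uint64_t b, unsigned n) {
    uint64_t mask = (1ULL << n) - 1;
    uint64_t r = 0;
    for (unsigned i = 0; i < 64; i += n) {
        uint64_t sum = ((a >> i) & mask) + ((b >> i) & mask);
        r |= (sum & mask) << i;    /* store field value i */
    }
    return r;
}
\end{verbatim}
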
Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values.  The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers.  Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms on bit streams, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms.  In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width.  In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width.  In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core.  In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1.  These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available.  For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation on SIMD registers.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SIMD
operation.  These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:'').  When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified.  The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
together with the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.

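The following scalar sketch (again an expository model, assuming
$n < 64$) shows the modifiers as C functions on a single field value,
together with one field of the \verb#simd<n>::add<h,l># combination
built from them.
\begin{verbatim}
#include <stdint.h>

/* l(r) = r mod 2^(n/2); h(r) = r div 2^(n/2). */
uint64_t lo_half(uint64_t r, unsigned n) {
    return r % (1ULL << (n / 2));
}
uint64_t hi_half(uint64_t r, unsigned n) {
    return r / (1ULL << (n / 2));
}

/* One field of simd<n>::add<h,l>(a, b). */
uint64_t add_h_l(uint64_t a, uint64_t b, unsigned n) {
    return (hi_half(a, n) + lo_half(b, n)) % (1ULL << n);
}
\end{verbatim}
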
The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations.  Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.

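For reference, a scalar model of the conv function (signed saturation
to $n/2$ bits; illustrative only, assuming the result width is below 64
bits) may be written as follows.
\begin{verbatim}
#include <stdint.h>

/* conv(v, half): saturate a signed value to the range
   of a half-bit two's complement field. */
int64_t conv(int64_t v, unsigned half) {
    int64_t hi = (1LL << (half - 1)) - 1;   /*  2^(half-1)-1 */
    int64_t lo = -(1LL << (half - 1));      /* -2^(half-1)   */
    return v > hi ? hi : v < lo ? lo : v;
}
\end{verbatim}
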
The final facility of the inductive doubling architecture is
a set of merging operations \verb#simd<n>::mergeh# and
\verb#simd<n>::mergel# that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers.   The respective
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths.  The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.

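A scalar model on a reduced 32-bit register (so $N=32$, $n=8$; an
illustration of the defining equations, not an implementation) shows
how \verb#simd<8>::mergeh# forms $2n$-bit fields from the high halves
of its operands.
\begin{verbatim}
#include <stdint.h>

/* simd<8>::mergeh for N=32: interleave big-endian
   fields 0 and 1 of a and b.
   r_16[i] = a_8[i] * 2^8 + b_8[i], for i = 0, 1. */
uint32_t mergeh8(uint32_t a, uint32_t b) {
    uint32_t a0 = (a >> 24) & 0xFF, a1 = (a >> 16) & 0xFF;
    uint32_t b0 = (b >> 24) & 0xFF, b1 = (b >> 16) & 0xFF;
    return (a0 << 24) | (b0 << 16) | (a1 << 8) | b1;
}
\end{verbatim}
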
This completes the description of IDISA.  As described, many of the
features are already available with the SIMD facilities of
existing commodity processors.  The extensions enumerated
here are relatively straightforward.  The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor.  The goal
in this paper is to evaluate these features independent
of any artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures.  For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields.  The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields.  In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1.  We also assume that
a constant load operation \verb#simd<n>::const(c)#
loads the constant value $c$ into each $n$-bit field.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)# rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-register \verb#simd<1>::if(a, b, c)# bitwise logical
operator implements the logic $(a \wedge b) \vee (\neg a \wedge c)$, patterned
after the Altivec \verb:vec_sel: operation.  Finally,
a \verb#simd<8>::permute(a, b, c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

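In bitwise terms, this select operation may be modeled as follows (a
scalar sketch of the semantics only).
\begin{verbatim}
#include <stdint.h>

/* Scalar model of simd<1>::if(a, b, c): bits of b
   where a has 1s, bits of c where a has 0s. */
uint64_t simd1_if(uint64_t a, uint64_t b, uint64_t c) {
    return (a & b) | (~a & c);
}
\end{verbatim}
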
Two versions of IDISA are assessed against these reference
architectures as follows.  IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128.  IDISA-B is
similarly defined and extended based on RefB.  Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption.  On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be well-designed
pipeline processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access.  Such processors are certainly feasible with a
suitable investment in circuitry, although practical designs
may make sacrifices to achieve close to optimal throughput
with less circuitry.  However, the assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.  Furthermore,
the assumption also eliminates artifacts due to possibly different
latencies in reference and IDISA architectures.  Because
the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, the
applications tend to be compute-bound rather than I/O-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

\section{Population Count}

\begin{figure}[h]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  It is important
enough for such operations as
calculating Hamming distance to be included as a built-in instruction
on some processors.  For example, the SPU of the Cell Broadband Engine
has a SIMD population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields.  Each implementation employs
five steps of inductive doubling to produce population counts
within 32-bit fields.  The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SIMD operation.   Without the IDISA features, a
straightforward SWAR implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, it is possible that these constants
could be kept preloaded in long vector processing.
In any case, the IDISA implementation requires only 5 operations
to carry out the work requiring 15 to 20 operations with the
optimized version using standard SWAR operations.   IDISA thus
offers a 3X to 4X improvement in instruction count as well as
a reduction in register pressure.

The pattern illustrated by population count is typical.  An
inductive doubling algorithm of $n$ steps typically applies
a mask or shift operation at each step to each of the
two operands being combined in the step.   If the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations is the minimum required for the SWAR
algorithm without IDISA features.   An IDISA implementation
typically eliminates the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$.   Thus a 3X improvement
is typical.

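The correspondence between the two implementations can be checked with
a small scalar simulation (a sanity check only; the masks are explicit
here because a scalar model must synthesize the half-operand selection
that the IDISA instructions perform implicitly).
\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Simulate the five simd<n>::add<h,l> steps of the IDISA
   population count on one 32-bit "register". */
uint32_t idisa_popcount32(uint32_t x) {
    uint32_t c = x;
    for (unsigned n = 2; n <= 32; n *= 2) {
        uint32_t lo = 0;
        for (unsigned i = 0; i < 32; i += n)
            lo |= ((1u << (n / 2)) - 1) << i;
        c = (c & lo) + ((c >> (n / 2)) & lo);
    }
    return c;
}

int main(void) {
    printf("%u\n", idisa_popcount32(0xDEADBEEFu));  /* prints 24 */
    return 0;
}
\end{verbatim}
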
\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of the inductive doubling architecture:
transposition of byte stream data to parallel bit stream
form.   Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process.  Before considering how
the inductive doubling architecture supports this
transposition process, however, we first consider
algorithms on existing architectures.  Two algorithms
are presented; the best of these requires 72
SIMD operations in the three-register model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using the inductive doubling architecture
with a mere 24 three-register SIMD operations.  We also prove
that this is optimal for any three-register instruction set model.


\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
\caption{Input/Output Model for Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}

\end{figure}
Figure \ref{s2p-spec} illustrates the input-output requirements of
the transposition problem.  We assume that inputs and
outputs are each SIMD registers of size $N=2^K$ bits.
The input consists of $N$ bytes of serial byte data,
stored consecutively in eight SIMD registers each holding
$N/8$ bytes.  The output consists of eight parallel
registers, one each for the eight individual bit positions
within a byte.  Upon completion of the transposition process,
each output register is to hold the $N$ bits corresponding
to the selected bit position in the sequence of $N$ input
bytes.

\subsection{Bit Gathering Algorithm}

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
\caption{Serial to Parallel Transposition Using Bit-Gathering}
\label{gather}
\end{center}
\end{figure}
One straightforward algorithm for implementing the transposition process
takes advantage of SIMD bit gathering operations that exist
on some architectures.  This operation gathers one bit per byte
from a particular position within each byte of a SIMD register.
For example, the {\tt pmovmskb} operation of the Intel MMX and
SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
consisting of the high bit of each byte.  Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte
from the SIMD register.  Figure \ref{gather} illustrates the
bit gathering process.

For each bit stream of output, the bit gather algorithm requires
one gather operation for each of the 8 input registers,
giving 64 bit-gather operations in all.  In addition, for seven
of the eight bit positions, it is necessary to shift the bits
of each input register into the conventional position for
gathering.  A total of 56 shift operations are required for this
task.  Finally, the result of each bit gather operation must
be properly inserted into the output stream.  If the first
gather operation for a stream can be used to also initialize
the output register, there will then need to be 7 insert
operations for the results of the remaining gather operations
for this stream, making 56 insert operations in all.
The insert step may be more complex than a single operation
in some cases, but we use one operation per insert as a lower bound.
Thus, the bit gather algorithm requires
at least 176 operations to perform the transposition task.

\subsection{BytePack Algorithm}

A much more efficient transposition algorithm on commodity
SIMD architectures (SSE and Altivec) involves three
stages of binary division transformation.  This is similar
to the three-stage bit matrix inversion described by
Warren \cite{HackersDelight}, although modified to use SIMD operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions.  The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even-numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage is
divided into the desired individual bit streams in the third stage.

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
t0 = simd<16>::pack<h,h>(s0, s1);
t1 = simd<16>::pack<l,l>(s0, s1);
p0 = simd_if(simd<16>::const(0xAAAA), t0, simd<16>::srli<1>(t1));
p1 = simd_if(simd<16>::const(0xAAAA), simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking.  In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.  In essence,
doublebytes from the input registers are packed into bytes
in the output registers, with the bits from even positions stored
in the bytes of one output stream and the bits from odd positions
stored in the bytes of the second output stream.
Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
producing two parallel registers \verb:p0: and \verb:p1: from
two serial registers \verb:s0: and \verb:s1:.  This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and masking constants.

Although we have used the idealized SIMD notation here, each of the
operations maps to a single operation in the Altivec set and a small number
of operations in the SSE set.  Using the Altivec set, there are
6 operations per step for a total of 24 operations per stage.
The three stages combined thus require 72 operations to transpose 128 bytes
to parallel bit stream form.  This is the best algorithm known to
us for existing SIMD architectures.

\subsection{Inductive Halving Algorithm}

Using the inductive doubling architecture, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the bytepack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bitpairs in a second stage and finally
dividing the bitpair streams into bit streams in the third stage.


\begin{figure}[tbh]
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two SIMD operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte.  As in the bytepack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs.  The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble.  Four applications of this step complete stage 2.

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair.  The complete algorithm to transform eight serial
byte registers s0 through s7 into the eight parallel bit stream
registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.

\begin{figure}[tbh]
\small
\begin{verbatim}
hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
\end{verbatim}
\caption{Complete Inductive Halving Algorithm}
\label{halvingalgorithm}
\end{figure}

\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SIMD architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SIMD architecture, we refer to any architecture
that uses SIMD instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition.  Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address.  However, each
operation can move results into at most one register.
At most, then, each operation can assign one correct address
bit to each of the $N$ bits it moves.  As all $8N$ input bits
need to be moved to a register with a correct 3-bit address,
$24N$ correct address bits must be assigned in all, so
a minimum of 24 operations is required.

\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}.  Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone (from 1.1 to roughly 0.37 cycles per byte) would improve
end-to-end performance in a
complete XML processing application by more than 4\%.


\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions.  In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed.  In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit.  Those two byte streams
are subsequently merged to form the final result.

Algorithms for performing the inverse transform mirror those
of the forward transform, employing SIMD merge operations
in place of pack operations.  The best algorithm known
to us on the commodity SIMD architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets.  These algorithms take 72 operations to perform the
inverse transposition of 8 parallel registers of bit stream
data into 8 serial registers of byte stream data.

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
bit01_r0 = simd<1>::mergeh(bit0, bit1);
bit01_r1 = simd<1>::mergel(bit0, bit1);
bit23_r0 = simd<1>::mergeh(bit2, bit3);
bit23_r1 = simd<1>::mergel(bit2, bit3);
bit45_r0 = simd<1>::mergeh(bit4, bit5);
bit45_r1 = simd<1>::mergel(bit4, bit5);
bit67_r0 = simd<1>::mergeh(bit6, bit7);
bit67_r1 = simd<1>::mergel(bit6, bit7);
bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
\end{verbatim}
\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
\end{figure}

An algorithm employing only 24 operations using the
inductive doubling instruction set architecture is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bitpair streams into streams
of nybbles for the high and low nybble of each byte.  Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data.  The full inductive doubling
algorithm for parallel to serial transposition is shown in Figure
\ref{p2s-inductive}.

This algorithm is again optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model.  The proof directly follows that for the serial to parallel
transposition.

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.


\section{Parallel Bit Deletion}

\begin{figure*}[tbh]
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form.  It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.

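The deletion semantics can be made concrete with a character-level
sketch (a specification aid only, not the parallel algorithm).
\begin{verbatim}
#include <stdio.h>

/* Keep stream elements where the deletion mask is 0. */
void del(const char *mask, const char *s, char *out) {
    int j = 0;
    for (int i = 0; s[i]; i++)
        if (mask[i] == '0') out[j++] = s[i];
    out[j] = '\0';
}

int main(void) {
    char out[9];
    del("10100010", "abcdefgh", out);
    printf("%s\n", out);    /* prints bdefh */
    return 0;
}
\end{verbatim}
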
Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process.  However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream.  Using the inductive doubling instruction set architecture
it is possible to carry out bit deletion much more efficiently.

Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.


The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm.  Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results.  The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}.  Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field.  Similarly,
left-justification requires a left-shift by the population count in the
high half of the field.

The left and right shifts can be performed simultaneously using a rotate
left instruction.  Right justification by shifting an $n$ bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions.  These rotation amounts are computed by the operation
\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
except that don't care fields (which won't be subsequently used)
are marked \verb:XX:.

The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
as shown in row 6, and are combined with the right shift amounts
by the selection operation \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
as shown in row 7.  Using these computed values, the inductive step
is completed by application of the operation \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.

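The rotation amounts for the first 16-bit field of Figure
\ref{centraldelstep} can be checked by direct calculation (a scalar
sketch; the field counts are taken from row 4 of the figure).
\begin{verbatim}
#include <stdio.h>

int main(void) {
    /* cts_4 for the first 16 mask bits 1001 1100 0100 1111 */
    unsigned cts_4[4] = {2, 2, 1, 4};
    /* right-justify the high 8-bit field: rotate left by
       8 minus the popcount of its low nybble */
    unsigned rj = 8 - cts_4[1];
    /* left-justify the low 8-bit field: rotate left by
       the popcount of its high nybble */
    unsigned lj = cts_4[2];
    printf("rj=%u lj=%u\n", rj, lj);  /* rj=6 lj=1, rows 5-7 */
    return 0;
}
\end{verbatim}
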
At each inductive doubling level, it requires 4 operations to compute the
required deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture has far fewer operations.
The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
induction versus $4k^2$ for the parallel-prefix method.  Using the computed
deletion information, only a single SIMD rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.


\section{Beyond Parallel Bit Streams}

It can be argued that text processing with parallel bit
streams by itself is not sufficiently important to justify
the circuit complexity of IDISA.  In this section, then,
we move on to consider further applications that may
benefit from IDISA features.  These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
Further applications for narrow field widths are discussed
as well, such as packed DNA representations using 2-bit
field widths and packed decimal representations using 4-bit
field widths.  Most significantly, however, we show that IDISA offers
a fully general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Inductive Doubling with Logic Operations}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >>16);
y = y & 1;
\end{verbatim}
\end{center}
\caption{Parity Reference Algorithm}
\label{HD-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = x - 1;
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
x = x + 1;
\end{verbatim}
\end{center}
\caption{Power-of-2 Ceiling Reference Algorithm}
\label{HD-clp2}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x, h>(x, x);
x = simd<4>::or<x, h>(x, x);
x = simd<8>::or<x, h>(x, x);
x = simd<16>::or<x, h>(x, x);
x = simd<32>::or<x, h>(x, x);
x = simd<32>::add(x, simd<32>::const(1));
\end{verbatim}
\end{center}
\caption{IDISA Power-of-2 Ceiling Implementation}
\label{ID-clp2}
\end{figure}

Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
code for two further inductive doubling examples using
logical operators as the combining feature.  In the
first of these, the ``exclusive or'' operation is applied
to all bits in a 32-bit value to determine parity.
Parity has important applications for error-correcting
codes such as the various Hamming codes for detecting
and correcting numbers of bit errors dependent on the
number of parity bits added.  Warren's version uses
11 operations for parity of a single 32-bit value;
a straightforward SWAR adaptation also uses 11 operations
for the parity of a set of 32-bit fields.

Figure \ref{ID-parity} shows the IDISA parity implementation
with only 5 operations required.  This represents slightly
more than a 2X improvement.  The improvement is less than the
3X seen in other cases because one of the operands need
not be modified before applying the exclusive-or operation.

1136Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
1137code for the least power-of-2 ceiling of a 32-bit value.  The
technique is to employ inductive doubling to fill in 1 bits at
all positions below the leading 1 bit, first producing
a value of the form $2^n-1$.  This is then incremented
to determine the power-of-2 ceiling.  This function and
its straightforward adaptation for SWAR application on a
set of 32-bit fields each require 12 operations.   The
1144corresponding IDISA implementation of Figure \ref{ID-clp2}
1145requires 7 operations, just under a 2X improvement.
1146
1147\subsection{Packed DNA Representation}
1148
1149DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented as packed
sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
1154allow these packed representations to be quickly computed
1155from byte-oriented string values by two steps of inductive
1156halving.   Similarly, conversion back to string form
1157can use two steps of inductive merging.  Without direct
1158support for these pack and merge operations, the SWAR
1159implementations of these conversions require the cost
1160of explicit masking and shifting in combination with
1161the 16-bit to 8-bit packing and 8-bit to 16-bit
1162merging operations supported by existing SWAR facilities
1163on commodity processors.
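
As an illustration, suppose registers \verb:s0: through
\verb:s3: hold 64 consecutive nucleotides, one per byte,
each byte already recoded to its 2-bit value (the register
names and the prior recoding step are assumptions of this
sketch).  Three pack operations then complete the
conversion.
\begin{verbatim}
t0 = simd<8>::pack(s0, s1);
t1 = simd<8>::pack(s2, s3);
r  = simd<4>::pack(t0, t1);
\end{verbatim}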
1164
1165\subsection{Bit Reverse}
1166
Bit reversal within fields is a further inductive doubling
example.  The classic technique proceeds through one level
per power of 2: adjacent bits are first exchanged, then
adjacent 2-bit fields, then adjacent nybbles and so on,
with each level requiring several shift, mask and combine
operations in the reference code.  Under IDISA, the
exchange at each level reduces to a single rotate at the
corresponding field width.
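
Assuming a field-width rotate of the form
\verb#simd<n>::rotl(x, c)# (the precise form of the IDISA
rotate operation is an assumption of this sketch), each
step below swaps the two halves of every field at the
given width, reversing all 32-bit fields in five
operations.
\begin{verbatim}
x = simd<2>::rotl(x, 1);
x = simd<4>::rotl(x, 2);
x = simd<8>::rotl(x, 4);
x = simd<16>::rotl(x, 8);
x = simd<32>::rotl(x, 16);
\end{verbatim}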
1168
1169\subsection{String/Decimal/Integer Conversion}
1170\begin{figure}
1171\begin{center}\small
1172\begin{verbatim}
b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b = (b & 0x0000FFFF) + 10000 * (b >> 16);
1176\end{verbatim}
1177\end{center}
1178\caption{BCD to Integer Reference Algorithm}
1179\label{BCD2int}
1180\end{figure}
1181
1182\begin{figure}
1183\begin{center}\small
1184\begin{verbatim}
c10 = simd<8>::const(10);
c100 = simd<16>::const(100);
c10000 = simd<32>::const(10000);
1188b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
1189b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
1190b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
1191\end{verbatim}
1192\end{center}
\caption{IDISA BCD to Integer Implementation}
1194\label{ID-BCD2int}
1195\end{figure}
1196
1197Just as DNA sequences represent an important use case for
1198SWAR operations on 2-bit fields, packed sequences of
1199decimal or hexadecimal digits represent a common use case
1200for 4-bit fields.  These representations can be used
1201both as an intermediate form in numeric string to integer
1202conversion and as a direct representation for
1203packed binary coded decimal.
1204
1205Figure \ref{BCD2int} shows a three-step inductive
1206doubling implementation for conversion of 32-bit packed BCD
1207values to integer form.  The 32-bit value consists
1208of 8 4-bit decimal digits.  Pairs of digits are
1209first combined by multiplying the higher digit
1210of the pair by 10 and adding.  Pairs of these
1211two-digit results are then further combined by
1212multiplying the value of the higher of the two-digit
1213results by 100 and adding.  The final step is
1214to combine four-digit results by multiplying the
1215higher one by 10000 and adding.  Overall, 20
operations are required, both for this implementation
and for the corresponding SWAR implementation
on sets of 32-bit fields.  Preloading of 6 constants
1219into registers for repeated use can reduce the number of
1220operations to 14 at the cost of significant register
1221pressure.
1222
1223The IDISA implementation of this algorithm is shown
1224in Figure \ref{ID-BCD2int}.  This implementation
1225shows an interesting variation in the use of
1226half-operand modifiers, with only one operand
1227of each of the addition and multiplication operations
1228modified at each level.  Overall, this implementation
1229requires 9 operations, or 6 operations with 3
1230preloaded constants.  This represents more than a 2X
1231reduction in instruction count as well as a 2X reduction
1232in register pressure.
1233
1234
1235
1236
1237\subsection{Systematic Support for Horizontal SIMD Operations}
1238
1239In SIMD parlance, {\em horizontal} operations are
1240operations which combine values from two or more fields
1241of the same register, in contrast to the normal
1242{\em vertical} operations which combine corresponding
1243fields of different registers.  Horizontal operations
1244can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
1246or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
1247horizontal operations have a vertical component as well.
1248For example, \verb:psadbw: first forms the absolute value of
1249the difference of eight corresponding byte fields before
1250performing horizontal add of the eight values, while
1251\verb:vsum4ubs: on Altivec performs horizontal add of sets of
1252four unsigned 8-bit fields within one register
and then combines each result vertically with the
corresponding 32-bit field of a second register.
1255
1256The space of potential horizontal operations thus has
1257many dimensions, including not only the particular
1258combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is also applied, whether it comes
before or after the horizontal operation, and the
nature of the vertical combining operation.
1263Within this space, commodity SIMD architectures tend
1264to support only a very few combinations, without any
1265particular attempt at systematic support for horizontal
1266operations in general.
1267
1268In contrast to this {\em ad hoc} support on commodity
1269processors, IDISA offers a completely systematic treatment
1270of horizontal operations without any special features beyond
1271the inductive doubling features already described.
1272In the simplest case, any vertical operation
1273\verb#simd<n>::F# on $n$-bit fields gives rise to
1274an immediate horizontal operation
1275\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
1279while \verb#simd<32>::min<h,l># can produce the
1280minimum value of adjacent 16-bit fields.
1281Thus any binary horizontal operation can be implemented
1282in a single IDISA instruction making use of the \verb:<h,l>:
1283operand modifier combination.
1284
1285Horizontal combinations of four adjacent fields can also be
1286realized in a general way through two steps of inductive
1287doubling.  For example, consider the or-across operation \verb:si_orx:
1288of the SPU, that performs a logical or operation
1289on four 32-bit fields.  This four field combination
1290can easily be implemented with the following two operations.
1291%\begin{singlespace}
1292\begin{verbatim}
t = simd<64>::or<h,l>(x, x);
t = simd<128>::or<h,l>(t, t);
1295\end{verbatim}
1296%\end{singlespace}
1297
1298In general, systematic support for horizontal
1299combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
in a similar fashion.
Thus, IDISA essentially offers systematic support
1303for horizontal operations entirely through the
1304use of \verb:<h,l>: half-operand modifier
1305combinations.
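
For example, the horizontal-add portion of \verb:psadbw:,
which sums eight byte fields within each 64-bit half of a
register, could be realized through three such steps
(a sketch that ignores the initial vertical
absolute-difference component of \verb:psadbw:, with
\verb:x: assumed to hold unsigned byte fields):
\begin{verbatim}
t = simd<16>::add<h,l>(x, x);
t = simd<32>::add<h,l>(t, t);
t = simd<64>::add<h,l>(t, t);
\end{verbatim}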
1306
1307Systematic support for general horizontal operations
under IDISA also creates an opportunity for a design
tradeoff: offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SIMD operations.
1312Even if legacy support for these operations is required,
1313it may be possible to provide that support through
1314software or firmware rather than a full hardware
1315implementation.  Evaluation of these possibilities
1316in the context of particular architectures is a potential
1317area for further work.
1318
1319
1320\section{Implementation}
1321
We have carried out implementation work for IDISA in three
ways.  First, we have constructed libraries that
implement the IDISA instructions by template and/or macro
expansion for each of MMX, SSE, Altivec, and SPU platforms.
Second, we have developed a model implementation
involving a modified operand fetch component
of a pipelined SIMD processor.  Third, we have written
and evaluated a Verilog HDL description of this model
implementation.
1331
1332\subsection{IDISA Libraries}
1333
1334Implementation of IDISA instructions using template
1335and macro libraries has been useful in developing
1336and assessing the correctness of many of the algorithms
1337presented here.  Although these implementations do not
1338deliver the performance benefits associated with
1339direct hardware implementation of IDISA, they
1340have been quite useful in providing a practical means
1341for portable implementation of parallel bit stream
1342algorithms on multiple SWAR architectures.  However,
1343one additional facility has also proven necessary for
1344portability of parallel bit stream algorithms across
1345big-endian and little-endian architectures: the
1346notion of shift-forward and shift-back operations.
1347In essence, shift forward means shift to the left
1348on little-endian systems and shift to the right on
1349big-endian systems, while shift back has the reverse
1350interpretation.  Although this concept is unrelated to
1351inductive doubling, its inclusion with the IDISA
1352libraries has provided a suitable basis for portable
1353SIMD implementations of parallel bit stream algorithms.
1354Beyond this, the IDISA libraries have the additional
1355benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without the need for programmer coding of
1358the underlying shift and mask operations.
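
As a sketch of how such a facility might be defined (the
template names and the \verb#simd<n>::sll# and
\verb#simd<n>::srl# shift forms used here are illustrative
assumptions rather than the actual library code), a
little-endian build might use:
\begin{verbatim}
template <int n>
inline SIMD_type shift_fwd(SIMD_type x, int c) {
  return simd<n>::sll(x, c);  // left shift on little-endian
}
template <int n>
inline SIMD_type shift_back(SIMD_type x, int c) {
  return simd<n>::srl(x, c);  // right shift on little-endian
}
\end{verbatim}
with the two shift directions interchanged on a big-endian
build.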
1359
1360\subsection{IDISA Model}
1361\begin{figure}[tbh]
1362\begin{center}
1363\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
1364\caption{IDISA Block Diagram}
1365\label{pipeline-model}
1366\end{center}
1367\end{figure}
1368
1369Figure \ref{pipeline-model} shows a block diagram
1370for a pipelined SIMD processor implementing IDISA.
1371The SIMD Register File (SRF) provides a file of $R = 2^A$ 
1372registers each of width $N = 2^K$ bits. 
1373IDISA instructions identified by the Instruction Fetch
1374Unit (IFU) are forwarded for decoding to the SIMD
1375Instruction Decode Unit (SIDU).  This unit decodes
1376the instruction to produce
1377signals identifying the source and destination
1378operand registers, the half-operand modifiers, the
1379field width specification and the SIMD operation
1380to be applied.
1381
1382The SIDU supplies the source register information and the half-operand
1383modifier information to the SIMD Operand Fetch Unit (SOFU).
1384For each source operand, the SIDU provides an $A$-bit register
1385address and two 1-bit signals $h$ and $l$ indicating the value
1386of the decoded half-operand modifiers for this operand.
1387Only one of these values may be 1; both are 0 if
1388no modifier is specified.
1389In addition, the SIDU supplies decoded field width information
1390to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).
1391The SIDU also supplies decoded SIMD opcode information to SIEU and
1392a decoded $A$-bit register address for the destination register to
1393the SIMD Result Write Back Unit (SRWBU).
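
As a behavioral summary (the structure and field names here
are illustrative assumptions, not part of the hardware
specification), the decoded information supplied by the
SIDU amounts to:
\begin{verbatim}
struct DecodedIDISA {
  unsigned src1, src2, dest; /* A-bit register addresses */
  int h1, l1, h2, l2;  /* half-operand modifier signals;
                          at most one of h/l per operand */
  unsigned k;          /* field width specification n = 2^k */
  unsigned op;         /* SIMD opcode for the SIEU */
};
\end{verbatim}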
1394
1395The SOFU is the key component of the IDISA model that
1396differs from that found in a traditional SWAR
1397processor.  For each of the two $A$-bit source
1398register addresses, SOFU is first responsible for
1399fetching the raw operand values from the SRF.
1400Then, before supplying operand values to the
1401SIEU, the SOFU applies the half-operand modification
1402logic as specified by the $h$, $l$, and field-width
1403signals.  The possibly modified operand values are then
1404provided to the SIEU for carrying out the SIMD operations.
1405A detailed model of SOFU logic is described in the following
1406subsection.
1407
1408The SIEU differs from similar execution units in
1409current commodity processors primarily by providing
1410SIMD operations at each field width
1411$n=2^k$ for $0 \leq k \leq K$.  This involves
1412additional circuitry for field widths not supported
1413in existing processors.  For inductive doubling
1414algorithms in support of parallel bit streams,
1415the principal need is for additional circuitry to
support 2-bit and 4-bit field widths.  This circuitry
is generally less complicated than that for larger
fields.  Support for these field widths
has other applications as well.   For example,
DNA sequences are frequently represented using
packed sequences of 2-bit codes for the four possible
nucleotides, while the need for accurate financial
calculation has seen a resurgence of the 4-bit
packed BCD format for decimal floating point.
1425
1426When execution of the SWAR instruction is
1427completed, the result value is then provided
1428to the SRWBU to update the value stored in the
1429SRF at the address specified by the $A$-bit
1430destination operand.
1431
1432\subsection{Operand Fetch Unit Logic}
1433
The half-operand modification applied by the SOFU reduces
to simple selection at each bit position.  Under the $h$
modifier at field width $n$, bit $i$ of each field receives
the value of bit $i + n/2$ when $i < n/2$ and 0 otherwise;
under the $l$ modifier, bits $i < n/2$ pass through
unchanged and the remaining bits are zeroed.  Each output
bit can therefore be produced by a small multiplexer
selecting among the unmodified input bit, one of the $K$
possible bits located $n/2$ positions above it, and the
constant 0, under control of the decoded $h$, $l$ and
field-width signals.  The cost is on the order of $K$ gates
per bit position with a depth of only a few gate delays,
adding modestly to the operand fetch path.
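
A minimal behavioral model of this logic, written in C over
a 64-bit register slice (the function name and interface
are assumptions of this sketch, not part of the IDISA
specification):
\begin{verbatim}
#include <stdint.h>

/* Apply half-operand modification to operand x.
   n = 2^k is the decoded field width; h and l are
   the decoded modifier signals (at most one is 1). */
uint64_t sofu_modify(uint64_t x, unsigned n, int h, int l)
{
  uint64_t lo = 0;  /* mask of the low n/2 bits of
                       every n-bit field */
  for (unsigned i = 0; i < 64; i += n)
    lo |= (((uint64_t)1 << (n/2)) - 1) << i;
  if (h) return (x >> (n/2)) & lo; /* high halves, moved down */
  if (l) return x & lo;            /* low halves, zero-extended */
  return x;                        /* no modifier */
}
\end{verbatim}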
1435
1436
1437\section{Conclusions}
1438
1439In considering the architectural support for
1440SIMD text processing using the method of parallel bit streams,
1441this paper has presented the principle of inductive doubling
1442and a realization of that principle in the form of
1443IDISA, a modified SWAR instruction set architecture.
IDISA features support for SWAR operations at all power-of-2
field widths, including in particular the 2-bit and 4-bit widths,
as well as half-operand modifiers and pack and merge operations
to support efficient transition between successive power-of-2
field widths.  The principal innovation is the notion of
1449half-operand modifiers that eliminate the cost associated
1450with the explicit mask and shift operations required for
1451such transitions.
1452
1453Several algorithms key to parallel bit stream methods
1454have been examined and shown to benefit from dramatic
1455reductions in instruction count compared to the best
1456known algorithms on existing architectures.  In the case
1457of transposition algorithms to and from parallel bit stream
1458form, the architecture has been shown to make possible
1459straightforward inductive doubling algorithms with the
1460lowest total number of operations that can be achieved by
1461any possible 3-register SWAR architecture.
1462
1463Applications of IDISA in other areas have also been
1464examined.  The support for 2-bit and 4-bit field widths
1465in SWAR processing is beneficial for packed DNA representations
and packed decimal representations, respectively.  Additional
inductive doubling examples have been presented and the phenomenon
of power-of-2 transitions discussed more broadly.
1469Most significantly, IDISA supports a fully general approach
1470to horizontal SWAR operations, offering a considerable
1471improvement over the {\em ad hoc} sets of special-purpose
1472horizontal operations found in existing SWAR instruction sets.
1473
1474An IDISA implementation model has been presented employing
1475a customized operand fetch unit to implement the half-operand
1476modifier logic.  Gate-level implementation of this unit
has been analyzed and shown to be quite reasonable.
1478
1479Many special-purpose operations that are found in existing
1480processors could instead be implemented through efficient
1481IDISA sequences.  These include examples such as population
1482count, count leading and/or trailing zeroes and parity.
1483They also include specialized horizontal SIMD operations.
1484Thus, design tradeoffs can be made with the potential of
1485reducing the chip area devoted to special purpose instructions
1486in favor of more general IDISA features.
1487
1488Other tradeoffs may be possible in IDISA implementation itself.
Full support of IDISA features up to the largest field widths
is not necessary in many cases.   For example, a given 128-bit
1491SIMD facility may support IDISA features only up to 32-bit
1492or 64-bit fields, similar to the current Altivec and SSE
1493architectures, respectively.   It may also be possible to
1494reduce the complexity of operand fetch circuitry by a factor
1495of two by dedicating one operand to a possible high half-operand
1496modifier and the other to a possible low half-operand modifier.
1497
1498Future research may consider the extension of inductive doubling
1499support in further ways.   For example, it may be possible
1500to develop a pipelined architecture supporting two or three
1501steps of inductive doubling in a single operation.
1502
1503%\appendix
1504%\section{Appendix Title}
1505%
1506%This is the text of the appendix, if you need one.
1507
1508\acks
1509
1510This research was supported by a Discovery Grant from the
1511Natural Sciences and Engineering Research Council of Canada.
1512
1513%\bibliographystyle{plainnat}
1514\bibliographystyle{plain}
1515\bibliography{asplos094-cameron}
1516%\begin{thebibliography}{}
1517
1518%\bibitem{smith02}
1519%Smith, P. Q. reference text
1520
1521%\end{thebibliography}
1522
1523
1524\end{document}