%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
           {School of Computing Science, Simon Fraser University}
           {\{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching.   Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task. A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique.  These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement.  The extensions are also shown to be useful
for SIMD programming in other application areas, including providing a
systematic treatment for horizontal operations.  An implementation model for
these extensions
involves relatively simple circuitry added to the operand fetch components
in a pipelined processor.
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2



\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing to be the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  The study even
speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle with traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes).  Loading bit stream data into 128-bit SIMD registers,
then, allows data from 128 consecutive code units to be represented and
processed at once.  Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
employing the SSE or Altivec instruction sets.  A full implementation
for each of these platforms is documented using literate programming techniques
and available as an open source code base \cite{u8u16}.
Although this transcoding task represents but a single text processing kernel,
it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
of processing time \cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
it illustrates a text processing task that can largely
be carried out using SIMD operations even though
UTF-8 character sequences are variable length. Indeed, one weakness of the
actual implementation is that it reverts to byte-at-a-time processing
at block boundaries, employing a block shortening strategy to
reduce block lengths from 128 bytes to as low as 125.
With more capable SIMD instruction set architectures providing
better facilities for multiregister shifts and larger register
files, a solution employing SIMD techniques for virtually all
of the main work would maintain better data alignment, avoid
problems of register pressure and be easier to parallelize across
multiple cores.  It should also naturally scale to 256-bit SIMD technology
such as the expected AVX technology of Intel.

The report addressing the broader problem of XML parsing is perhaps more
interesting, demonstrating the utility of parallel bit stream techniques
in delivering performance benefits through a significant portion of the
web technology stack.  In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved in using
the parallel bit stream methods in comparison with a similarly
coded application using such well known parsers as Expat and Xerces.
Although still a work in progress, the parser has functional
coverage of XML and related specifications comparable to, but somewhat
beyond, that of Expat.   The study also identifies promising further
work in extending the parallel bit stream methods to parallel
hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SIMD instruction set
features based on the principle of systematic support for
inductive doubling algorithms.  Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration.  In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths.   These transitions are quite frequent in several critical
algorithms for parallel bit streams, and they
occur in other applications as well.  In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported quite useful for a number of SIMD kernels \cite{Nuzman}.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.

The second section of this paper introduces IDISA and the
SIMD notation used throughout this paper.  A brief comparison of
IDISA features with existing SIMD instruction
sets of commodity processors such as the Intel SSE
instruction set and the Power PC Altivec instruction set
is also made.

The third section then discusses the evaluation methodology
for IDISA.  Two reference architectures are briefly described as
well as criteria for evaluating IDISA against these architectures.

The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count.   Although this operation
is not a strong determinant of performance for parallel bit
stream applications, it is nevertheless an operation needed
frequently enough in the general computing milieu to find
its way into some instruction set architectures, typically
at one particular field width.
%By way of comparison, the
%inductive doubling architecture sacrifices some
%performance at the chosen field width, while offering a more
%general solution with frequently better performance at
%other field widths.

The fifth section then moves on to consider the performance-critical
and key task of conversion between serial byte streams and parallel
bit streams.  A first algorithm that uses the existing SIMD
operations common to SSE and Altivec is shown, requiring 72
operations to transform 128 bytes of data using the three-register
instruction form.  We then move on to consider how the task may
be simplified using IDISA to
require a mere 24 operations.  As well as providing a 3X speed-up,
it is also argued that the version using the inductive doubling
architecture is considerably simpler and easier to program.

The sixth section then briefly considers the inverse transposition
process, converting parallel bit stream data back into byte
streams.  Again, an algorithm to carry out this task requires
72 straight-line SIMD operations in the Altivec three-register
form, but is reduced to a simpler 24 operations using IDISA.

The seventh section then goes on to consider the problem of
parallel bit deletion.  This operation is performance-critical
to any applications that require filtering or
editing operations on strings using the parallel bit stream
algorithms.   For example, it is fundamental to the
high-speed UTF-8 to UTF-16 transcoding algorithm that is
often a critical component in XML parsing.  In this
section, an inductive doubling algorithm based on the
concept of a central deletion result is described and
shown to have much better performance than a parallel-prefix
alternative.

The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Additional examples from several domains are presented.
Perhaps most importantly, however, the role of IDISA
in supporting a fully systematic treatment of {\em horizontal}
SWAR operations for any application area is discussed.

An implementation model for IDISA is then considered
in section 9 of the paper, focusing on a pipelined
SIMD architecture featuring a modified operand fetch stage.
A gate-count analysis of one feasible implementation is
provided as well as a discussion of the implementation
of required extensions to handle 2-bit and 4-bit fields not
commonly supported on existing SIMD architectures.   Design
tradeoffs are also considered, focusing on the potential removal of
various {\em ad hoc} instructions on existing processors in favor of
more general alternatives provided through IDISA.

The tenth section concludes the paper with a summary of results
and discussion of areas for future work.


\section{Inductive Doubling Architecture}

This section presents an idealized model for a single-instruction
multiple-data (SIMD) instruction set architecture designed
specifically to support inductive doubling algorithms in the
domain of parallel bit stream programming.   The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SIMD applications in other domains.  The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design.  First we introduce a simple model and notation for
SIMD operations in general and then present the four key
features of an idealized architecture in support of parallel
bit stream programming.

The idealized architecture supports typical SIMD integer
operations common to existing commodity architectures such as SSE
and Altivec.   The architecture is defined to support SIMD
operations on registers of size $N=2^K$ bits, for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.   The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$.   Typical values of $k$ used on commodity processors
include $k = 3$ for SIMD operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
register $r$, using big-endian numbering.

Let \verb:simd<n>: represent the class of SIMD operations defined
on fields of size $n$ using C++ template syntax.  Given a
binary function $F_n$ on $n$-bit fields, we denote the SIMD
version of this operation as \verb#simd<n>::F#.  Given two
SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\[ r_i = F_n(a_i, b_i) \]

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
%\doublespace
The SSE and Altivec instruction sets support
each of the addition and subtraction operations in SIMD form
for 8, 16 and 32-bit fields, while the SSE set also includes
the 64-bit forms.  For example, \verb#simd<8>::add# in our
nomenclature is provided by the operation \verb:paddb:
on SSE and the operation \verb:vaddubm: on Altivec.
The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields shift by the same amount.

Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values.  The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers.  Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms on bit streams, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms.  In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width.  In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width.  In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core.  In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1.  These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available.  For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation on SIMD registers.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SIMD
operation.  These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:'').  When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified.  The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions, applying to the operands
in order.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
together with the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.

The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations.  Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.


The final facility of the inductive doubling architecture is
a set of merging operations \verb#simd<n>::mergeh# and
\verb#simd<n>::mergel# that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers.   The respective
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths.  The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.

This completes the description of IDISA.  As described, many of the
features are already available with the SIMD facilities of
existing commodity processors.  The extensions enumerated
here are relatively straightforward.  The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor.  The goal
in this paper is to evaluate these features independent
of any artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures.  For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields.  The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields.  In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1.  We also assume that
a constant load operation \verb#simd::const<n>(c)#
loads the constant value $c$ into each $n$-bit field.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)# rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
operator implements the logic $(a \wedge b) \vee (\neg a \wedge c)$, patterned
after the Altivec \verb:vec_sel: operation.  Finally,
a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

Two versions of IDISA are assessed against these reference
architectures as follows.  IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128.  IDISA-B is
similarly defined and extended based on RefB.  Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption.  On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be well-designed
pipeline processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access.  Such processors are certainly feasible with a
suitable investment in circuitry, although practical designs
may make sacrifices to achieve close to optimal throughput
with less circuitry.  However, the assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.  Furthermore,
the assumption also eliminates artifacts due to possibly different
latencies in reference and IDISA architectures.  Because
the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, the
applications tend to be compute-bound rather than IO-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section.  That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}[h]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  The operation
is important enough, for such applications as calculating Hamming distance,
to be included as a built-in instruction
on some processors.  For example, the SPU of the Cell Broadband Engine
has a SIMD population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields.  Each implementation employs
five steps of inductive doubling to produce population counts
within 32 bit fields.  The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SIMD operation.   Without the IDISA features, a
straightforward RefA implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, it is possible that these constants
could be kept preloaded in long vector processing.  In accord
with our evaluation model, the RefB cost is thus 15 operations.
As the IDISA implementation requires no constants at all,
both the IDISA-A and IDISA-B cost is 5 operations.
At our assumed one CPU cycle per instruction model, IDISA-A
offers a 4X improvement over RefA, while IDISA-B offers a 3X
improvement over its comparator.

The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step.  If there is
one such operation for each operand, and the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations are required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks.  IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$.   Thus a 3X to 4X improvement
is obtained in these cases.

\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of the inductive doubling architecture:
transposition of byte stream data to parallel bit stream
form.   Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process.  Before considering how
the inductive doubling architecture supports this
transposition process, however, we first consider
algorithms on existing architectures.  Two algorithms
are presented; the best of these requires 72
SIMD operations in the three-register model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using the inductive doubling architecture
with a mere 24 three-register SIMD operations.  We also prove
that this is optimal for any three-register instruction set model.


628\begin{figure}[tbh]
629\begin{center}
630\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
631\caption{Input/Output Model for Serial to Parallel Transposition}
632\label{s2p-spec}
633\end{center}
634
635\end{figure}
636Figure \ref{s2p-spec} illustrates the input-output requirements of
637the transposition problem.  We assume that inputs and
638outputs are each SIMD registers of size $N=2^K$ bits.
639The input consists of $N$ bytes of serial byte data,
640stored consecutively in eight SIMD registers each holding
641$N/8$ bytes.  The output consists of eight parallel
642registers, one each for the eight individual bit positions
643within a byte.  Upon completion of the transposition process,
644each output register is to hold the $N$ bits corresponding
645to the selected bit position in the sequence of $N$ input
646bytes.

\subsection{Bit Gathering Algorithm}

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
\caption{Serial to Parallel Transposition Using Bit-Gathering}
\label{gather}
\end{center}
\end{figure}
One straightforward algorithm for implementing the transposition process
takes advantage of SIMD bit gathering operations that exist
on some architectures.  This operation gathers one bit per byte
from a particular position within each byte of a SIMD register.
For example, the {\tt pmovmskb} operation of the Intel MMX and
SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
consisting of the high bit of each byte.  Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte
from the SIMD register.  Figure \ref{gather} illustrates the
bit gathering process.

For each bit stream of output, the bit gather algorithm requires
one gather operation for each of the 8 input registers,
giving 64 bit gather operations in all.  In addition, for seven
of the eight bit positions, it is necessary to shift the bits
of each input register into the conventional position for
gathering.  A total of 56 shift operations are required for this
task.  Finally, the result of each bit gather operation must
be properly inserted into the output stream.  If the first
gather operation for a stream can be used to also initialize
the output register, there will then need to be 7 insert
operations for the results of the remaining gather operations
for this stream, making 56 insert operations in all.
The insert step may be more complex than a single operation
in some cases, but we use one operation per insert as a lower bound.
Thus, the bit gather algorithm requires
at least 176 operations to perform the transposition task.
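As an illustration, the following Python sketch models a {\tt pmovmskb}-style gather together with the shift step that precedes it; the register is modelled simply as a list of byte values, an assumption made for exposition only.

```python
def movmskb(reg):
    # Model of a pmovmskb-style bit gather: collect the high bit of each
    # byte of the register into a mask, first byte in the most
    # significant mask position.
    m = 0
    for b in reg:
        m = (m << 1) | ((b >> 7) & 1)
    return m

def gather_bit(reg, j):
    # To gather bit position j (0 = high bit), each byte must first be
    # shifted left by j -- these are the extra shift operations counted
    # in the text for seven of the eight bit positions.
    return movmskb([(b << j) & 0xFF for b in reg])
```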

\subsection{BytePack Algorithm}

A much more efficient transposition algorithm on commodity
SIMD architectures (SSE and Altivec) involves three
stages of binary division transformation.  This is similar
to the three stage bit matrix inversion described by
Warren  \cite{HackersDelight}, although modified to use SIMD operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions.  The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even-numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage is
divided into the desired individual bit streams in the third stage.
\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
t0 = simd<16>::pack<h,h>(s0, s1);
t1 = simd<16>::pack<l,l>(s0, s1);
p0 = simd_if(simd<16>::const(0xAAAA), t0, simd<16>::srli<1>(t1));
p1 = simd_if(simd<16>::const(0xAAAA), simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking.  In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.  In essence,
doublebytes from the input registers are packed into bytes
in the output registers, with the bits from even positions stored
in the bytes of one output stream and the bits from odd positions
stored in the bytes of the second output stream.
Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
producing two parallel registers \verb:p0: and \verb:p1: from
two serial registers \verb:s0: and \verb:s1:.  This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and masking constants.

Although we have used the idealized SIMD notation here, each of the
operations maps to a single operation in the Altivec set and a small number
of operations in the SSE set.  Using the Altivec set, there are
6 operations per step for a total of 24 operations per stage.
The three stages combined require 72 operations to transpose 128 bytes
to parallel bit stream form.  This is the best algorithm known to
us for existing SIMD architectures.
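The stage 1 step can be checked with a scalar model of a single 16-bit lane, sketched below in Python for verification purposes only.  Here the two pack operations leave the even- and odd-indexed bytes in \verb:t0: and \verb:t1: respectively, and an alternating-bit mask with shifts of 1 then separates even-position from odd-position bits (within each output byte the even bits of the two source bytes appear interleaved; the later stages, with wider masks and shifts of 2 and 4, complete the separation).

```python
def s2p_stage1_lane(a, b, c, d):
    # One 16-bit lane of the stage 1 step applied to four consecutive
    # bytes a, b, c, d.  After the two pack operations, t0 holds the
    # even-indexed bytes (a, c) and t1 the odd-indexed bytes (b, d).
    t0 = (a << 8) | c
    t1 = (b << 8) | d
    mask = 0xAAAA                 # 1 at even bit positions (MSB first)
    inv = mask ^ 0xFFFF
    p0 = (t0 & mask) | ((t1 >> 1) & inv)             # even-position bits
    p1 = (((t0 << 1) & 0xFFFF) & mask) | (t1 & inv)  # odd-position bits
    return p0, p1

def bits(v, n=8):
    # Bit list of an n-bit value, most significant bit first.
    return [(v >> (n - 1 - i)) & 1 for i in range(n)]
```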

\subsection{Inductive Halving Algorithm}

Using the inductive doubling architecture, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the bytepack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bitpairs in a second stage and finally
dividing the bitpair streams into bit streams in the third stage.


\begin{figure}[tbh]
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two SIMD operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte.  As in the bytepack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs.  The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble.  Four applications of this step complete stage 2.

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair.  The complete algorithm to transform eight serial
byte registers \verb:s0: through \verb:s7: into the eight parallel bit stream
registers \verb:bit0: through \verb:bit7: is shown in Figure \ref{halvingalgorithm}.

\begin{figure}[tbh]
\small
\begin{verbatim}
hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
\end{verbatim}
\caption{Complete Inductive Halving Algorithm}
\label{halvingalgorithm}
\end{figure}
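The complete algorithm is easily validated with a scalar model in which a register is a list of bits (most significant first) and pack keeps the high or low half of every field.  The following Python sketch, using $N=16$ for brevity, checks the 24 pack operations of Figure \ref{halvingalgorithm} against the transposition specification.

```python
import random

def pack(n, part, a, b):
    # Model of simd<n>::pack<part,part>: keep the high ('h') or low
    # ('l') half of every n-bit field of register a, then of register b.
    half = n // 2
    out = []
    for r in (a, b):
        for i in range(0, len(r), n):
            f = r[i:i + n]
            out += f[:half] if part == 'h' else f[half:]
    return out

random.seed(1)
# Eight serial registers of N = 16 bits (two bytes each).
s = [[random.randint(0, 1) for _ in range(16)] for _ in range(8)]

hi = [pack(8, 'h', s[2 * i], s[2 * i + 1]) for i in range(4)]
lo = [pack(8, 'l', s[2 * i], s[2 * i + 1]) for i in range(4)]
hh = [pack(4, 'h', hi[2 * i], hi[2 * i + 1]) for i in range(2)]
hl = [pack(4, 'l', hi[2 * i], hi[2 * i + 1]) for i in range(2)]
lh = [pack(4, 'h', lo[2 * i], lo[2 * i + 1]) for i in range(2)]
ll = [pack(4, 'l', lo[2 * i], lo[2 * i + 1]) for i in range(2)]
bit = [pack(2, p, r[0], r[1]) for r in (hh, hl, lh, ll) for p in 'hl']

# Specification: output register j holds bit j of every input byte.
all_bytes = [r[i:i + 8] for r in s for i in (0, 8)]
```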

\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SIMD architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SIMD architecture, we refer to any architecture
that uses SIMD instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition.  Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address.  However, each
operation can move results into at most one register, and can
thus assign at most one correct address bit to each of the $N$
bits it moves.  As all $8N$ input bits
need to be moved to a register with a correct 3-bit address,
a minimum of 24 operations is required.

\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}.  Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone would save about 0.73 of those 15 cycles per byte, improving
end-to-end performance in a complete XML processing
application by more than 4\%.
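The figure follows directly from the quoted measurements; the short calculation below makes the arithmetic explicit, using the approximate cycle counts cited above.

```python
total_cycles = 15.0      # end-to-end cost per input byte (approx.)
transpose_cycles = 1.1   # transposition cost per byte (approx.)
# A 3X speed-up leaves 1/3 of the transposition cost, saving 2/3 of it.
saving = transpose_cycles * (2.0 / 3.0)
improvement = saving / total_cycles
print(round(100 * improvement, 1))   # -> 4.9 (percent)
```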


\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions.  In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed.  In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit.  Those two byte streams
are subsequently merged to form the final result.

Algorithms for performing the inverse transform mirror those
of the forward transform, employing SIMD merge operations
in place of pack operations.  The best algorithm known
to us on the commodity SIMD architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets.  These algorithms take 72 operations to perform the
inverse transposition of 8 parallel registers of bit stream
data into 8 serial registers of byte stream data.

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
bit01_r0 = simd<1>::mergeh(bit0, bit1);
bit01_r1 = simd<1>::mergel(bit0, bit1);
bit23_r0 = simd<1>::mergeh(bit2, bit3);
bit23_r1 = simd<1>::mergel(bit2, bit3);
bit45_r0 = simd<1>::mergeh(bit4, bit5);
bit45_r1 = simd<1>::mergel(bit4, bit5);
bit67_r0 = simd<1>::mergeh(bit6, bit7);
bit67_r1 = simd<1>::mergel(bit6, bit7);
bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
\end{verbatim}
\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
\end{figure}

An algorithm employing only 24 operations using the
inductive doubling instruction set architecture is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bitpair streams to produce streams
of nybbles for the high and low nybble of each byte.  Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data.  The full inductive doubling
algorithm for parallel to serial transposition is shown in Figure
\ref{p2s-inductive}.

This algorithm is again optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model.  The proof directly follows that for the serial to parallel
transposition.
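The merge semantics assumed here can be modelled in scalar form as follows, with registers again represented as lists of bits (most significant first); this is an expository sketch rather than a SIMD implementation.

```python
def merge(n, half, a, b):
    # Model of simd<n>::mergeh (half='h') / simd<n>::mergel (half='l'):
    # interleave the n-bit fields of the high (or low) halves of
    # registers a and b.
    k = len(a) // 2
    ah = a[:k] if half == 'h' else a[k:]
    bh = b[:k] if half == 'h' else b[k:]
    out = []
    for i in range(0, k, n):
        out += ah[i:i + n] + bh[i:i + n]
    return out
```

For example, \verb#simd<1>::mergeh# of the 4-bit streams \verb:1011: and \verb:0100: interleaves their high halves bit by bit, yielding \verb:1001:.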

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.


\section{Parallel Bit Deletion}

\begin{figure*}[tbh]
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form.  It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
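The specification can be modelled directly on a single stream; the Python sketch below treats streams and masks as strings for exposition.

```python
def delete_by_mask(stream, delmask):
    # Parallel deletion modelled on one stream: drop the elements at
    # positions where the deletion mask is 1, retain those where it is 0.
    return ''.join(ch for ch, m in zip(stream, delmask) if m == '0')
```

Applying the same mask to each of a set of parallel streams deletes corresponding element positions from all of them at once.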

Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process.  However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream.  Using the inductive doubling instruction set architecture
it is possible to carry out bit deletion much more efficiently.
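For reference, the parallel-prefix compress algorithm is sketched below in Python, following Warren's formulation for a single $w$-bit stream.  Note that its mask selects the bits to {\em keep}, so a deletion mask as defined above is complemented before use.

```python
def compress(x, m, w=32):
    # Parallel-prefix compress (after Warren): move the bits of x
    # selected by 1s in m to the low end of a w-bit word.  A deletion
    # mask as defined above marks bits to delete, so pass its
    # complement as m.
    M = (1 << w) - 1
    x &= m
    mk = (~m << 1) & M                   # count 0s to the right
    for i in range(w.bit_length() - 1):  # log2(w) deletion steps
        mp = mk
        for j in (1, 2, 4, 8, 16):       # parallel suffix of mk
            if j < w:
                mp ^= (mp << j) & M
        mv = mp & m                      # bits to move this step
        m = (m ^ mv) | (mv >> (1 << i))
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))
        mk &= ~mp & M
    return x
```

The nested loop over shift amounts is the $k^2$ preprocessing cost noted above; the four operations on \verb:m: and \verb:x: in each step are the per-stream deletion cost.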

Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.

The observation that two $n$-bit central deletion results can
be combined to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm.  Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results.  The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field.  Similarly,
left-justification requires a left shift by the population count in the
high half of the field.

The left and right shifts can be performed simultaneously using a rotate
left instruction.  Right justification by shifting an $n$ bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions.  These rotation amounts are computed by the operation
\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
except that don't care fields (which won't be subsequently used)
are marked \verb:XX:.
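The shift/rotate equivalence is easily illustrated in scalar form; in the sketch below the value \verb:0b00111100: stands in for the central result \verb:00bcgh00: with the retained bits set to 1 (an illustrative assumption), and two don't-care positions at the right end give $i=2$.

```python
def rotl(v, r, n=8):
    # Left rotate of an n-bit value by r positions.
    r %= n
    return ((v << r) | (v >> (n - r))) & ((1 << n) - 1)

central = 0b00111100                     # stands in for 00bcgh00
right_justified = rotl(central, 8 - 2)   # rotl by n - i right-justifies
left_justified = rotl(central, 2)        # rotl by i left-justifies
```

Because the positions rotated around the end of the field are don't-care (zeroed) positions, the rotate behaves exactly like the desired shift.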

The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
as shown in row 6, and are combined with the right shift amounts
by the selection operation  \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
as shown in row 7.  Using these computed values, the inductive step
is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.

Each inductive doubling level requires 4 operations to compute the
required deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture has far fewer operations.
The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
induction versus $4k^2$ for the parallel-prefix method.  Using the computed
deletion information, only a single SIMD rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.


\section{Beyond Parallel Bit Streams}

It can be argued that text processing with parallel bit
streams by itself is not sufficiently important to justify
the circuit complexity of IDISA.  In this section, then,
we move on to consider further applications that may
benefit from IDISA features.  These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
Further applications for narrow field widths are discussed
as well, such as packed DNA representations using 2-bit
field widths and packed decimal representations using 4-bit
field widths. Most significantly, however, we show that IDISA offers
a fully general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Inductive Doubling with Logic Operations}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >>16);
y = y & 1;
\end{verbatim}
\end{center}
\caption{Parity Reference Algorithm}
\label{HD-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = x - 1;
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
x = x + 1;
\end{verbatim}
\end{center}
\caption{Power-of-2 Ceiling Reference Algorithm}
\label{HD-clp2}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x,h>(x, x);
x = simd<4>::or<x,h>(x, x);
x = simd<8>::or<x,h>(x, x);
x = simd<16>::or<x,h>(x, x);
x = simd<32>::or<x,h>(x, x);
x = simd<32>::add(x, simd<32>::const(1));
\end{verbatim}
\end{center}
\caption{IDISA Power-of-2 Ceiling Implementation}
\label{ID-clp2}
\end{figure}

Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
code for two further inductive doubling examples using
logical operators as the combining feature.  In the
first of these, the ``exclusive or'' operation is applied
to all bits in a 32-bit value to determine parity.
Parity has important applications for error-correcting
codes such as the various Hamming codes for detecting
and correcting numbers of bit errors dependent on the
number of parity bits added.  Warren's version uses
11 operations for parity of a single 32-bit value;
a straightforward SWAR adaptation also uses 11 operations
for the parity of a set of 32-bit fields.

Figure \ref{ID-parity} shows the IDISA parity implementation
with only 5 operations required.  This represents slightly
more than a 2X improvement.  The improvement is less than
the 3X seen in other cases because, in the reference algorithm,
one of the operands need not be modified before the
exclusive-or operation is applied.

Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
code for the least power-of-2 ceiling of a 32-bit value.  The
technique is to employ inductive doubling to fill in one bits at
all bit positions after the first one bit in the original value to
first produce a value of the form $2^n-1$.  This is then incremented
to determine the power-of-2 ceiling.  This function and
its straightforward adaptation for SWAR application on a
set of 32-bit fields requires 12 operations.   The
corresponding IDISA implementation of Figure \ref{ID-clp2}
requires 7 operations, just under a 2X improvement.
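Both reference algorithms run directly as ordinary integer code; the Python versions below reproduce Figures \ref{HD-parity} and \ref{HD-clp2} for 32-bit values, with results masked to width.

```python
def parity32(x):
    # XOR-reduce all 32 bits of x down to a single parity bit.
    for s in (1, 2, 4, 8, 16):
        x ^= x >> s
    return x & 1

def clp2(x):
    # Least power-of-2 ceiling: starting from x - 1, fill in 1s to the
    # right of the leading 1 bit, then increment.
    x = (x - 1) & 0xFFFFFFFF
    for s in (1, 2, 4, 8, 16):
        x |= x >> s
    return (x + 1) & 0xFFFFFFFF
```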

\subsection{Packed DNA Representation}

DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented as packed
sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
allow these packed representations to be quickly computed
from byte-oriented string values by two steps of inductive
halving.   Similarly, conversion back to string form
can use two steps of inductive merging.  Without direct
support for these pack and merge operations, the SWAR
implementations of these conversions require the cost
of explicit masking and shifting in combination with
the 16-bit to 8-bit packing and 8-bit to 16-bit
merging operations supported by existing SWAR facilities
on commodity processors.
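As a sketch of the forward conversion, the scalar model below assumes the four codes have already been mapped into the low bits of each byte (the mapping A=0, C=1, G=2, T=3 is an illustrative assumption); two \verb#pack<l,l># halving steps then reduce bytes to 2-bit units.

```python
def pack_ll(n, a, b):
    # Model of simd<n>::pack<l,l> with registers represented as lists
    # of field values: keep the low n/2 bits of every n-bit field of
    # register a, then of register b.
    mask = (1 << (n // 2)) - 1
    return [v & mask for v in a + b]

code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}   # illustrative 2-bit encoding
r = [[code[ch] for ch in grp] for grp in ("ACGT", "TGCA", "GGCC", "ATAT")]

nyb0 = pack_ll(8, r[0], r[1])     # bytes   -> nybbles
nyb1 = pack_ll(8, r[2], r[3])
two_bit = pack_ll(4, nyb0, nyb1)  # nybbles -> 2-bit units
```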

\subsection{String/Decimal/Integer Conversion}
\begin{figure}
\begin{center}\small
\begin{verbatim}
b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b = (b & 0x0000FFFF) + 10000 * (b >> 16);
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c10 = simd<8>::const(10);
c100 = simd<16>::const(100);
c10000 = simd<32>::const(10000);
b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer Implementation}
\label{ID-BCD2int}
\end{figure}

Just as DNA sequences represent an important use case for
SWAR operations on 2-bit fields, packed sequences of
decimal or hexadecimal digits represent a common use case
for 4-bit fields.  These representations can be used
both as an intermediate form in numeric string to integer
conversion and as a direct representation for
packed binary coded decimal.

Figure \ref{BCD2int} shows a three-step inductive
doubling implementation for conversion of 32-bit packed BCD
values to integer form.  The 32-bit value consists
of eight 4-bit decimal digits.  Pairs of digits are
first combined by multiplying the higher digit
of the pair by 10 and adding.  Pairs of these
two-digit results are then further combined by
multiplying the value of the higher of the two-digit
results by 100 and adding.  The final step is
to combine four-digit results by multiplying the
higher one by 10000 and adding.  Overall, 20
operations are required for this implementation
as well as the corresponding SWAR implementation
for sets of 32-bit fields.  Preloading of 6 constants
into registers for repeated use can reduce the number of
operations to 14 at the cost of significant register
pressure.
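The reference conversion runs directly as integer code; note that each step operates on the result of the previous one.  A Python check:

```python
def bcd2int(d):
    # Inductive doubling conversion of a 32-bit packed BCD value (eight
    # 4-bit digits) to binary: digits, then digit pairs, then four-digit
    # groups are combined, each step feeding the next.
    b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
    b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF)
    b = (b & 0x0000FFFF) + 10000 * (b >> 16)
    return b
```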

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}.  This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each of the addition and multiplication operations
modified at each level.  Overall, this implementation
requires 9 operations, or 6 operations with 3
preloaded constants.  This represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.


\subsection{Systematic Support for Horizontal SIMD Operations}

In SIMD parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers.  Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing a horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs a horizontal add of sets of
four unsigned 8-bit fields within one register
and then combines the result horizontally with
corresponding 32-bit fields of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is applied and whether it is applied
before or after the horizontal operation and what the
nature of the vertical combining operation is.
Within this space, commodity SIMD architectures tend
to support only a very few combinations, without any
particular attempt at systematic support for horizontal
operations in general.

In contrast to this {\em ad hoc} support on commodity
processors, IDISA offers a completely systematic treatment
of horizontal operations without any special features beyond
the inductive doubling features already described.
In the simplest case, any vertical operation
\verb#simd<n>::F# on $n$-bit fields gives rise to
an immediate horizontal operation
\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
while \verb#simd<32>::min<h,l># can produce the
minimum value of adjacent 16-bit fields.
Thus any binary horizontal operation can be implemented
in a single IDISA instruction making use of the \verb:<h,l>:
operand modifier combination.
1295
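As an illustrative sketch (not the IDISA library itself), the following Python model captures the \verb#simd<16>::add<h,l>(r, r)# semantics on a 64-bit register, assuming half-operands are zero-extended to the full field width before the vertical add is applied.

```python
# Behavioral model of simd<n>::add<h,l>(r, r) on an N-bit register:
# within each n-bit field, add the high n/2 bits (zero-extended) to
# the low n/2 bits, producing an n-bit sum.
def simd_add_hl(r, n, N=64):
    half = n // 2
    field_mask = (1 << n) - 1
    result = 0
    for pos in range(0, N, n):
        field = (r >> pos) & field_mask
        h = field >> half              # high half-operand, zero-extended
        l = field & ((1 << half) - 1)  # low half-operand
        result |= ((h + l) & field_mask) << pos
    return result

# Adjacent 8-bit fields are pairwise added into 16-bit results:
# 0x01+0x02, 0x03+0x04, 0x05+0x06, 0x07+0x08.
assert simd_add_hl(0x0102030405060708, 16) == 0x0003_0007_000B_000F
```

The same loop structure models any other binary combining operation (e.g., min) by replacing the addition.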
Horizontal combinations of four adjacent fields can also be
realized in a general way through two steps of inductive
doubling.  For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation
on four 32-bit fields.  This four-field combination
can easily be implemented with the following two operations.
%\begin{singlespace}
\begin{verbatim}
t = simd<64>::or<h,l>(x, x)
t = simd<128>::or<h,l>(t, t)
\end{verbatim}
%\end{singlespace}

In general, systematic support for horizontal
combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
in a similar fashion.
Thus, IDISA essentially offers systematic support
for horizontal operations entirely through the
use of \verb:<h,l>: half-operand modifier
combinations.

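This general pattern can be sketched in software: the model below (an illustration, not the IDISA library) performs an or-across of four 32-bit fields on a 128-bit register with $h = 2$ doubling steps, mirroring the \verb:si_orx: example.

```python
# Behavioral model of simd<n>::or<h,l>(r, r): within each n-bit field,
# or the high n/2 bits (zero-extended) with the low n/2 bits.
def simd_or_hl(r, n, N=128):
    half = n // 2
    field_mask = (1 << n) - 1
    half_mask = (1 << half) - 1
    result = 0
    for pos in range(0, N, n):
        field = (r >> pos) & field_mask
        result |= ((field >> half) | (field & half_mask)) << pos
    return result

def or_across(x, fields=4, N=128):
    """Or together 2**h adjacent fields via h inductive doubling steps."""
    n = 2 * (N // fields)          # first step pairs adjacent fields
    while n <= N:
        x = simd_or_hl(x, n, N)
        n *= 2
    return x

# Four 32-bit fields 0x1, 0x20, 0x300, 0x4000 or-ed into the low field.
x = (0x1 << 96) | (0x20 << 64) | (0x300 << 32) | 0x4000
assert or_across(x) & 0xFFFFFFFF == 0x4321
```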
Systematic support for general horizontal operations
under IDISA also creates an opportunity for a design tradeoff:
offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SIMD operations.
Even if legacy support for these operations is required,
it may be possible to provide that support through
software or firmware rather than a full hardware
implementation.  Evaluation of these possibilities
in the context of particular architectures is a potential
area for further work.


\section{Implementation}

We have carried out implementation work for IDISA in three
ways.  First, we have constructed libraries that
implement the IDISA instructions by template and/or macro
expansion for each of the MMX, SSE, Altivec, and SPU platforms.
Second, we have developed a model implementation
involving a modified operand fetch component
of a pipelined SIMD processor.  Third, we have written
and evaluated a Verilog HDL description of this model
implementation.

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template
and macro libraries has been useful in developing
and assessing the correctness of many of the algorithms
presented here.  Although these implementations do not
deliver the performance benefits associated with
direct hardware implementation of IDISA, they
have been quite useful in providing a practical means
for portable implementation of parallel bit stream
algorithms on multiple SWAR architectures.  However,
one additional facility has also proven necessary for
portability of parallel bit stream algorithms across
big-endian and little-endian architectures: the
notion of shift-forward and shift-back operations.
In essence, shift forward means shift to the left
on little-endian systems and shift to the right on
big-endian systems, while shift back has the reverse
interpretation.  Although this concept is unrelated to
inductive doubling, its inclusion with the IDISA
libraries has provided a suitable basis for portable
SIMD implementations of parallel bit stream algorithms.
Beyond this, the IDISA libraries have the additional
benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without the need for programmer coding of
the underlying shift and mask operations.

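The shift-forward/shift-back notion can be sketched as follows; this is an illustration only (parameterized by endianness rather than resolved at compile time as a library would do), with hypothetical function names.

```python
# Shift toward higher stream positions: a left shift on little-endian
# layouts, a right shift on big-endian layouts.
def shift_forward(x, bits, endian, N=128):
    mask = (1 << N) - 1
    if endian == 'little':
        return (x << bits) & mask
    return x >> bits

# Shift back has the reverse interpretation on each layout.
def shift_back(x, bits, endian, N=128):
    return shift_forward(x, bits, 'big' if endian == 'little' else 'little', N)

# The same source-level call maps to opposite machine shifts:
assert shift_forward(0x1, 4, 'little') == 0x10
assert shift_forward(0x10, 4, 'big') == 0x1
assert shift_back(0x10, 4, 'little') == 0x1
```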
\subsection{IDISA Model}
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}

Figure \ref{pipeline-model} shows a block diagram
for a pipelined SIMD processor implementing IDISA.
The SIMD Register File (SRF) provides a file of $R = 2^A$
registers each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch
Unit (IFU) are forwarded for decoding to the SIMD
Instruction Decode Unit (SIDU).  This unit decodes
the instruction to produce
signals identifying the source and destination
operand registers, the half-operand modifiers, the
field width specification and the SIMD operation
to be applied.

The SIDU supplies the source register information and the half-operand
modifier information to the SIMD Operand Fetch Unit (SOFU).
For each source operand, the SIDU provides an $A$-bit register
address and two 1-bit signals $h$ and $l$ indicating the value
of the decoded half-operand modifiers for this operand.
Only one of these values may be 1; both are 0 if
no modifier is specified.
In addition, the SIDU supplies decoded field width information
to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).
The SIDU also supplies decoded SIMD opcode information to the SIEU and
a decoded $A$-bit register address for the destination register to
the SIMD Result Write Back Unit (SRWBU).

The SOFU is the key component of the IDISA model that
differs from that found in a traditional SWAR
processor.  For each of the two $A$-bit source
register addresses, the SOFU is first responsible for
fetching the raw operand values from the SRF.
Then, before supplying operand values to the
SIEU, the SOFU applies the half-operand modification
logic as specified by the $h$, $l$, and field-width
signals.  The possibly modified operand values are then
provided to the SIEU for carrying out the SIMD operations.
A detailed model of SOFU logic is described in the following
subsection.

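The half-operand modification applied by the SOFU can be sketched behaviorally as follows (a software model for illustration, not the gate-level design): the $h$ signal selects the high $n/2$ bits of each $n$-bit field, moved down to the low half, while $l$ selects the low $n/2$ bits; with both signals 0 the operand passes through unchanged.

```python
# Behavioral model of SOFU half-operand modification for one operand.
def sofu_modify(operand, h, l, n, N=64):
    assert not (h and l), "at most one modifier signal may be 1"
    if not (h or l):
        return operand                 # no modifier specified
    half = n // 2
    # mask selecting the low n/2 bits of every n-bit field
    low_halves = 0
    for pos in range(0, N, n):
        low_halves |= ((1 << half) - 1) << pos
    shifted = operand >> half if h else operand
    return shifted & low_halves

# For 8-bit fields on a 32-bit operand: h keeps the high nibbles
# (shifted down), l keeps the low nibbles.
assert sofu_modify(0x12345678, h=1, l=0, n=8, N=32) == 0x01030507
assert sofu_modify(0x12345678, h=0, l=1, n=8, N=32) == 0x02040608
```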
The SIEU differs from similar execution units in
current commodity processors primarily by providing
SIMD operations at each field width
$n=2^k$ for $0 \leq k \leq K$.  This involves
additional circuitry for field widths not supported
in existing processors.  For inductive doubling
algorithms in support of parallel bit streams,
the principal need is for additional circuitry to
support 2-bit and 4-bit field widths.  This circuitry
is generally less complicated than that for larger
fields.  Support for circuitry at these widths
has other applications as well.   For example,
DNA sequences are frequently represented using
packed sequences of 2-bit codes for the four possible
nucleotides\cite{}, while the need for accurate financial
calculation has seen a resurgence of the 4-bit
packed BCD format for decimal floating point \cite{}.

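A small illustration of the utility of 2-bit field operations: under the hypothetical packing A=0b00, C=0b01, G=0b10, T=0b11 (an assumption for this sketch), the Watson-Crick complement of every nucleotide in a packed register is a single xor on 2-bit fields.

```python
# Complement each packed 2-bit nucleotide code (A<->T, C<->G) by
# xor-ing every 2-bit field with 0b11, i.e., xor with all-ones.
def complement(packed, N=64):
    return packed ^ ((1 << N) - 1)

# "ACGT" packs to 0b00011011; its complement "TGCA" is 0b11100100.
assert complement(0b00011011, N=8) == 0b11100100
```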
When execution of the SWAR instruction is
completed, the result value is then provided
to the SRWBU to update the value stored in the
SRF at the address specified by the $A$-bit
destination operand.

\subsection{Operand Fetch Unit Logic}

Discussion of gate-level implementation.


\section{Conclusions}

In considering the architectural support for
SIMD text processing using the method of parallel bit streams,
this paper has presented the principle of inductive doubling
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA features support for SWAR operations at all power-of-2
field widths, including 2-bit and 4-bit widths in particular,
as well as half-operand modifiers and pack and merge operations
to support efficient transitions between successive power-of-2
field widths.  The principal innovation is the notion of
half-operand modifiers that eliminate the cost associated
with the explicit mask and shift operations required for
such transitions.

Several algorithms key to parallel bit stream methods
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on existing architectures.  In the case
of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with the
lowest total number of operations that can be achieved by
any possible 3-register SWAR architecture.

Applications of IDISA in other areas have also been
examined.  The support for 2-bit and 4-bit field widths
in SWAR processing is beneficial for packed DNA representations
and packed decimal representations respectively.  Additional
inductive doubling examples are presented and the phenomenon
of power-of-2 transitions discussed more broadly.
Most significantly, IDISA supports a fully general approach
to horizontal SWAR operations, offering a considerable
improvement over the {\em ad hoc} sets of special-purpose
horizontal operations found in existing SWAR instruction sets.

An IDISA implementation model has been presented employing
a customized operand fetch unit to implement the half-operand
modifier logic.  Gate-level implementation of this unit
has been analyzed and shown to be quite reasonable.

Many special-purpose operations that are found in existing
processors could instead be implemented through efficient
IDISA sequences.  These include examples such as population
count, count of leading and/or trailing zeroes, and parity.
They also include specialized horizontal SIMD operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special-purpose instructions
in favor of more general IDISA features.

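The population count example can be sketched as six inductive doubling steps, each the software equivalent of a \verb#simd<n>::add<h,l>(x, x)# at doubling field widths; this is a well-known mask-and-shift formulation, shown here to illustrate the pattern an IDISA sequence would follow.

```python
# 64-bit population count by inductive doubling: at step k, each
# 2^(k+1)-bit field becomes the sum of the counts in its two halves.
def popcount64(x):
    masks = [0x5555555555555555, 0x3333333333333333,
             0x0F0F0F0F0F0F0F0F, 0x00FF00FF00FF00FF,
             0x0000FFFF0000FFFF, 0x00000000FFFFFFFF]
    for k, m in enumerate(masks):
        shift = 1 << k                     # 1, 2, 4, 8, 16, 32
        x = (x & m) + ((x >> shift) & m)   # add low and high halves
    return x

assert popcount64(0xFFFFFFFFFFFFFFFF) == 64
assert popcount64(0b10110010) == 4
```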
Other tradeoffs may be possible in IDISA implementation itself.
Full support of IDISA features at the largest field widths
is not necessary in many cases.   For example, a given 128-bit
SIMD facility may support IDISA features only up to 32-bit
or 64-bit fields, similar to the current Altivec and SSE
architectures, respectively.   It may also be possible to
reduce the complexity of operand fetch circuitry by a factor
of two by dedicating one operand to a possible high half-operand
modifier and the other to a possible low half-operand modifier.

Future research may consider the extension of inductive doubling
support in further ways.   For example, it may be possible
to develop a pipelined architecture supporting two or three
steps of inductive doubling in a single operation.

%\appendix
%\section{Appendix Title}
%
%This is the text of the appendix, if you need one.

\acks

This research was supported by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada.

%\bibliographystyle{plainnat}
\bibliographystyle{plain}
\bibliography{asplos094-cameron}
%\begin{thebibliography}{}

%\bibitem{smith02}
%Smith, P. Q. reference text

%\end{thebibliography}


\end{document}