1%-----------------------------------------------------------------------------
2%
3% asplos094-cameron.tex
4% Robert D. Cameron and Dan Lin
5%
6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
7%
8%-----------------------------------------------------------------------------
9\input epsf
10
11%\documentclass[preprint,natbib,10pt]{sigplanconf}
12%\documentclass[natbib,10pt]{sigplanconf}
13\documentclass[10pt]{sigplanconf}
14
15\usepackage{amsmath}
16\usepackage{graphicx}
17
18\begin{document}
19
20\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
21\CopyrightYear{2009}
22\copyrightdata{978-1-60558-215-3/09/03}
23
24
25\titlebanner{banner above paper title}        % These are ignored unless
26\preprintfooter{short description of paper}   % 'preprint' option specified.
27
28\title{Architectural Support for SWAR Text Processing with Parallel Bit
29Streams: The Inductive Doubling Principle}
30%\subtitle{Subtitle Text, if any}
31
32\authorinfo{Robert D. Cameron \and Dan Lin}
33           {School of Computing Science, Simon Fraser University}
34           {\{cameron, lindanl\}@cs.sfu.ca}
35
36
37\maketitle
38
39\begin{abstract}
40Parallel bit stream algorithms exploit the SWAR (SIMD within a
41register) capabilities of commodity
42processors in high-performance text processing applications such as
43UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
44matching.   Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task. A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
48doubling as an algorithmic technique.  These extensions are shown to
49significantly reduce instruction count in core parallel bit stream algorithms,
50often providing a 3X or better improvement.  The extensions are also shown to be useful
51for SIMD programming in other application areas, including providing a
52systematic treatment for horizontal operations.  An implementation model for
53these extensions
54involves relatively simple circuitry added to the operand fetch components
55in a pipelined processor.
56\end{abstract}
57
58\category{CR-number}{subcategory}{third-level}
59
60\terms
61term1, term2
62
63\keywords
64keyword1, keyword2
65
66
67
68\section{Introduction}
69
70In the landscape of parallel computing research, finding ways to
71exploit intrachip (multicore) and intraregister (SWAR) parallelism
72for text processing and other non-numeric applications is particularly
73challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
74study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing as the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  The
study even speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle with traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.
81
82One approach that shows some promise, however, is the
83method of parallel bit streams, recently applied to
84UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
and amino acid sequencing \cite{Green}.
86In this method, byte-oriented character data is first transposed to eight
87parallel bit streams, one for each bit position within the character code
88units (bytes).  Loading bit stream data into 128-bit SIMD registers,
89then, allows data from 128 consecutive code units to be represented and
90processed at once.  Bitwise logic and shift operations, bit scans,
91population counts and other bit-based operations are then used to carry
92out the work.
93
The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
speed-up using parallel bit stream techniques on SIMD-capable uniprocessors
96employing the SSE or Altivec instruction sets.  A full implementation
97for each of these platforms is documented using literate programming techniques
98and available as an open source code base \cite{u8u16}.
99Although this transcoding task represents but a single text processing kernel,
100it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
of processing time \cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
102it illustrates a text processing task that can largely
103be carried out using SIMD operations even though
104UTF-8 character sequences are variable length. Indeed, one weakness of the
105actual implementation is that it reverts to byte-at-a-time processing
106at block boundaries, employing a block shortening strategy to
107reduce block lengths to as low as 125 bytes from 128.
108With more capable SIMD instruction set architectures providing
109better facilities for multiregister shifts and larger register
110files, a solution employing SIMD techniques for virtually all
111of the main work would maintain better data alignment, avoid
112problems of register pressure and be easier to parallelize across
multiple cores.  It should also scale naturally to 256-bit SIMD technology
such as Intel's forthcoming AVX extensions.
115
116The report addressing the broader problem of XML parsing is perhaps more
117interesting, demonstrating the utility of parallel bit stream techniques
118in delivering performance benefits through a significant portion of the
119web technology stack.  In an XML statistics gathering application,
120including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with similarly
coded applications using such well-known parsers as Expat and Xerces.
Although still a work in progress, the parser has functional
coverage of XML and related specifications comparable to, but somewhat
beyond, that of Expat.   The study also identifies promising further
127work in extending the parallel bit stream methods to parallel
128hash value computation and parallel regular expression matching
129for the purpose of validating XML datatype declarations in
130accord with XML Schema.
131
132Given these promising initial results in the application of
133parallel bit stream methods, what role might architectural support play in
134further enhancing this route to parallelization of text processing?
135This paper addresses this question through presentation and
136analysis of a constructive proposal: a set of SIMD instruction set
137features based on the principle of systematic support for
138inductive doubling algorithms.  Inductive doubling refers
139to a general property of certain kinds of algorithm that
140systematically double the values of field widths or other
141data attributes with each iteration.  In essence, the goal
142of the proposed features is to support such algorithms
143with specific facilities to  transition between successive power-of-2 field
widths.   These transitions are quite frequent in several critical
algorithms for parallel bit streams, and they also
occur in other application areas.  In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported to be quite useful for a number of SIMD kernels \cite{Nuzman}.
149The specific features presented herein will be referred to
150as IDISA: inductive doubling instruction set architecture.
151
152To assess the benefits of IDISA features in comparison to
153those found in existing SIMD support on commodity
154processors, we will focus on an idealized three-register
155model of SIMD instruction sets and evaluation of
156parallel bit stream and other computational kernels with
157respect to these features.  The three-register model is
158based on the general approach to binary SIMD operations
159as ones that operate on the contents of two source operand
160registers to produce a value to be written back to a
161single destination operand register.  This three-register
162model is directly used by the Altivec SIMD instructions
163of the Power PC, for example.  On the Intel SSE platform,
164the three-register model is used as a programmer's
165interface for the C-language intrinsics, while the
166underlying instructions use a two-register model with
destructive updating.  The computational kernels we study consist
of 100\% branch-free code operating on registers, without
memory access.  For such kernels, we use straight-line
instruction count as the performance metric of interest,
assuming that pipelined processors are well designed
to handle latencies.
173
174
175The remainder of this paper is organized as follows.
176
177The second section of this paper introduces IDISA and the
178SIMD notation used throughout this paper.  A brief comparison of
179IDISA features with existing SIMD instruction
180sets of commodity processors such as the Intel SSE
181instruction set and the Power PC Altivec instruction set
182is also made.
183
184The third section provides a short first example of
185the inductive doubling principle in action through
186the case of population count.   Although this operation
187is not a strong determinant of performance for parallel bit
188stream applications, it is nevertheless an operation needed
frequently enough in the general computing milieu to find
190its way into some instruction set architectures, typically
191at one particular field width.  By way of comparison, the
192inductive doubling architecture sacrifices some
193performance at the chosen field width, while offering a more
194general solution with frequently better performance at
195other field widths.
196
197The fourth section then moves on to consider the performance-critical
198and key task of conversion between serial byte streams and parallel
199bit streams.  A first algorithm that uses the existing SIMD
200operations common to SSE and Altivec is shown, requiring 72
201operations to transform 128 bytes of data using the three-register
202instruction form.  We then move on to consider how the task may
203be simplified using IDISA to
204require a mere 24 operations.  As well as providing a 3X speed-up,
205it is also argued that the version using the inductive doubling
206architecture is considerably simpler and easier to program.
207
208The fifth section then briefly considers the inverse transposition
209process, converting parallel bit stream data back into byte
210streams.  Again, an algorithm to carry out this task requires
21172 straight-line SIMD operations in the Altivec three-register
212form, but is reduced to a simpler 24 operations using IDISA.
213
214The sixth section then goes on to consider the problem of
215parallel bit deletion.  This operation is performance-critical
216to any applications that require filtering or
217editing operations on strings using the parallel bit stream
218algorithms.   For example, it is fundamental to the
219high-speed UTF-8 to UTF-16 transcoding algorithm that is
220often a critical component in XML parsing.  In this
221section, an inductive doubling algorithm based on the
222concept of a central deletion result is described and
223shown to have much better performance than a parallel-prefix
224alternative.
225
226The seventh section considers the issue of
227horizontal SIMD operations, that is, operations for combining
228sets of adjacent fields within individual SIMD registers rather than
229corresponding fields within sets of registers.  While existing
230SIMD instruction set architectures tend to only support a few
231ad hoc horizontal combinations, IDISA is shown to provide a systematic
232means for efficient horizontal combinations of any kind.
233
234An implementation model for IDISA is then considered
235in section 8 of the paper, focusing on a pipelined
236SIMD architecture featuring a modified operand fetch stage.
237A gate-count analysis of one feasible implementation is
238provided as well as a discussion of the implementation
239of required extensions to handle 2-bit and 4-bit fields not
commonly supported on existing SIMD architectures.   Design
tradeoffs are also considered, focusing on the potential removal of
various {\em ad hoc} instructions on existing processors in favor of
more general alternatives provided through IDISA.
244
245The ninth section concludes the paper with a summary of results
246and discussion of areas for future work.
247
248
249\section{Inductive Doubling Architecture}
250
251This section presents an idealized model for a single-instruction
252multiple-data (SIMD) instruction set architecture designed
253specifically to support inductive doubling algorithms in the
254domain of parallel bit stream programming.   The architecture is idealized
255in the sense that we concentrate on only the necessary features
256for our purpose, without enumerating the additional
257operations that would be required for
258SIMD applications in other domains.  The goal is to focus
259on the principles of inductive doubling support in a way
260that can accommodate a variety of realizations as other
261design constraints are brought to bear on the overall instruction set
262design.  First we introduce a simple model and notation for
263SIMD operations in general and then present the four key
264features of an idealized architecture in support of parallel
265bit stream programming.
266
267The idealized architecture supports typical SIMD integer
268operations common to existing commodity architectures such as SSE
269and Altivec.   The architecture is defined to support SIMD
270operations on registers of size $N=2^K$ bits, for some integer $K$.
271Typical values of $K$ for commodity processors include $K=6$ for
272the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
273the 128-bit registers of SSE and Altivec technology and $K=8$ for
274the upcoming Intel AVX technology.   The registers may be
275partitioned into $N/n$ fields of width $n=2^k$ bits for some values
276of $k \leq K$.   Typical values of $k$ used on commodity processors
277include $k = 3$ for SIMD operations on 8-bit fields (bytes),
278$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
register $r$, using big-endian numbering.
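
As a point of reference, the following C fragment (a
sketch of our own, not part of the proposed architecture)
models this field numbering for a 64-bit register.
\begin{verbatim}
#include <stdint.h>

/* Model of r_n[i] on a 64-bit register r:
   field 0 occupies the most significant bits
   (big-endian field numbering).  Sketch only. */
uint64_t field(uint64_t r, unsigned n, unsigned i)
{
  unsigned shift = 64 - (i + 1) * n;
  uint64_t mask = (n == 64) ? ~0ULL
                            : ((1ULL << n) - 1);
  return (r >> shift) & mask;
}
\end{verbatim}
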
283
284Let \verb:simd<n>: represent the class of SIMD operations defined
285on fields of size $n$ using C++ template syntax.  Given a
286binary function $F_n$ on $n$-bit fields, we denote the SIMD
287version of this operation as \verb#simd<n>::F#.  Given two
288SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
289respectively, the operation \verb#r=simd<n>::F(a,b)# stores
290the value $r$ in the register \verb:r: as determined by
291the simultaneous calculation of individual field values in
292accord with the following equation.
\[ r_n[i] = F_n(a_n[i], b_n[i]) \]
294
For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
296logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
297integers as follows.
298%\singlespace
299\begin{eqnarray}
300\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
301\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
302\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
303\end{eqnarray}
304%\doublespace
305The SSE and Altivec instruction sets support
306each of the addition and subtraction operations in SIMD form
307for 8, 16 and 32-bit fields, while the SSE set also includes
308the 64-bit forms.  For example, \verb#simd<8>::add# in our
309nomenclature is provided by the operation \verb:paddb:
310on SSE and the operation \verb:vaddubm: on Altivec.
311The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields be shifted by the same amount.
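
To make the field-wise semantics concrete, a scalar C
model of \verb#simd<n>::add# over a 64-bit register might
be written as follows; this is an illustrative sketch,
not an actual intrinsic.
\begin{verbatim}
#include <stdint.h>

/* Scalar model of r = simd<n>::add(a, b): each
   n-bit field is summed independently, modulo
   2^n.  Sketch only. */
uint64_t simd_add(uint64_t a, uint64_t b, unsigned n)
{
  uint64_t mask = (n == 64) ? ~0ULL
                            : ((1ULL << n) - 1);
  uint64_t r = 0;
  for (unsigned s = 0; s < 64; s += n)
    r |= (((a >> s) + (b >> s)) & mask) << s;
  return r;
}
\end{verbatim}
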
314
315Given these definitions and notation, we now present
316the four key elements of an inductive doubling architecture.
317The first is a definition of a core set of binary functions
318on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
319The second is a set of {\em half-operand modifiers} that allow
320the inductive processing of fields of size $2n$ in terms of
321combinations of $n$-bit values selected from the fields.
322The third is the definition of packing operations that compress
323two consecutive registers of $n$-bit values into a single
324register of $n/2$-bit values.  The fourth is the definition
325of merging operations that produce a set of $2n$ bit fields
326by concatenating corresponding $n$-bit fields from two
327parallel registers.  Each of these features is described below.
328
329For the purpose of direct and efficient support for inductive
330doubling algorithms on bit streams, the provision of
331a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms.  In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width.  In order for this inductive step to be
339most conveniently and efficiently expressed, the
340core operations needed for the step should be available
341at each field width.  In the case of work with parallel
342bit streams, the operations \verb:add:, \verb:sub:,
343\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
344(rotate left) comprise the core.  In other domains,
345additional operations may be usefully included in the
346core depending on the work that needs to be performed
347at each inductive doubling level.
348
349Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
350also includes fields of width 1.  These are included for
351logical consistency, but are easily implemented by mapping
352directly to appropriate bitwise logic operations, which
353we assume are also available.  For example,
354\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
355bitwise exclusive-or operation on SIMD registers.
356
357The second key facility of the inductive doubling architecture
358is the potential application of half-operand modifiers to
359the fields of either or both of the operands of a SIMD
360operation.  These modifiers select either the
361low $n/2$
362bits of each $n$-bit field (modifier ``\verb:l:'') or the
363high $n/2$ bits (modifier ``\verb:h:'').  When required,
364the modifier ``\verb:x:'' means that the full $n$ bits
365should be used, unmodified.  The semantics of these
366modifiers are given by the following equations.
367%\singlespace
368\begin{eqnarray}
369l(r_n) & = & r_n \bmod 2^{n/2} \\
370h(r_n) & = & r_n / 2^{n/2} \\
371x(r_n) & = & r_n
372\end{eqnarray}
373%\doublespace
374In our notation, the half-operand modifiers are
375specified as optional template (compile-time) parameters
376for each of the binary functions.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
to the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.
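
As an illustration of the modifier semantics, a scalar C
sketch of \verb#simd<4>::add<h,l>(a,b)# on a 32-bit
register follows; the helper is ours and is intended only
to make the half-operand selection concrete.
\begin{verbatim}
#include <stdint.h>

/* Model of simd<4>::add<h,l>(a, b) on 32 bits:
   the high 2 bits of each 4-bit field of a are
   added to the low 2 bits of the corresponding
   field of b.  Sketch only. */
uint32_t add4_h_l(uint32_t a, uint32_t b)
{
  uint32_t r = 0;
  for (unsigned s = 0; s < 32; s += 4) {
    uint32_t ha = (a >> (s + 2)) & 0x3;  /* h */
    uint32_t lb = (b >> s) & 0x3;        /* l */
    r |= ((ha + lb) & 0xF) << s;
  }
  return r;
}
\end{verbatim}
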
387
388The third facility of the inductive doubling architecture
389is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
390The field values of \verb#r=simd<n>::pack(a,b)# are
391defined by the following equations.
392%\singlespace
393\begin{eqnarray}
394r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
395r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
396\end{eqnarray}
397%\doublespace
398Here conv is a function which performs conversion of an $n$-bit
399value to an $n/2$ bit value by signed saturation (although
400conversion by unsigned saturation would also suit our purpose).
401
402Half-operand modifiers may also be used with the pack
operations.  Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.
406
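
A scalar sketch of this truncating form of pack on 32-bit
registers is shown below; the model is ours, with field 0
as the most significant field, as defined above.
\begin{verbatim}
#include <stdint.h>

/* Model of r = simd<n>::pack<l,l>(a, b) on
   32-bit registers: the low n/2 bits of each
   n-bit field of a, then of b, are concatenated
   in field order.  Sketch only. */
uint32_t pack_l_l(uint32_t a, uint32_t b, unsigned n)
{
  unsigned h = n / 2, f = 32 / n;
  uint32_t m = (1u << h) - 1, r = 0;
  for (unsigned i = 0; i < f; i++)
    r = (r << h) | ((a >> (32 - (i+1)*n)) & m);
  for (unsigned i = 0; i < f; i++)
    r = (r << h) | ((b >> (32 - (i+1)*n)) & m);
  return r;
}
\end{verbatim}
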
407
408The final facility of the inductive doubling architecture is
409a set of merging operations \verb#simd<n>::mergeh# and
410\verb#simd<n>::mergel# that produce $2n$-bit fields
411by concatenating corresponding $n$-bit fields from the
412operand registers.   The respective
413operations \verb#r=simd<n>::mergeh(a,b)# and
414\verb#s=simd<n>::mergel(a,b)#
415are defined by the following equations.
416%\singlespace
417\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
420\end{eqnarray}
421%\doublespace
422Both SSE and Altivec provide versions of pack and merge operations
423for certain field widths.  The pack operations are provided
424with operands having 16-bit or 32-bit fields on each platform, although
425with some variation in how conversion is carried out.
426The merge operations are provided at 8-bit, 16-bit and 32-bit
427field widths on both architectures and also at the 64-bit level
428on SSE.
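
For concreteness, a scalar C sketch of
\verb#simd<n>::mergeh# on 32-bit registers is given below,
mirroring the defining equations above; it is our own
model, not an existing instruction.
\begin{verbatim}
#include <stdint.h>

/* Model of simd<n>::mergeh(a, b), n <= 16, on
   32-bit registers: corresponding n-bit fields
   from the high halves of a and b are interleaved
   into 2n-bit fields, with a's field in the high
   n bits.  Sketch only. */
uint32_t mergeh(uint32_t a, uint32_t b, unsigned n)
{
  uint32_t m = (1u << n) - 1, r = 0;
  for (unsigned i = 0; i < 16 / n; i++) {
    uint32_t ai = (a >> (32 - (i+1)*n)) & m;
    uint32_t bi = (b >> (32 - (i+1)*n)) & m;
    r |= ((ai << n) | bi) << (32 - (i+1)*2*n);
  }
  return r;
}
\end{verbatim}
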
429
430This completes the description of the proposed inductive
431doubling architecture.  As described, many of the features
432are already available with the SIMD facilities of
433existing commodity processors.  The extensions enumerated
434here are relatively straightforward.  The innovation
435is to specifically tackle the design of facilities to
436offer systematic support for transitions between power-of-2 field widths.
437As we shall show in the remainder of this paper, these facilities
438can dramatically reduce instruction count in core parallel bit
439stream algorithms, with a factor of 3 reduction being typical.
440The next section goes on to illustrate a first simple example
441for the task of population count.
442
443\section{Population Count}
444
445\begin{figure}
446\begin{center}\small
447\begin{verbatim}
448c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
449c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
450c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
451c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
452c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
453\end{verbatim}
454\end{center}
455\caption{Population Count Reference Algorithm}
456\label{HD-pop}
457\end{figure}
458
459\begin{figure}
460\begin{center}\small
461\begin{verbatim}
462c = simd<2>::add<h,l>(x, x);
463c = simd<4>::add<h,l>(c, c);
464c = simd<8>::add<h,l>(c, c);
465c = simd<16>::add<h,l>(c, c);
466c = simd<32>::add<h,l>(c, c);
467\end{verbatim}
468\end{center}
469\caption{Ideal SIMD Population Count}
470\label{inductivepopcount}
471\end{figure}
472
473As an initial example to illustrate the principle of inductive doubling
474in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  It is important
enough, for applications such as
calculating Hamming distance, to be included as a built-in instruction
478on some processors.  For example, the SPU of the Cell Broadband Engine
479has a SIMD population count instruction \verb:si_cntb: for simultaneously
480determining the
481number of 1 bits within each byte of a 16-byte register.
482In text processing with parallel bit streams, population count has direct
483application to keeping track of line numbers for error reporting, for example.
484Given a bit block identifying the positions of newline characters within
485a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
487the line number upon completion of block processing.
488
489Figure \ref{HD-pop} presents a traditional divide-and-conquer
490implementation for a 32-bit integer {\tt x} adapted from
491Warren \cite{HackersDelight}, while
492Figure \ref{inductivepopcount} shows the corresponding IDISA
493implementation for a vector of 32-bit fields.  Each implementation employs
494five steps of inductive doubling to produce population counts
495within 32 bit fields.  The traditional implementation employs
496explicit masking and shifting operations, while these
497operations are implicit within the semantics of the inductive
498doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines
the population counts within 2-bit fields
501by adding the high bit of each such field to the low bit
502to produce a set of 2-bit counts in {\tt c}.
503In the second step, the counts within 4-bit fields of {\tt c} are determined
504by adding the counts of the corresponding high and low 2-bit subfields.
505Continuing in this fashion,
506the final population counts within 32-bit fields are determined in five steps.
507
508With the substitution of longer mask constants replicated for four
50932-bit fields, the implementation of Figure \ref{HD-pop} can be
510directly adapted to SWAR processing using 128-bit registers.
511Each binary operator is replaced by a corresponding binary
512SIMD operation.   Without the IDISA features, a
513straightforward SWAR implementation of population count for
51432-bit fields thus employs 10 operations to load or generate
515mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
516total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
51820 operations, 5 of which are required to generate mask constants.
519At the cost of register pressure, it is possible that these constants
could be kept preloaded across long stretches of vector processing.
In any case, the IDISA implementation requires only 5 operations
to carry out work that requires 15 to 20 operations with the
optimized version using standard SWAR operations.   IDISA thus
offers a 3X to 4X improvement in instruction count as well as
a reduction in register pressure.
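
For comparison, the optimized scalar form is essentially
the {\tt pop} function given by Warren \cite{HackersDelight};
the SWAR version replaces each operator with its SIMD
counterpart on 32-bit fields.
\begin{verbatim}
/* Optimized 32-bit population count in the
   style of Warren's pop(). */
unsigned pop32(unsigned x)
{
  x = x - ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x + (x >> 4)) & 0x0F0F0F0F;
  x = x + (x >> 8);
  x = x + (x >> 16);
  return x & 0x3F;
}
\end{verbatim}
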
526
527The pattern illustrated by population count is typical.  An
inductive doubling algorithm of $n$ steps typically applies
a mask or shift operation at each step to each of the
two operands being combined in the step.   If the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations is the minimum required for the SWAR
algorithm without IDISA features.   An IDISA implementation
534typically eliminates the explicit mask and shift operations
535through appropriate half-operand modifiers, reducing the
536total instruction count to $n$.   Thus a 3X improvement
537is typical.
538
539\section{Transposition to Parallel Bit Streams}
540
541In this section, we consider the first major
542application of the inductive doubling architecture:
543transposition of byte stream data to parallel bit stream
544form.   Of course, this operation is critical to the
545method of parallel bit streams and all applications
546of the method can benefit from a highly efficient
547transposition process.  Before considering how
548the inductive doubling architecture supports this
549transposition process, however, we first consider
550algorithms on existing architectures.  Two algorithms
551are presented; the best of these requires 72
552SIMD operations in the three-register model to perform
553transposition of eight serial registers of byte stream data into
554eight parallel registers of bit stream data.
555
556We then go on to show how the transposition problem
557can be solved using the inductive doubling architecture
558with a mere 24 three-register SIMD operations.  We also prove
559that this is optimal for any three-register instruction set model.
560
561
562\begin{figure}[tbh]
563\begin{center}
564\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
565\caption{Input/Output Model for Serial to Parallel Transposition}
566\label{s2p-spec}
567\end{center}
568
569\end{figure}
570Figure \ref{s2p-spec} illustrates the input-output requirements of
571the transposition problem.  We assume that inputs and
572outputs are each SIMD registers of size $N=2^K$ bits.
573The input consists of $N$ bytes of serial byte data,
574stored consecutively in eight SIMD registers each holding
575$N/8$ bytes.  The output consists of eight parallel
576registers, one each for the eight individual bit positions
577within a byte.  Upon completion of the transposition process,
578each output register is to hold the $N$ bits corresponding
579to the selected bit position in the sequence of $N$ input
580bytes.
581
582\subsection{Bit Gathering Algorithm}
583
584\begin{figure}[tbh]
585\begin{center}
586\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
587\caption{Serial to Parallel Transposition Using Bit-Gathering}
588\label{gather}
589\end{center}
590\end{figure}
591One straightforward algorithm for implementing the transposition process
592takes advantage of SIMD bit gathering operations that exist
593on some architectures.  This operation gathers one bit per byte
594from a particular position within each byte of a SIMD register.
595For example, the {\tt pmovmskb} operation of the Intel MMX and
596SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
597consisting of the high bit of each byte.  Similarly, the
598{\tt \verb:si_gbb:} operation of the synergistic processing units of the
599Cell Broadband Engine gathers together the low bit of each byte
600from the SIMD register.  Figure \ref{gather} illustrates the
601bit gathering process.
602
603For each bit stream of output, the bit gather algorithm requires
604one gather operation for each of the 8 input registers,
605giving 64 bit gather operations in all.  In addition, for seven
606of the eight bit positions, it is necessary to shift the bits
607of each input register into the conventional position for
608gathering.  A total of 56 shift operations are required for this
609task.  Finally, the result of each bit gather operation must
610be properly inserted into the output stream.  If the first
611gather operation for a stream can be used to also initialize
612the output register, there will then need to be 7 insert
613operations for the results of the remaining gather operations
614for this stream, making 56 insert operations in all.
615The insert step may be more complex than a single operation
616in some cases, but we use one operation per insert as a lower bound.
617Thus, the bit gather algorithm requires
618at least 176 operations to perform the transposition task.
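
As a concrete illustration of the gather and shift steps
(not taken from the cited implementations), an SSE2 sketch
using the {\tt pmovmskb} intrinsic might look as follows;
position 0 denotes the most significant bit of each byte,
following the numbering used in this paper.
\begin{verbatim}
#include <emmintrin.h>
#include <stdint.h>

/* Gather bit position k (0 = high bit) from
   each of the 16 bytes of v.  The 16-bit shift
   lets bits cross byte boundaries, but pmovmskb
   reads only the high bit of each byte, so this
   is harmless.  Sketch only. */
uint16_t gather_bit(__m128i v, int k)
{
  if (k > 0)
    v = _mm_slli_epi16(v, k);
  return (uint16_t) _mm_movemask_epi8(v);
}
\end{verbatim}
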
619
620\subsection{BytePack Algorithm}
621
622A much more efficient transposition algorithm on commodity
623SIMD architectures (SSE and Altivec) involves three
624stages of binary division transformation.  This is similar
625to the three stage bit matrix inversion described by
626Warren  \cite{HackersDelight}, although modified to use SIMD operations.
627In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions.  The two output streams from the first
630stage are then further divided in the second stage.
631The stream comprising even numbered bits from the original byte stream
632divides into one stream consisting of bits from positions 0 and 4 of each
633byte in the original stream and a second stream consisting of bits
634from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from
positions 1 and 5 and bits from positions 3 and 7.
637Finally, each of the four streams resulting from the second stage are
638divided into the desired individual bit streams in the third stage.
639
640\begin{figure}[tbh]
641\begin{center}\small
642\begin{verbatim}
643t0 = simd<16>::pack<h,h>(s0, s1);
644t1 = simd<16>::pack<l,l>(s0, s1);
p0 = simd_if(simd<8>::const(0xAA), t0, simd<16>::srli<1>(t1));
p1 = simd_if(simd<8>::const(0xAA), simd<16>::slli<1>(t0), t1);
647\end{verbatim}
648\end{center}
649\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
650\label{s2pstep}
651\end{figure}
652
653The binary division transformations are accomplished in each stage
654using byte packing, shifting and masking.  In each stage, a
655transposition step combines each pair of serial input registers to
656produce a pair of parallel output registers.  In essence,
657doublebytes from the input registers are packed into bytes
658in the output registers, with the bits from even positions stored
659in the bytes of one output stream and the bits from odd positions
660stored in the bytes of the second output stream.
661Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
662producing two parallel registers \verb:p0: and \verb:p1: from
663two serial registers \verb:s0: and \verb:s1:.  This step is applied
664four times in stage 1; stages 2 and 3 also consist of four applications
665of a similar step with different shift and masking constants.
666
667Although we have used the idealized SIMD notation here, each of the
668operations maps to a single operation in the Altivec set and a small number
669of operations in the SSE set.  Using the Altivec set, there are
6706 operations per step for a total of 24 operations per stage. 
The three stages combined require 72 operations to transpose 128 bytes
672to parallel bit stream form.  This is the best algorithm known to
673us for existing SIMD architectures.
674
675\subsection{Inductive Halving Algorithm}
676
677Using the inductive doubling architecture, it is possible to design
678a transposition algorithm that is both easier to understand and requires
many fewer operations than the bytepack algorithm described above.
680We call it the inductive halving algorithm for serial to parallel
681transposition, because it proceeds by reducing byte streams to
682two sets of nybble streams in a first stage, dividing the nybble
683streams into streams of bitpairs in a second stage and finally
684dividing the bitpair streams into bit streams in the third stage.
685
686
687\begin{figure}[tbh]
688\small
689\begin{verbatim}
690p0 = simd<8>::pack<h,h>(s0, s1);
691p1 = simd<8>::pack<l,l>(s0, s1);
692\end{verbatim}
693\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
694\label{halvingstep}
695\end{figure}
696
697Figure \ref{halvingstep} shows one step in stage 1 of the inductive
698halving algorithm, comprising just two SIMD operations.
699The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
700from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
701the low nybble of each byte.  As in the bytepack algorithm, this step is
702applied 4 times in stage 1, for a total of 8 operations.
703
704Stage 2 of the inductive halving algorithm reduces nybble streams
705to streams of bit pairs.  The basic step in this algorithm consists
706of one \verb#simd<4>::pack<h,h># operation to extract the high pair
707of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
708low pair of each nybble.  Four applications of this step complete stage 2.
709
710Stage 3 similarly uses four applications of a step that uses a
711\verb#simd<2>::pack<h,h># operation to extract the high bit of
712each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
713each pair.  The complete algorithm to transform eight serial
714byte registers s0 through s7 into the eight parallel bit stream
715registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
716
717\begin{figure}[tbh]
718\small
719\begin{verbatim}
720hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
721lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
722hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
723lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
724hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
725lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
726hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
727lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
728hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
729hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
730lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
731ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
732hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
733hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
734lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
735ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
736bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
737bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
738bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
739bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
740bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
741bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
742bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
743bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
744\end{verbatim}
745\caption{Complete Inductive Halving Algorithm}
746\label{halvingalgorithm}
747\end{figure}
748
749\subsection{Optimality of the Inductive Halving Algorithm}
750
751Here we show that the inductive halving algorithm presented in
752the previous subsection is optimal in the following sense:
753no other algorithm on any 3-register SIMD architecture can use
754fewer than 24 operations to transform eight serial registers
755of byte stream data into eight parallel registers of bit stream data.
756By 3-register SIMD architecture, we refer to any architecture
757that uses SIMD instructions consistent with our overall model of
758binary operations using two input register operands to produce
759one output register value.
760
761Observe that the $N$ data bits from each input register must be
762distributed $N/8$ each to the 8 output registers by virtue of
763the problem definition.  Each output register can effectively
764be given a 3-bit address; the partitioning problem can be viewed
765as moving data to the correct address.  However, each
766operation can move results into at most one register. 
767At most this can result in the assignment of one correct address
768bit for each of the $N$ input bits.  As all $8N$ input bits
769need to be moved to a register with a correct 3-bit address,
770a minimum of 24 operations is required.
771
772\section{Parallel to Serial Conversion}
773
774Parallel bit stream applications may apply string editing
775operations in bit space to substitute, delete or insert
776parallel sets of bits at particular positions.  In such cases,
777the inverse transform that converts a set of parallel bit
778streams back into byte space is needed.  In the example of
779UTF-8 to UTF-16 transcoding, the inverse transform is
780actually used twice for each application of the forward
781transform, to separately compute the high and low byte
782streams of each UTF-16 code unit.  Those two byte streams
are subsequently merged to form the final result.
784
785Algorithms for performing the inverse transform mirror those
786of the forward transform, employing SIMD merge operations
787in place of pack operations.  The best algorithm known
788to us on the commodity SIMD architectures takes advantage
789of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
790operations that are available with each of the SSE and Altivec instruction
791sets.  These algorithms take 72 operations to perform the
792inverse transposition of 8 parallel registers of bit stream
793data into 8 serial registers of byte stream data. 
794
795\begin{figure}[tbh]
796\begin{center}\small
797\begin{verbatim}
798bit01_r0 = simd<1>::mergeh(bit0, bit1);
799bit01_r1 = simd<1>::mergel(bit0, bit1);
800bit23_r0 = simd<1>::mergeh(bit2, bit3);
801bit23_r1 = simd<1>::mergel(bit2, bit3);
802bit45_r0 = simd<1>::mergeh(bit4, bit5);
803bit45_r1 = simd<1>::mergel(bit4, bit5);
804bit67_r0 = simd<1>::mergeh(bit6, bit7);
805bit67_r1 = simd<1>::mergel(bit6, bit7);
806bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
807bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
808bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
809bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
810bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
811bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
812bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
813bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
814s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
815s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
816s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
817s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
818s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
819s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
820s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
821s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
822\end{verbatim}
823\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
826\end{figure}
827
828An algorithm employing only 24 operations using the
829inductive doubling instruction set architecture is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to the bitpair streams to produce streams
838of nybbles for the high and low nybble of each byte.  Finally,
839the nybble streams are merged in stage 3 to produce the
840desired byte stream data.  The full inductive doubling
841algorithm for parallel to serial transposition is shown in Figure
842\ref{p2s-inductive}.
843
844This algorithm is again optimal, requiring the fewest operations
845of any possible algorithm using any 3-register instruction set
846model.  The proof directly follows that for the serial to parallel
847transposition.
848
849The existence of high-performance algorithms for transformation of
850character data between byte stream and parallel bit stream form
851in both directions makes it possible to consider applying these
852transformations multiple times during text processing applications.
853Just as the time domain and frequency domain each have their
854use in signal processing applications, the byte stream form and
855parallel bit stream form can then each be used at will in
856character stream applications.
857
858
859
860
861
862\section{Parallel Bit Deletion}
863
864\begin{figure*}[tbh]
865\begin{center}
866\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
867\hline
868\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
869\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
870\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
871\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
872\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
873\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
874\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
875\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
876\end{tabular}
877\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
880\end{figure*}
881
882Parallel bit deletion is an important operation that allows string
883editing operations to be carried out while in parallel bit stream
884form.  It is also fundamental to UTF-8 to UTF-16 transcoding
885using parallel bit streams, allowing the excess code unit
886positions for UTF-8 two-, three- and four-byte sequences to
887be deleted once the sixteen parallel bit streams of UTF-16 have
888been computed \cite{PPoPP08}.
889
890Parallel bit deletion is specified using a deletion mask.
891A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
892to be deleted and 0s at positions identifying bits to be retained.
893For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
894bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
896
897Bit deletion may be performed using
898the parallel-prefix compress algorithm documented by
899Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
900only logic and shifts with a constant parameter to carry
901out the deletion process.  However, it requires $k^2$
902preprocessing steps for a final field width parameter
903of size $2^k$, as well as 4 operations per deletion step
904per stream.  Using the inductive doubling instruction set architecture
905it is possible to carry out bit deletion much more efficiently.
906
Deletion within fixed-size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
909eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
910care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
911left-justified result produces an important intermediate form known as a
912{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
913right-justified and left-justified results from the application of the
9144-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
915stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.
918
919
920The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
922for an inductive doubling algorithm.  Figure \ref{centraldelstep}
923illustrates the inductive process for the transition from 8-bit central
924deletion results to 16-bit central deletion results.  The top row shows
925the original deletion mask, while the second row shows the original
926bit stream to which deletions are to be applied, with deleted bits zeroed out.
927The third row shows the central result for each 8-bit field as the
928result of the previous inductive step.
929
930To perform the $8 \rightarrow 16$ central deletion step, we first form
931the population counts of 4-bit fields of the original deletion mask as
932shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the deletion mask in the low half of the field.  Similarly,
left-justification requires a left shift by the population count of the
deletion mask in the high half of the field.
937
938The left and right shifts can be performed simultaneously using a rotate
939left instruction.  Right justification by shifting an $n$ bit field
940$i$ positions to the right is equivalent to a left rotate of $n-i$
941positions.  These rotation amounts are computed by the operation
942\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
943except that don't care fields (which won't be subsequently used)
944are marked \verb:XX:. 
945
946The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
947as shown in row 6, and are combined with the right shift amounts
948by the selection operation  \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
949as shown in row 7.  Using these computed values, the inductive step
950is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
951as shown in row 8.
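
The following scalar C sketch (ours, following the example
of Figure \ref{centraldelstep}) may help clarify the
rotation step for a single 16-bit group; {\tt hi8} and
{\tt lo8} are the two 8-bit central results and the
{\tt cts} arguments are the relevant 4-bit deletion counts.
\begin{verbatim}
#include <stdint.h>

static uint8_t rotl8(uint8_t v, unsigned r)
{
  r &= 7;
  return (uint8_t)
    ((v << r) | (v >> ((8 - r) & 7)));
}

/* One 8 -> 16 central deletion step.  cts_hl is
   the deletion count of the low half of the
   first byte's mask; cts_lh that of the high
   half of the second byte's mask.  Sketch only. */
uint16_t central_8_to_16(uint8_t hi8, uint8_t lo8,
                         unsigned cts_hl,
                         unsigned cts_lh)
{
  uint8_t rj = rotl8(hi8, 8 - cts_hl);
  uint8_t lj = rotl8(lo8, cts_lh);
  return (uint16_t)((rj << 8) | lj);
}
\end{verbatim}
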
952
At each inductive doubling level, 4 operations are required to compute the
deletion information and one operation per bit stream to perform deletion.
955Note that, if deletion is to be applied to a set of eight parallel bit streams,
956the computed deletion information is used for each stream without recomputation,
957thus requiring 12 operations per inductive level.
958
959In comparison to the parallel-prefix compress method, the method of central
960deletion results using the inductive doubling architecture has far fewer operations.
961The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
962induction versus $4k^2$ for the parallel-prefix method.  Using the computed
deletion information, only a single SIMD rotate operation per bit stream
964per level is needed, in comparison with 4 operations per level for parallel-prefix
965compress.
966
967
968
969\section{Beyond Parallel Bit Streams}
970
971It can be argued that text processing with parallel bit
972streams by itself is not sufficiently important to justify
the circuit complexity of IDISA.  In this section, then,
we move on to consider further applications that may
benefit from IDISA features.  These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
978Further applications for narrow field widths are discussed
979as well, such as packed DNA representations using 2-bit
980field widths and packed decimal representations using 4-bit
981field widths. Most significantly, however, we show that IDISA offers
982a fully general solution to the problem of supporting
983{\em horizontal} SWAR operations.
984
985\subsection{Inductive Doubling with Logic Operations}
986
987\begin{figure}
988\begin{center}\small
989\begin{verbatim}
990y = x ^ (x >> 1);
991y = y ^ (y >> 2);
992y = y ^ (y >> 4);
993y = y ^ (y >> 8);
994y = y ^ (y >>16);
995y = y & 1;
996\end{verbatim}
997\end{center}
998\caption{Parity Reference Algorithm}
999\label{HD-parity}
1000\end{figure}
1001
1002\begin{figure}
1003\begin{center}\small
1004\begin{verbatim}
1005x = x - 1;
1006x = x | (x >> 1);
1007x = x | (x >> 2);
1008x = x | (x >> 4);
1009x = x | (x >> 8);
1010x = x | (x >>16);
1011x = x + 1;
1012\end{verbatim}
1013\end{center}
1014\caption{Power-of-2 Ceiling Reference Algorithm}
1015\label{HD-clp2}
1016\end{figure}
1017
1018\begin{figure}
1019\begin{center}\small
1020\begin{verbatim}
1021y = simd<2>::xor<h,l>(x, x);
1022y = simd<4>::xor<h,l>(y, y);
1023y = simd<8>::xor<h,l>(y, y);
1024y = simd<16>::xor<h,l>(y, y);
1025y = simd<32>::xor<h,l>(y, y);
1026\end{verbatim}
1027\end{center}
1028\caption{IDISA Parity Implementation}
1029\label{ID-parity}
1030\end{figure}
1031
1032\begin{figure}
1033\begin{center}\small
1034\begin{verbatim}
1035x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x,h>(x, x);
x = simd<4>::or<x,h>(x, x);
x = simd<8>::or<x,h>(x, x);
x = simd<16>::or<x,h>(x, x);
x = simd<32>::or<x,h>(x, x);
1041x = simd<32>::add(x, simd<32>::const(1));
1042\end{verbatim}
1043\end{center}
1044\caption{IDISA Power-of-2 Ceiling Implementation}
1045\label{ID-clp2}
1046\end{figure}
1047
1048Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
1049code for two further inductive doubling examples using
1050logical operators as the combining feature.  In the
1051first of these, the ``exclusive or'' operation is applied
1052to all bits in a 32-bit value to determine parity.
1053Parity has important applications for error-correcting
1054codes such as the various Hamming codes for detecting
1055and correcting numbers of bit errors dependent on the
1056number of parity bits added.  Warren's version uses
105711 operations for parity of a single 32-bit value;
1058a straightforward SWAR adaptation also uses 11 operations
1059for the parity of a set of 32-bit fields.
1060
1061Figure \ref{ID-parity} shows the IDISA parity implementation
1062with only 5 operations required.  This represents slightly
more than a 2X improvement.  The improvement is less than the
3X seen in other cases because one of the operands need
1065not be modified before applying the exclusive-or operation.
1066
1067Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
1068code for the least power-of-2 ceiling of a 32-bit value.  The
1069technique is to employ inductive doubling to fill in one bits at
1070all bit positions after the first one bit in the original value to
1071first produce a value of the form $2^n-1$.  This is then incremented
1072to determine the power of 2 ceiling.  This function and
1073its straightforward adaptation for SWAR application on a
1074set of 32-bit fields requires 12 operations.   The
1075corresponding IDISA implementation of Figure \ref{ID-clp2}
1076requires 7 operations, just under a 2X improvement.
1077
1078\subsection{Packed DNA Representation}
1079
1080DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented
as packed sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
1085allow these packed representations to be quickly computed
1086from byte-oriented string values by two steps of inductive
1087halving.   Similarly, conversion back to string form
1088can use two steps of inductive merging.  Without direct
1089support for these pack and merge operations, the SWAR
1090implementations of these conversions require the cost
1091of explicit masking and shifting in combination with
1092the 16-bit to 8-bit packing and 8-bit to 16-bit
1093merging operations supported by existing SWAR facilities
1094on commodity processors.
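
As a rough sketch of the conventional cost (not drawn from
any particular implementation), packing four byte-held
2-bit codes into a single byte requires a masking and
shifting sequence such as the one below; under IDISA the
same effect is obtained across whole registers with one
\verb#simd<8>::pack<l,l># and one \verb#simd<4>::pack<l,l>#
step.
\begin{verbatim}
#include <stdint.h>

/* Pack four 2-bit codes, one per byte of w (in
   the low 2 bits of each byte), into one byte.
   The masks and shifts here are what the IDISA
   pack operations make unnecessary. */
uint8_t pack_codes(uint32_t w)
{
  w &= 0x03030303u;
  w = (w | (w >> 6)) & 0x000F000Fu;
  w = (w | (w >> 12)) & 0x000000FFu;
  return (uint8_t) w;
}
\end{verbatim}
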
1095
1099
1100\subsection{String/Decimal/Integer Conversion}
1101\begin{figure}
1102\begin{center}\small
1103\begin{verbatim}
1104b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b = (b & 0x0000FFFF) + 10000 * (b >> 16);
1107\end{verbatim}
1108\end{center}
1109\caption{BCD to Integer Reference Algorithm}
1110\label{BCD2int}
1111\end{figure}
1112
1113\begin{figure}
1114\begin{center}\small
1115\begin{verbatim}
c10 = simd<8>::const(10);
c100 = simd<16>::const(100);
c10000 = simd<32>::const(10000);
1119b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
1120b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
1121b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
1122\end{verbatim}
1123\end{center}
\caption{IDISA BCD to Integer Implementation}
1125\label{ID-BCD2int}
1126\end{figure}
1127
1128Just as DNA sequences represent an important use case for
1129SWAR operations on 2-bit fields, packed sequences of
1130decimal or hexadecimal digits represent a common use case
1131for 4-bit fields.  These representations can be used
1132both as an intermediate form in numeric string to integer
1133conversion and as a direct representation for
1134packed binary coded decimal.
1135
1136Figure \ref{BCD2int} shows a three-step inductive
1137doubling implementation for conversion of 32-bit packed BCD
1138values to integer form.  The 32-bit value consists
of eight 4-bit decimal digits.  Pairs of digits are
1140first combined by multiplying the higher digit
1141of the pair by 10 and adding.  Pairs of these
1142two-digit results are then further combined by
1143multiplying the value of the higher of the two-digit
1144results by 100 and adding.  The final step is
1145to combine four-digit results by multiplying the
1146higher one by 10000 and adding.  Overall, 20
1147operations are required for this implementation
1148as well as the corresponding SWAR implementation
1149for sets of 32-bit fields.  Preloading of 6 constants
1150into registers for repeated use can reduce the number of
1151operations to 14 at the cost of significant register
1152pressure.

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}.  This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each of the addition and multiplication operations
modified at each level.  Overall, this implementation
requires 9 operations, or 6 operations with 3
preloaded constants.  This represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.

\subsection{Systematic Support for Horizontal SIMD Operations}

In SIMD parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers.  Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing a horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs a horizontal add of sets of
four unsigned 8-bit fields within one register
and then adds each result to the corresponding
32-bit field of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is also applied, whether that vertical
combination comes before or after the horizontal operation,
and what the vertical combining operation is.
Within this space, commodity SIMD architectures tend
to support only a few combinations, with no
particular attempt at systematic support for horizontal
operations in general.

In contrast to this {\em ad hoc} support on commodity
processors, IDISA offers a completely systematic treatment
of horizontal operations without any special features beyond
the inductive doubling capabilities already described.
In the simplest case, any vertical operation
\verb#simd<n>::F# on $n$-bit fields gives rise to
an immediate horizontal operation
\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
while \verb#simd<32>::min<h,l># can produce the
minimum value of adjacent 16-bit fields.
Thus any binary horizontal operation can be implemented
in a single IDISA instruction making use of the \verb:<h,l>:
operand modifier combination.
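
As a point of reference, the following scalar sketch (not part
of the IDISA specification) models the computation that
\verb#simd<32>::add<h,l>(r, r)# performs within a single 32-bit
field, assuming unsigned interpretation of the half-fields:
\begin{verbatim}
#include <stdint.h>

/* Model of simd<32>::add<h,l>(r, r) restricted to one 32-bit
   field: the <h>-modified operand supplies the high 16 bits
   (shifted down) and the <l>-modified operand supplies the low
   16 bits, so the result is the sum of the adjacent 16-bit
   subfields. */
uint32_t hadd_field32(uint32_t field) {
  return (field >> 16) + (field & 0xFFFFu);
}
\end{verbatim}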

Horizontal combinations of four adjacent fields can also be
realized in a general way through two steps of inductive
doubling.  For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation across
four 32-bit fields.  This four-field combination
can easily be implemented with the following two operations.
%\begin{singlespace}
\begin{verbatim}
t = simd<64>::or<h,l>(x, x);
t = simd<128>::or<h,l>(t, t);
\end{verbatim}
%\end{singlespace}

In general, systematic support for horizontal
combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
in a similar fashion.
Thus, IDISA essentially offers systematic support
for horizontal operations entirely through the
use of \verb:<h,l>: half-operand modifier
combinations.
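
As a scalar illustration of this principle (a sketch only, not
IDISA code), the following routine combines $2^h = 16$ byte values
in $h = 4$ rounds of pairwise addition:
\begin{verbatim}
#include <stdint.h>

/* Horizontal combination of 2^h adjacent fields in h doubling
   rounds, illustrated for h = 4 with addition as the combining
   operation. */
uint32_t horizontal_sum16(const uint8_t v[16]) {
  uint32_t t[16];
  for (int i = 0; i < 16; i++) t[i] = v[i];
  for (int n = 16; n > 1; n /= 2)   /* one round per doubling step */
    for (int i = 0; i < n / 2; i++)
      t[i] = t[2*i] + t[2*i + 1];
  return t[0];
}
\end{verbatim}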

Systematic support for general horizontal operations
under IDISA also creates an opportunity for a design tradeoff:
offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SIMD operations.
Even if legacy support for these operations is required,
it may be possible to provide that support through
software or firmware rather than a full hardware
implementation.  Evaluation of these possibilities
in the context of particular architectures is a potential
area for further work.

\begin{figure*}
\begin{center}
\begin{tabular}{|c||c|c|c|}
\hline
Kernel & Altivec ops & IDISA ops & Speed-up  \\ \hline
pop\_count$<$32$>$ &  & 5n & 3X  \\ \hline
bit\_reverse$<$32$>$ & & & \\ \hline
Gray2binary$<$32$>$ & 10n & 5n &  2X\\ \hline
\end{tabular}
\end{center}
\caption{Performance Results}
\label{perftable}
\end{figure*}


\section{Implementation}

We have carried out implementation work for IDISA in three
ways.  First, we have constructed libraries that
implement the IDISA instructions by template and/or macro
expansion for each of the MMX, SSE, Altivec, and SPU platforms.
Second, we have developed a model implementation
involving a modified operand fetch component
of a pipelined SIMD processor.  Third, we have written
and evaluated a Verilog HDL description of this model
implementation.

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template
and macro libraries has been useful in developing
and assessing the correctness of many of the algorithms
presented here.  Although these implementations do not
deliver the performance benefits associated with
direct hardware implementation of IDISA, they
have been quite useful in providing a practical means
for portable implementation of parallel bit stream
algorithms on multiple SWAR architectures.  However,
one additional facility has also proven necessary for
portability of parallel bit stream algorithms across
big-endian and little-endian architectures: the
notion of shift-forward and shift-back operations.
In essence, shift forward means shift to the left
on little-endian systems and shift to the right on
big-endian systems, while shift back has the reverse
interpretation.  Although this concept is unrelated to
inductive doubling, its inclusion in the IDISA
libraries has provided a suitable basis for portable
SIMD implementations of parallel bit stream algorithms.
Beyond this, the IDISA libraries have the additional
benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without the need for programmer coding of
the underlying shift and mask operations.
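
As an example of the library approach, the following sketch shows
how an inline function might emulate 4-bit field addition in the
spirit of \verb#simd<4>::add# on SSE2; the function name and
structure are illustrative and are not taken from the actual
library code.
\begin{verbatim}
#include <emmintrin.h>

/* Emulation sketch of 4-bit field addition on SSE2: add the low
   and high nibbles of each byte separately, masking so that no
   carry can propagate across a 4-bit field boundary. */
static inline __m128i simd4_add(__m128i a, __m128i b) {
  const __m128i lo = _mm_set1_epi8(0x0F);
  const __m128i hi = _mm_set1_epi8((char) 0xF0);
  __m128i lo_sum = _mm_and_si128(
      _mm_add_epi8(_mm_and_si128(a, lo), _mm_and_si128(b, lo)), lo);
  __m128i hi_sum = _mm_and_si128(
      _mm_add_epi8(_mm_and_si128(a, hi), _mm_and_si128(b, hi)), hi);
  return _mm_or_si128(lo_sum, hi_sum);
}
\end{verbatim}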

\subsection{IDISA Model}
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}

Figure \ref{pipeline-model} shows a block diagram
for a pipelined SIMD processor implementing IDISA.
The SIMD Register File (SRF) provides a file of $R = 2^A$
registers each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch
Unit (IFU) are forwarded for decoding to the SIMD
Instruction Decode Unit (SIDU).  This unit decodes
the instruction to produce
signals identifying the source and destination
operand registers, the half-operand modifiers, the
field width specification and the SIMD operation
to be applied.

The SIDU supplies the source register information and the half-operand
modifier information to the SIMD Operand Fetch Unit (SOFU).
For each source operand, the SIDU provides an $A$-bit register
address and two 1-bit signals $h$ and $l$ indicating the value
of the decoded half-operand modifiers for this operand.
Only one of these values may be 1; both are 0 if
no modifier is specified.
In addition, the SIDU supplies decoded field width information
to both the SOFU and the SIMD Instruction Execute Unit (SIEU).
The SIDU also supplies decoded SIMD opcode information to the SIEU and
a decoded $A$-bit register address for the destination register to
the SIMD Result Write Back Unit (SRWBU).

The SOFU is the key component of the IDISA model that
differs from that found in a traditional SWAR
processor.  For each of the two $A$-bit source
register addresses, SOFU is first responsible for
fetching the raw operand values from the SRF.
Then, before supplying operand values to the
SIEU, the SOFU applies the half-operand modification
logic as specified by the $h$, $l$, and field-width
signals.  The possibly modified operand values are then
provided to the SIEU for carrying out the SIMD operations.
A detailed model of SOFU logic is described in the following
subsection.
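
As a behavioral summary, the half-operand modification applied to
one 64-bit slice of an operand amounts to the following
hypothetical C model, assuming the modifier semantics used by the
algorithms above: \verb:<h>: selects the high $n/2$ bits of each
field, shifted to the low positions, while \verb:<l>: selects the
low $n/2$ bits.
\begin{verbatim}
#include <stdint.h>

/* Behavioral model of half-operand modification for one 64-bit
   operand slice.  fw is log2 of the field width n; h and l are
   the decoded 1-bit modifier signals (at most one is set). */
uint64_t sofu_modify(uint64_t operand, int fw, int h, int l) {
  int n = 1 << fw;
  uint64_t field_mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
  uint64_t half_mask = (1ULL << (n / 2)) - 1;
  if (!h && !l) return operand;      /* no modifier: pass through */
  uint64_t result = 0;
  for (int i = 0; i < 64; i += n) {  /* each n-bit field */
    uint64_t f = (operand >> i) & field_mask;
    f = h ? (f >> (n / 2)) : (f & half_mask);
    result |= f << i;
  }
  return result;
}
\end{verbatim}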

The SIEU differs from similar execution units in
current commodity processors primarily by providing
SIMD operations at each field width
$n=2^k$ for $0 \leq k \leq K$.  This involves
additional circuitry for field widths not supported
in existing processors.  For inductive doubling
algorithms in support of parallel bit streams,
the principal need is for additional circuitry to
support 2-bit and 4-bit field widths.  This circuitry
is generally less complicated than that for larger
fields.  Support for circuitry at these widths
has other applications as well.   For example,
DNA sequences are frequently represented using
packed sequences of 2-bit codes for the four possible
nucleotides\cite{}, while the need for accurate financial
calculation has seen a resurgence of the 4-bit
packed BCD format for decimal floating point \cite{}.

When execution of the SWAR instruction is
completed, the result value is then provided
to the SRWBU to update the value stored in the
SRF at the address specified by the $A$-bit
destination operand.

\subsection{Operand Fetch Unit Logic}

Discussion of gate-level implementation.

\section{Conclusions}

In considering the architectural support for
SIMD text processing using the method of parallel bit streams,
this paper has presented the principle of inductive doubling
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA features support for SWAR operations at all power-of-2
field widths, including 2-bit and 4-bit widths in particular,
as well as half-operand modifiers and pack and merge operations
to support efficient transitions between successive power-of-2
field widths.  The principal innovation is the notion of
half-operand modifiers that eliminate the cost associated
with the explicit mask and shift operations required for
such transitions.

Several algorithms key to parallel bit stream methods
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on existing architectures.  In the case
of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with the
lowest total number of operations that can be achieved by
any possible 3-register SWAR architecture.

Applications of IDISA in other areas have also been
examined.  The support for 2-bit and 4-bit field widths
in SWAR processing is beneficial for packed DNA representations
and packed decimal representations, respectively.  Additional
inductive doubling examples have been presented and the phenomenon
of power-of-2 transitions discussed more broadly.
Most significantly, IDISA supports a fully general approach
to horizontal SWAR operations, offering a considerable
improvement over the {\em ad hoc} sets of special-purpose
horizontal operations found in existing SWAR instruction sets.

An IDISA implementation model has been presented employing
a customized operand fetch unit to implement the half-operand
modifier logic.  Gate-level implementation of this unit
has been analyzed and shown to be quite reasonable.

Many special-purpose operations that are found in existing
processors could instead be implemented through efficient
IDISA sequences.  These include examples such as population
count, count leading and/or trailing zeroes, and parity.
They also include specialized horizontal SIMD operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special-purpose instructions
in favor of more general IDISA features.

Other tradeoffs may be possible in the IDISA implementation itself.
Full support of IDISA features at the largest field widths
is not necessary in many cases.   For example, a given 128-bit
SIMD facility may support IDISA features only up to 32-bit
or 64-bit fields, similar to the current Altivec and SSE
architectures, respectively.   It may also be possible to
reduce the complexity of operand fetch circuitry by a factor
of two by dedicating one operand to a possible high half-operand
modifier and the other to a possible low half-operand modifier.

Future research may consider the extension of inductive doubling
support in further ways.   For example, it may be possible
to develop a pipelined architecture supporting two or three
steps of inductive doubling in a single operation.

%\appendix
%\section{Appendix Title}
%
%This is the text of the appendix, if you need one.

\acks

This research was supported by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada.

%\bibliographystyle{plainnat}
\bibliographystyle{plain}
\bibliography{asplos094-cameron}
%\begin{thebibliography}{}

%\bibitem{smith02}
%Smith, P. Q. reference text

%\end{thebibliography}

\end{document}