source: docs/ASPLOS09/asplos094-cameron.tex @ 230

Last change on this file since 230 was 230, checked in by cameron, 11 years ago

Further improvements to population count section.

File size: 56.6 KB
RevLine 
[227]1%-----------------------------------------------------------------------------
2%
3% asplos094-cameron.tex
4% Robert D. Cameron and Dan Lin
5%
6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
7%
8%-----------------------------------------------------------------------------
9\input epsf
10
11%\documentclass[preprint,natbib,10pt]{sigplanconf}
12%\documentclass[natbib,10pt]{sigplanconf}
13\documentclass[10pt]{sigplanconf}
14
15\usepackage{amsmath}
16\usepackage{graphicx}
17
18\begin{document}
19
20\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
21\CopyrightYear{2009}
22\copyrightdata{978-1-60558-215-3/09/03}
23
24
25\titlebanner{banner above paper title}        % These are ignored unless
26\preprintfooter{short description of paper}   % 'preprint' option specified.
27
28\title{Architectural Support for SWAR Text Processing with Parallel Bit
29Streams: The Inductive Doubling Principle}
30%\subtitle{Subtitle Text, if any}
31
32\authorinfo{Robert D. Cameron \and Dan Lin}
33           {School of Computing Science, Simon Fraser University}
34           {\{cameron, lindanl\}@cs.sfu.ca}
35
36
37\maketitle
38
39\begin{abstract}
40Parallel bit stream algorithms exploit the SWAR (SIMD within a
41register) capabilities of commodity
42processors in high-performance text processing applications such as
43UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
44matching.   Direct architectural support for these algorithms in future SIMD
45instruction sets could further increase performance as well as simplifying the
46programming task. A set of simple SWAR instruction set extensions are proposed
47for this purpose based on the principle of systematic support for  inductive
48doubling as an algorithmic technique.  These extensions are shown to
49significantly reduce instruction count in core parallel bit stream algorithms,
50often providing a 3X or better improvement.  The extensions are also shown to be useful
51for SIMD programming in other application areas, including providing a
52systematic treatment for horizontal operations.  An implementation model for
53these extensions
54involves relatively simple circuitry added to the operand fetch components
55in a pipelined processor.
56\end{abstract}
57
58\category{CR-number}{subcategory}{third-level}
59
60\terms
61term1, term2
62
63\keywords
64keyword1, keyword2
65
66
67
68\section{Introduction}
69
70In the landscape of parallel computing research, finding ways to
71exploit intrachip (multicore) and intraregister (SWAR) parallelism
72for text processing and other non-numeric applications is particularly
73challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
74study \cite{Landscape} identifies the finite-state machine algorithms associated
75with text processing to be the hardest of the thirteen ``dwarves''
76to parallelize, concluding that nothing seems to help.  Indeed,
77the study even speculates that applications in this area may simply be
78``embarrasingly sequential,'' easy to tackle for traditional
79sequential processing approaches suitable for uniprocessors,
80but perhaps fundamentally unsuited to parallel methods.
81
82One approach that shows some promise, however, is the
83method of parallel bit streams, recently applied to
84UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
85and amino acid sequencing\cite{Green}.
86In this method, byte-oriented character data is first transposed to eight
87parallel bit streams, one for each bit position within the character code
88units (bytes).  Loading bit stream data into 128-bit SIMD registers,
89then, allows data from 128 consecutive code units to be represented and
90processed at once.  Bitwise logic and shift operations, bit scans,
91population counts and other bit-based operations are then used to carry
92out the work.
93
94The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
95speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
96employing the SSE or Altivec instruction sets.  A full implementation
97for each of these platforms is documented using literate programming techniques
98and available as an open source code base \cite{u8u16}.
99Although this transcoding task represents but a single text processing kernel,
100it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
101of processing time\cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
102it illustrates a text processing task that can largely
103be carried out using SIMD operations even though
104UTF-8 character sequences are variable length. Indeed, one weakness of the
105actual implementation is that it reverts to byte-at-a-time processing
106at block boundaries, employing a block shortening strategy to
107reduce block lengths to as low as 125 bytes from 128.
108With more capable SIMD instruction set architectures providing
109better facilities for multiregister shifts and larger register
110files, a solution employing SIMD techniques for virtually all
111of the main work would maintain better data alignment, avoid
112problems of register pressure and be easier to parallelize across
113multiple cores.  It should also naturally scale to 256-bit SIMD technology
114such as the expected AVX technology of Intel.
115
116The report addressing the broader problem of XML parsing is perhaps more
117interesting, demonstrating the utility of parallel bit stream techniques
118in delivering performance benefits through a significant portion of the
119web technology stack.  In an XML statistics gathering application,
120including the implementation of XML well-formedness checking, an
121overall 3X to 10X performance improvement is achieved in using
122the parallel bit stream methods in comparison with a similarly
123coded application using such well known parsers as Expat and Xerces.
124Although still a work in progress, the parser has functional
125coverage of XML and related specifications comparable to, but somewhat
126beyond Expat.   The study also identifies promising further
127work in extending the parallel bit stream methods to parallel
128hash value computation and parallel regular expression matching
129for the purpose of validating XML datatype declarations in
130accord with XML Schema.
131
132Given these promising initial results in the application of
133parallel bit stream methods, what role might architectural support play in
134further enhancing this route to parallelization of text processing?
135This paper addresses this question through presentation and
136analysis of a constructive proposal: a set of SIMD instruction set
137features based on the principle of systematic support for
138inductive doubling algorithms.  Inductive doubling refers
139to a general property of certain kinds of algorithm that
140systematically double the values of field widths or other
141data attributes with each iteration.  In essence, the goal
142of the proposed features is to support such algorithms
143with specific facilities to  transition between successive power-of-2 field
144widths.   These transitions are quite frequent in several critical
145algorithms for parallel bit streams.   These transitions also
146occur in other applications as well.  In related work,
147efficient automatic interleaving based on power-of-2 strides has been
148reported quite useful for a number of SIMD kernels \cite{Nuzman}.
149The specific features presented herein will be referred to
150as IDISA: inductive doubling instruction set architecture.
151
152To assess the benefits of IDISA features in comparison to
153those found in existing SIMD support on commodity
154processors, we will focus on an idealized three-register
155model of SIMD instruction sets and evaluation of
156parallel bit stream and other computational kernels with
157respect to these features.  The three-register model is
158based on the general approach to binary SIMD operations
159as ones that operate on the contents of two source operand
160registers to produce a value to be written back to a
161single destination operand register.  This three-register
162model is directly used by the Altivec SIMD instructions
163of the Power PC, for example.  On the Intel SSE platform,
164the three-register model is used as a programmer's
165interface for the C-language intrinsics, while the
166underlying instructions use a two-register model with
167destructive updating.  The computational kernels we study consist
168of 100\% branch-free code operating on registers, without
169memory access   For such kernels, we use straight-line
170instruction count as the performance metric of interest,
171assuming that pipelined processors are well-designed
172to handle latencies. 
173
174
175The remainder of this paper is organized as follows.
176
177The second section of this paper introduces IDISA and the
178SIMD notation used throughout this paper.  A brief comparison of
179IDISA features with existing SIMD instruction
180sets of commodity processors such as the Intel SSE
181instruction set and the Power PC Altivec instruction set
182is also made.
183
184The third section provides a short first example of
185the inductive doubling principle in action through
186the case of population count.   Although this operation
187is not a strong determinant of performance for parallel bit
188stream applications, it is nevertheless an operation needed
189frequently enough in the general computing milieux to find
190its way into some instruction set architectures, typically
191at one particular field width.  By way of comparison, the
192inductive doubling architecture sacrifices some
193performance at the chosen field width, while offering a more
194general solution with frequently better performance at
195other field widths.
196
197The fourth section then moves on to consider the performance-critical
198and key task of conversion between serial byte streams and parallel
199bit streams.  A first algorithm that uses the existing SIMD
200operations common to SSE and Altivec is shown, requiring 72
201operations to transform 128 bytes of data using the three-register
202instruction form.  We then move on to consider how the task may
203be simplified using IDISA to
204require a mere 24 operations.  As well as providing a 3X speed-up,
205it is also argued that the version using the inductive doubling
206architecture is considerably simpler and easier to program.
207
208The fifth section then briefly considers the inverse transposition
209process, converting parallel bit stream data back into byte
210streams.  Again, an algorithm to carry out this task requires
21172 straight-line SIMD operations in the Altivec three-register
212form, but is reduced to a simpler 24 operations using IDISA.
213
214The sixth section then goes on to consider the problem of
215parallel bit deletion.  This operation is performance-critical
216to any applications that require filtering or
217editing operations on strings using the parallel bit stream
218algorithms.   For example, it is fundamental to the
219high-speed UTF-8 to UTF-16 transcoding algorithm that is
220often a critical component in XML parsing.  In this
221section, an inductive doubling algorithm based on the
222concept of a central deletion result is described and
223shown to have much better performance than a parallel-prefix
224alternative.
225
226The seventh section considers the issue of
227horizontal SIMD operations, that is, operations for combining
228sets of adjacent fields within individual SIMD registers rather than
229corresponding fields within sets of registers.  While existing
230SIMD instruction set architectures tend to only support a few
231ad hoc horizontal combinations, IDISA is shown to provide a systematic
232means for efficient horizontal combinations of any kind.
233
234An implementation model for IDISA is then considered
235in section 8 of the paper, focusing on a pipelined
236SIMD architecture featuring a modified operand fetch stage.
237A gate-count analysis of one feasible implementation is
238provided as well as a discussion of the implementation
239of required extensions to handle 2-bit and 4-bit fields not
240commonly supported on existing SIMD architectures.   Design
241tradeoffs are also considered focusing the potential removal of
242various {\em ad hoc} instructions on existing processors in favor of
243more general alternatives provided through IDISA.
244
245The ninth section concludes the paper with a summary of results
246and discussion of areas for future work.
247
248
249\section{Inductive Doubling Architecture}
250
251This section presents an idealized model for a single-instruction
252multiple-data (SIMD) instruction set architecture designed
253specifically to support inductive doubling algorithms in the
254domain of parallel bit stream programming.   The architecture is idealized
255in the sense that we concentrate on only the necessary features
256for our purpose, without enumerating the additional
257operations that would be required for
258SIMD applications in other domains.  The goal is to focus
259on the principles of inductive doubling support in a way
260that can accommodate a variety of realizations as other
261design constraints are brought to bear on the overall instruction set
262design.  First we introduce a simple model and notation for
263SIMD operations in general and then present the four key
264features of an idealized architecture in support of parallel
265bit stream programming.
266
267The idealized architecture supports typical SIMD integer
268operations common to existing commodity architectures such as SSE
269and Altivec.   The architecture is defined to support SIMD
270operations on registers of size $N=2^K$ bits, for some integer $K$.
271Typical values of $K$ for commodity processors include $K=6$ for
272the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
273the 128-bit registers of SSE and Altivec technology and $K=8$ for
274the upcoming Intel AVX technology.   The registers may be
275partitioned into $N/n$ fields of width $n=2^k$ bits for some values
276of $k \leq K$.   Typical values of $k$ used on commodity processors
277include $k = 3$ for SIMD operations on 8-bit fields (bytes),
278$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
279on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit
280fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$
281Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
282register $r$, using big-endian numbering.
283
284Let \verb:simd<n>: represent the class of SIMD operations defined
285on fields of size $n$ using C++ template syntax.  Given a
286binary function $F_n$ on $n$-bit fields, we denote the SIMD
287version of this operation as \verb#simd<n>::F#.  Given two
288SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
289respectively, the operation \verb#r=simd<n>::F(a,b)# stores
290the value $r$ in the register \verb:r: as determined by
291the simultaneous calculation of individual field values in
292accord with the following equation.
293\[ r_i = F_n(a_i, b_i) \]
294
295For example, addition(\verb:add:), subtraction (\verb:sub:) and shift left
296logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
297integers as follows.
298%\singlespace
299\begin{eqnarray}
300\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
301\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
302\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
303\end{eqnarray}
304%\doublespace
305The SSE and Altivec instruction sets support
306each of the addition and subtraction operations in SIMD form
307for 8, 16 and 32-bit fields, while the SSE set also includes
308the 64-bit forms.  For example, \verb#simd<8>::add# in our
309nomenclature is provided by the operation \verb:paddb:
310on SSE and the operation \verb:vaddubm: on Altivec.
311The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
312\verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are
313more constrained, requiring that all field shifts by the same amount.
314
315Given these definitions and notation, we now present
316the four key elements of an inductive doubling architecture.
317The first is a definition of a core set of binary functions
318on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
319The second is a set of {\em half-operand modifiers} that allow
320the inductive processing of fields of size $2n$ in terms of
321combinations of $n$-bit values selected from the fields.
322The third is the definition of packing operations that compress
323two consecutive registers of $n$-bit values into a single
324register of $n/2$-bit values.  The fourth is the definition
325of merging operations that produce a set of $2n$ bit fields
326by concatenating corresponding $n$-bit fields from two
327parallel registers.  Each of these features is described below.
328
329For the purpose of direct and efficient support for inductive
330doubling algorithms on bit streams, the provision of
331a core set of operations at field widths of 2 and 4 as
332well as the more traditional field witdhs of 8, 16 and 32
333is critical for elegant and efficient expression of the
334algorithms.  In essence, inductive doubling algorithms
335work by establishing some base property at either single
336or 2-bit fields.  Each iteration of the algorithm then
337goes on to establish the property for the power-of-2
338field width.  In order for this inductive step to be
339most conveniently and efficiently expressed, the
340core operations needed for the step should be available
341at each field width.  In the case of work with parallel
342bit streams, the operations \verb:add:, \verb:sub:,
343\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
344(rotate left) comprise the core.  In other domains,
345additional operations may be usefully included in the
346core depending on the work that needs to be performed
347at each inductive doubling level.
348
349Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
350also includes fields of width 1.  These are included for
351logical consistency, but are easily implemented by mapping
352directly to appropriate bitwise logic operations, which
353we assume are also available.  For example,
354\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
355bitwise exclusive-or operation on SIMD registers.
356
357The second key facility of the inductive doubling architecture
358is the potential application of half-operand modifiers to
359the fields of either or both of the operands of a SIMD
360operation.  These modifiers select either the
361low $n/2$
362bits of each $n$-bit field (modifier ``\verb:l:'') or the
363high $n/2$ bits (modifier ``\verb:h:'').  When required,
364the modifier ``\verb:x:'' means that the full $n$ bits
365should be used, unmodified.  The semantics of these
366modifiers are given by the following equations.
367%\singlespace
368\begin{eqnarray}
369l(r_n) & = & r_n \bmod 2^{n/2} \\
370h(r_n) & = & r_n / 2^{n/2} \\
371x(r_n) & = & r_n
372\end{eqnarray}
373%\doublespace
374In our notation, the half-operand modifiers are
375specified as optional template (compile-time) parameters
376for each of the binary functions.  Thus,
377\verb#simd<4>::add<l,h>(a,b)# is an operation which adds
378the 2-bit quantity found in the low 2-bits of each 4-bit field
379of its first operand (\verb:a:)
380together with the corresponding 2-bit quantity found in the
381high 2-bits of its second operand (\verb:b:).
382In general, the purpose of the half-operand modifiers
383in support of inductive doubling is to allow the processing
384of $n$-bit fields to easily expressed in terms of
385combination of the results determined by processing
386$n/2$ bit fields.
387
388The third facility of the inductive doubling architecture
389is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
390The field values of \verb#r=simd<n>::pack(a,b)# are
391defined by the following equations.
392%\singlespace
393\begin{eqnarray}
394r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
395r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
396\end{eqnarray}
397%\doublespace
398Here conv is a function which performs conversion of an $n$-bit
399value to an $n/2$ bit value by signed saturation (although
400conversion by unsigned saturation would also suit our purpose).
401
402Half-operand modifiers may also be used with the pack
403operations.  Thus packing with conversion by masking off all
404but the low $n/2$ bits of each field may be
405be performed using the operation \verb#simd<n>::pack<l,l>#
406
407
408The final facility of the inductive doubling architecture is
409a set of merging operations \verb#simd<n>::mergeh# and
410\verb#simd<n>::mergel# that produce $2n$-bit fields
411by concatenating corresponding $n$-bit fields from the
412operand registers.   The respective
413operations \verb#r=simd<n>::mergeh(a,b)# and
414\verb#s=simd<n>::mergel(a,b)#
415are defined by the following equations.
416%\singlespace
417\begin{eqnarray}
418r_{2n}[i] & = & a[i] \times 2^n + b[i] \\
419s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)]
420\end{eqnarray}
421%\doublespace
422Both SSE and Altivec provide versions of pack and merge operations
423for certain field widths.  The pack operations are provided
424with operands having 16-bit or 32-bit fields on each platform, although
425with some variation in how conversion is carried out.
426The merge operations are provided at 8-bit, 16-bit and 32-bit
427field widths on both architectures and also at the 64-bit level
428on SSE.
429
430This completes the description of the proposed inductive
431doubling architecture.  As described, many of the features
432are already available with the SIMD facilities of
433existing commodity processors.  The extensions enumerated
434here are relatively straightforward.  The innovation
435is to specifically tackle the design of facilities to
436offer systematic support for transitions between power-of-2 field widths.
437As we shall show in the remainder of this paper, these facilities
438can dramatically reduce instruction count in core parallel bit
439stream algorithms, with a factor of 3 reduction being typical.
440The next section goes on to illustrate a first simple example
441for the task of population count.
442
443\section{Population Count}
444
445\begin{figure}
446\begin{center}\small
447\begin{verbatim}
448c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
449c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
450c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
451c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
452c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
453\end{verbatim}
454\end{center}
455\caption{Population Count Reference Algorithm}
456\label{HD-pop}
457\end{figure}
458
459\begin{figure}
460\begin{center}\small
461\begin{verbatim}
462c = simd<2>::add<l,h>(x, x);
463c = simd<4>::add<l,h>(c, c);
464c = simd<8>::add<l,h>(c, c);
465c = simd<16>::add<l,h>(c, c);
466c = simd<32>::add<l,h>(c, c);
467\end{verbatim}
468\end{center}
469\caption{Ideal SIMD Population Count}
470\label{inductivepopcount}
471\end{figure}
472
473As an initial example to illustrate the principle of inductive doubling
474in practice, consider the problem of {\em population count}: determining
475the number of one bits within a particular bit field.  It is important
476enough for such operations as
477calculating Hamming distance to be included as a built-in instruction
478on some processors.  For example, the SPU of the Cell Broadband Engine
479has a SIMD population count instruction \verb:si_cntb: for simultaneously
480determining the
481number of 1 bits within each byte of a 16-byte register.
482In text processing with parallel bit streams, population count has direct
483application to keeping track of line numbers for error reporting, for example.
484Given a bit block identifying the positions of newline characters within
485a block of characters being processed, the population count of the
486bit block can be used to efficiently and conveniently be used to update
487the line number upon completion of block processing.
488
489Figure \ref{HD-pop} presents a traditional divide-and-conquer
490implementation for a 32-bit integer {\tt x} adapted from
491Warren \cite{HackersDelight}, while
[230]492Figure \ref{inductivepopcount} shows the corresponding IDISA
493implementation for a vector of 32-bit fields.  Each implementation employs
[227]494five steps of inductive doubling to produce population counts
495within 32 bit fields.  The traditional implementation employs
496explicit masking and shifting operations, while these
497operations are implicit within the semantics of the inductive
[230]498doubling instructions shown in Figure \ref{inductivepopcount}.
[227]499In each implementation, the first step determines the
500the population counts within 2-bit fields
501by adding the high bit of each such field to the low bit
502to produce a set of 2-bit counts in {\tt c}.
503In the second step, the counts within 4-bit fields of {\tt c} are determined
504by adding the counts of the corresponding high and low 2-bit subfields.
505Continuing in this fashion,
506the final population counts within 32-bit fields are determined in five steps.
507
508With the substitution of longer mask constants replicated for four
50932-bit fields, the implementation of Figure \ref{HD-pop} can be
[230]510directly adapted to SWAR processing using 128-bit registers.
511Each binary operator is replaced by a corresponding binary
512SIMD operation.   Without the IDISA features, a
513straightforward SWAR implementation of population count for
51432-bit fields thus employs 10 operations to load or generate
[227]515mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
[230]516total of 30 operations to complete the task.   Employing
517optimization identified by Warren, this can be reduced to
51820 operations, 5 of which are required to generate mask constants.
519At the cost of register pressure, it is possible that these constants
520could be kept preloaded in long vector processing.
521In any case, the IDISA implementation requires only 5 operations
522to carry out the work requiring 15 to 20 operations with the
523optimized version using standard SWAR operations.   IDISA thus offers
[227]524offers a 3X to 4X improvement in instruction count as well as
525a reduction in register pressure.
526
[230]527The pattern illustrated by population count is typical.  An
528inductive doubling algorithm of $n$ steps typically applies
529two mask or shift operations at each step for each of the
530two operands being combined in the step.   If the combination
531can be implemented in a single SWAR operation, then a total
532of $3n$ operations are the minimum required for the SWAR
533algorithm with IDISA features.   An IDISA implementation
534typically eliminates the explicit mask and shift operations
535through appropriate half-operand modifiers, reducing the
536total instruction count to $n$.   Thus a 3X improvement
537is typical.
538
[227]539\section{Transposition to Parallel Bit Streams}
540
541In this section, we consider the first major
542application of the inductive doubling architecture:
543transposition of byte stream data to parallel bit stream
544form.   Of course, this operation is critical to the
545method of parallel bit streams and all applications
546of the method can benefit from a highly efficient
547transposition process.  Before considering how
548the inductive doubling architecture supports this
549transposition process, however, we first consider
550algorithms on existing architectures.  Two algorithms
551are presented; the best of these requires 72
552SIMD operations in the three-register model to perform
553transposition of eight serial registers of byte stream data into
554eight parallel registers of bit stream data.
555
556We then go on to show how the transposition problem
557can be solved using the inductive doubling architecture
558with a mere 24 three-register SIMD operations.  We also prove
559that this is optimal for any three-register instruction set model.
560
561
562\begin{figure}[tbh]
563\begin{center}
564\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
565\caption{Input/Output Model for Serial to Parallel Transposition}
566\label{s2p-spec}
567\end{center}
568
569\end{figure}
570Figure \ref{s2p-spec} illustrates the input-output requirements of
571the transposition problem.  We assume that inputs and
572outputs are each SIMD registers of size $N=2^K$ bits.
573The input consists of $N$ bytes of serial byte data,
574stored consecutively in eight SIMD registers each holding
575$N/8$ bytes.  The output consists of eight parallel
576registers, one each for the eight individual bit positions
577within a byte.  Upon completion of the transposition process,
578each output register is to hold the $N$ bits corresponding
579to the selected bit position in the sequence of $N$ input
580bytes.
581
582\subsection{Bit Gathering Algorithm}
583
584\begin{figure}[tbh]
585\begin{center}
[229]586\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
[227]587\caption{Serial to Parallel Transposition Using Bit-Gathering}
588\label{gather}
589\end{center}
590\end{figure}
591One straightforward algorithm for implementing the transposition process
592takes advantage of SIMD bit gathering operations that exist
593on some architectures.  This operation gathers one bit per byte
594from a particular position within each byte of a SIMD register.
595For example, the {\tt pmovmskb} operation of the Intel MMX and
596SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
597consisting of the high bit of each byte.  Similarly, the
598{\tt \verb:si_gbb:} operation of the synergistic processing units of the
599Cell Broadband Engine gathers together the low bit of each byte
600from the SIMD register.  Figure \ref{gather} illustrates the
601bit gathering process.
602
603For each bit stream of output, the bit gather algorithm requires
604one gather operation for each of the 8 input registers,
605giving 64 bit gather operations in all.  In addition, for seven
606of the eight bit positions, it is necessary to shift the bits
607of each input register into the conventional position for
608gathering.  A total of 56 shift operations are required for this
609task.  Finally, the result of each bit gather operation must
610be properly inserted into the output stream.  If the first
611gather operation for a stream can be used to also initialize
612the output register, there will then need to be 7 insert
613operations for the results of the remaining gather operations
614for this stream, making 56 insert operations in all.
615The insert step may be more complex than a single operation
616in some cases, but we use one operation per insert as a lower bound.
617Thus, the bit gather algorithm requires
618at least 176 operations to perform the transposition task.
619
620\subsection{BytePack Algorithm}
621
622A much more efficient transposition algorithm on commodity
623SIMD architectures (SSE and Altivec) involves three
624stages of binary division transformation.  This is similar
625to the three stage bit matrix inversion described by
626Warren  \cite{HackersDelight}, although modified to use SIMD operations.
627In each stage, input streams are divided into two half-length output streams.
628The first stage separates the bits at even numbered positions from those
629at odd number positions.  The two output streams from the first
630stage are then further divided in the second stage.
631The stream comprising even numbered bits from the original byte stream
632divides into one stream consisting of bits from positions 0 and 4 of each
633byte in the original stream and a second stream consisting of bits
634from positions 2 and 6 of each original byte.  The stream of bits from
635odd positions is similarly divided into streams for bits from Each of the
636positions 1 and 5 and bits from positions 2 and 6.
637Finally, each of the four streams resulting from the second stage are
638divided into the desired individual bit streams in the third stage.
639
640\begin{figure}[tbh]
641\begin{center}
642\begin{verbatim}
643t0 = simd<16>::pack<h,h>(s0, s1);
644t1 = simd<16>::pack<l,l>(s0, s1);
645p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1));
646p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1);
647\end{verbatim}
648\end{center}
649\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
650\label{s2pstep}
651\end{figure}
652
653The binary division transformations are accomplished in each stage
654using byte packing, shifting and masking.  In each stage, a
655transposition step combines each pair of serial input registers to
656produce a pair of parallel output registers.  In essence,
657doublebytes from the input registers are packed into bytes
658in the output registers, with the bits from even positions stored
659in the bytes of one output stream and the bits from odd positions
660stored in the bytes of the second output stream.
661Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
662producing two parallel registers \verb:p0: and \verb:p1: from
663two serial registers \verb:s0: and \verb:s1:.  This step is applied
664four times in stage 1; stages 2 and 3 also consist of four applications
665of a similar step with different shift and masking constants.
666
667Although we have used the idealized SIMD notation here, each of the
668operations maps to a single operation in the Altivec set and a small number
669of operations in the SSE set.  Using the Altivec set, there are
6706 operations per step for a total of 24 operations per stage. 
671The three stages combined required 72 operations to transpose 128 bytes
672to parallel bit stream form.  This is the best algorithm known to
673us for existing SIMD architectures.
674
675\subsection{Inductive Halving Algorithm}
676
677Using the inductive doubling architecture, it is possible to design
678a transposition algorithm that is both easier to understand and requires
679many fewer operations than the the bytepack algorithm described above.
680We call it the inductive halving algorithm for serial to parallel
681transposition, because it proceeds by reducing byte streams to
682two sets of nybble streams in a first stage, dividing the nybble
683streams into streams of bitpairs in a second stage and finally
684dividing the bitpair streams into bit streams in the third stage.
685
686
687\begin{figure}[tbh]
688\begin{verbatim}
689p0 = simd<8>::pack<h,h>(s0, s1);
690p1 = simd<8>::pack<l,l>(s0, s1);
691\end{verbatim}
692\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
693\label{halvingstep}
694\end{figure}
695
696Figure \ref{halvingstep} shows one step in stage 1 of the inductive
697halving algorithm, comprising just two SIMD operations.
698The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
699from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
700the low nybble of each byte.  As in the bytepack algorithm, this step is
701applied 4 times in stage 1, for a total of 8 operations.
702
703Stage 2 of the inductive halving algorithm reduces nybble streams
704to streams of bit pairs.  The basic step in this algorithm consists
705of one \verb#simd<4>::pack<h,h># operation to extract the high pair
706of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
707low pair of each nybble.  Four applications of this step complete stage 2.
708
709Stage 3 similarly uses four applications of a step that uses a
710\verb#simd<2>::pack<h,h># operation to extract the high bit of
711each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
712each pair.  The complete algorithm to transform eight serial
713byte registers s0 through s7 into the eight parallel bit stream
714registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
715
716\begin{figure}[tbh]
717\begin{verbatim}
718hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
719lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
720hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
721lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
722hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
723lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
724hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
725lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
726hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
727hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
728lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
729ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
730hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
731hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
732lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
733ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
734bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
735bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
736bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
737bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
738bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
739bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
740bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
741bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
742\end{verbatim}
743\caption{Complete Inductive Halving Algorithm}
744\label{halvingalgorithm}
745\end{figure}
746
747\subsection{Optimality of the Inductive Halving Algorithm}
748
749Here we show that the inductive halving algorithm presented in
750the previous subsection is optimal in the following sense:
751no other algorithm on any 3-register SIMD architecture can use
752fewer than 24 operations to transform eight serial registers
753of byte stream data into eight parallel registers of bit stream data.
754By 3-register SIMD architecture, we refer to any architecture
755that uses SIMD instructions consistent with our overall model of
756binary operations using two input register operands to produce
757one output register value.
758
759Observe that the $N$ data bits from each input register must be
760distributed $N/8$ each to the 8 output registers by virtue of
761the problem definition.  Each output register can effectively
762be given a 3-bit address; the partitioning problem can be viewed
763as moving data to the correct address.  However, each
764operation can move results into at most one register. 
765At most this can result in the assignment of one correct address
766bit for each of the $N$ input bits.  As all $8N$ input bits
767need to be moved to a register with a correct 3-bit address,
768a minimum of 24 operations is required.
769
770\section{Parallel to Serial Conversion}
771
772Parallel bit stream applications may apply string editing
773operations in bit space to substitute, delete or insert
774parallel sets of bits at particular positions.  In such cases,
775the inverse transform that converts a set of parallel bit
776streams back into byte space is needed.  In the example of
777UTF-8 to UTF-16 transcoding, the inverse transform is
778actually used twice for each application of the forward
779transform, to separately compute the high and low byte
780streams of each UTF-16 code unit.  Those two byte streams
781are subsequentely merged to form the final result.
782
783Algorithms for performing the inverse transform mirror those
784of the forward transform, employing SIMD merge operations
785in place of pack operations.  The best algorithm known
786to us on the commodity SIMD architectures takes advantage
787of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
788operations that are available with each of the SSE and Altivec instruction
789sets.  These algorithms take 72 operations to perform the
790inverse transposition of 8 parallel registers of bit stream
791data into 8 serial registers of byte stream data. 
792
793\begin{figure}[tbh]
794\begin{center}
795\begin{verbatim}
796bit01_r0 = simd<1>::mergeh(bit0, bit1);
797bit01_r1 = simd<1>::mergel(bit0, bit1);
798bit23_r0 = simd<1>::mergeh(bit2, bit3);
799bit23_r1 = simd<1>::mergel(bit2, bit3);
800bit45_r0 = simd<1>::mergeh(bit4, bit5);
801bit45_r1 = simd<1>::mergel(bit4, bit5);
802bit67_r0 = simd<1>::mergeh(bit6, bit7);
803bit67_r1 = simd<1>::mergel(bit6, bit7);
804bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
805bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
806bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
807bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
808bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
809bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
810bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
811bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
812s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
813s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
814s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
815s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
816s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
817s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
818s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
819s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
820\end{verbatim}
821\end{center}
822\label{p2s-inductive}
823\caption{Parallel to Serial Transposition by Inductive Doubling}
824\end{figure}
825
826An algorithm employing only 24 operations using the
827inductive doubling instruction set architecture is relatively
828straightforward.. In stage 1, parallel registers for individual bit streams
829are first merged with bit-level interleaving
830using \verb#simd<1>::mergeh# and \verb#simd<8>::mergel#
831operations.  For each of the four pairs of consecutive
832even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
833two consecutive registers of bitpair data are produced.
834In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
835are then applied to merge to bitpair streams to produce streams
836of nybbles for the high and low nybble of each byte.  Finally,
837the nybble streams are merged in stage 3 to produce the
838desired byte stream data.  The full inductive doubling
839algorithm for parallel to serial transposition is shown in Figure
840\ref{p2s-inductive}.
841
842This algorithm is again optimal, requiring the fewest operations
843of any possible algorithm using any 3-register instruction set
844model.  The proof directly follows that for the serial to parallel
845transposition.
846
847The existence of high-performance algorithms for transformation of
848character data between byte stream and parallel bit stream form
849in both directions makes it possible to consider applying these
850transformations multiple times during text processing applications.
851Just as the time domain and frequency domain each have their
852use in signal processing applications, the byte stream form and
853parallel bit stream form can then each be used at will in
854character stream applications.
855
856
857
858
859
860\section{Parallel Bit Deletion}
861
862Parallel bit deletion is an important operation that allows string
863editing operations to be carried out while in parallel bit stream
864form.  It is also fundamental to UTF-8 to UTF-16 transcoding
865using parallel bit streams, allowing the excess code unit
866positions for UTF-8 two-, three- and four-byte sequences to
867be deleted once the sixteen parallel bit streams of UTF-16 have
868been computed \cite{PPoPP08}.
869
870Parallel bit deletion is specified using a deletion mask.
871A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
872to be deleted and 0s at positions identifying bits to be retained.
873For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
874bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
875accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
876
877Bit deletion may be performed using
878the parallel-prefix compress algorithm documented by
879Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
880only logic and shifts with a constant parameter to carry
881out the deletion process.  However, it requires $k^2$
882preprocessing steps for a final field width parameter
883of size $2^k$, as well as 4 operations per deletion step
884per stream.  Using the inductive doubling instruction set architecture
885it is possible to carry out bit deletion much more efficiently.
886
887Deletion within fixed size fields or registers may produce results that are either
888left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
889eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
890care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
891left-justified result produces an important intermediate form known as a
892{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
893right-justified and left-justified results from the application of the
8944-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
895stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
896the central result \verb:xxbdefhx:, which may easily be converted to a either a
897left or a right justified 8-element result by an appropriate shift operation.
898
899\begin{figure}
900\begin{center}
901\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
902\hline
903\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
904\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
905\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
906\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
907\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
908\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
909\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
910\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
911\end{tabular}
912\end{center}
913\label{centraldelstep}
914\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
915\end{figure}
916
917The observation about how two $n$-bit central deletion results can
918combine to yield a $2n$ central deletion result provides the basis
919for an inductive doubling algorithm.  Figure \ref{centraldelstep}
920illustrates the inductive process for the transition from 8-bit central
921deletion results to 16-bit central deletion results.  The top row shows
922the original deletion mask, while the second row shows the original
923bit stream to which deletions are to be applied, with deleted bits zeroed out.
924The third row shows the central result for each 8-bit field as the
925result of the previous inductive step.
926
927To perform the $8 \rightarrow 16$ central deletion step, we first form
928the population counts of 4-bit fields of the original deletion mask as
929shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
930an 8-bit central result, we perform a right shift by the population count
931of the low half of the field.  Similarly,
932left-justification requires a left-shift by the population count in the
933high half of the field.
934
935The left and right shifts can be performed simultaneously using a rotate
936left instruction.  Right justification by shifting an $n$ bit field
937$i$ positions to the right is equivalent to a left rotate of $n-i$
938positions.  These rotation amounts are computed by the operation \newline
939\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
940except that don't care fields (which won't be subsequently used)
941are marked \verb:XX:. 
942
943The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
944as shown in row 6, and are combined with the right shift amounts
945by the selection operation \newline \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
946as shown in row 7.  Using these computed values, the inductive step
947is completed by application of the operation \newline \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
948as shown in row 8.
949
950At each inductive doubling level, it requires 4 operations to compute the
951required deletion infomation and one operation per bit stream to perform deletion.
952Note that, if deletion is to be applied to a set of eight parallel bit streams,
953the computed deletion information is used for each stream without recomputation,
954thus requiring 12 operations per inductive level.
955
956In comparison to the parallel-prefix compress method, the method of central
957deletion results using the inductive doubling architecture has far fewer operations.
958The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
959induction versus $4k^2$ for the parallel-prefix method.  Using the computed
960deletion operation, only a single SIMD rotate operation per bit stream
961per level is needed, in comparison with 4 operations per level for parallel-prefix
962compress.
963
964
965\section{Systematic Support for Horizontal SIMD Operations}
966
967In SIMD parlance, {\em horizontal} operations are
968operations which combine values from two or more fields
969of the same register, in contrast to the normal
970{\em vertical} operations which combine corresponding
971fields of different registers.  Horizontal operations
972can be found that combine two (e.g., haddpd on SSE3),
973four (e.g, \verb:si_orx: on SPU), eight (e.g, psadbw on SSE)
974or sixteen values (e.g., vcmpequb on Altivec).  Some
975horizontal operations have a vertical component as well.
976For example, psadbw first forms the absolute value of
977the difference of eight corresponding byte fields before
978performing horizontal add of the eight values, while
979vsum4ubs on Altivec performs horizontal add of sets of
980four unsigned 8-bit fields within one register
981and then combines the result horizontally with
982corresponding 32-bit fields of a second registers.
983
984The space of potential horizontal operations thus has
985many dimensions, including not only the particular
986combining operation and the operand field width, but
987also the number of fields being combined, whether a
988vertical combination is applied and whether it is applied
989before or after the horizontal operation and what the
990nature of the vertical combining operation is.
991Within this space, commodity SIMD architectures tend
992to support only a very few combinations, without any
993particular attempt at systematic support for horizontal
994operations in general.
995
996By making use of \verb:<l,h>: half-operand modifier
997combinations, the inductive doubling architecture
998offers systematic support for horizontal operations
999on pairs of adjacent fields.
1000For example, \verb#simd<16>::add<l,h># adds values
1001in adjacent 8 bit fields to produce 16 bit results,
1002while \verb#simd<32>::min<l,h># can produce the
1003minimum value of adjacent 16-bit fields.  In general,
1004\newline \verb#simd<n>::F<l,h># denotes the horizontal
1005binary combination of adjacent fields for any
1006operator $F$ and field width $n$.
1007
1008Horizontal combinations of larger numbers of fields
1009makes use of the inductive doubling property.
1010For example, consider the or-across operation \verb:si_orx:
1011of the SPU, that performs a logical or operation
1012on four 32-bit fields.  This four field combination
1013involves two steps in the inductive doubling approach.
1014%\begin{singlespace}
1015\begin{verbatim}
1016t = simd<64>::or<l,h>(x, x)
1017t = simd<128>::or<l,h>(t, t)
1018\end{verbatim}
1019%\end{singlespace}
1020This example is also interesting in showing a potential
1021value for supporting bitwise logical operations at
1022different field widths, i.e., specifically for use with
1023half-operand modifiers.
1024
1025Similarly, to combine any eight fields simply requires
1026three inductive doubling steps using the desired
1027operator at successive power-of-two field widths, while
1028combining sixteen fields requires four such operations.
1029In this way, the inductive doubling architecture provides
1030systematic support for horizontal operations well beyond
1031the existing facilities of commodity architectures,
1032although lacking some of the special features found in
1033some cases.
1034
1035
1036\section{Implementation}
1037
[228]1038We have carried implementation work for IDISA in three
1039ways.  First, we have constructed libraries that
1040implement the IDISA instructions by template and/or macro
1041expansion for each of MMX, SSE, Altivec, and SPU platforms.
1042Second, we have developed a model implementation
1043involving a modified operand fetch component
1044of a pipelined SIMD processor.  Third, we have written
1045and evaluated Verilog HDL description of this model
1046implementation.
[227]1047
[228]1048\subsection{IDISA Libraries}
[227]1049
[228]1050Implementation of IDISA instructions using template
1051and macro libraries has been useful in developing
1052and assessing the correctness of many of the algorithms
1053presented here.  Although these implementations do not
1054deliver the performance benefits associated with
1055direct hardware implementation of IDISA, they
1056have been quite useful in providing a practical means
1057for portable implementation of parallel bit stream
1058algorithms on multiple SWAR architectures.  However,
1059one additional facility has also proven necessary for
1060portability of parallel bit stream algorithms across
1061big-endian and little-endian architectures: the
1062notion of shift-forward and shift-back operations.
1063In essence, shift forward means shift to the left
1064on little-endian systems and shift to the right on
1065big-endian systems, while shift back has the reverse
1066interpretation.  Although this concept is unrelated to
1067inductive doubling, its inclusion with the IDISA
1068libraries has provided a suitable basis for portable
1069SIMD implementations of parallel bit stream algorithms.
1070Beyond this, the IDISA libraries have the additional
1071benefit of allowing the implementation
1072of inductive doubling algorithms at a higher level
1073abstraction, without need for programmer coding of
1074the underlying shift and mask operations.
[227]1075
[228]1076\subsection{IDISA Model}
1077Figure \ref{pipeline-model} shows a model architecture
1078for a pipelined SIMD processor implementing IDISA.
1079The SIMD Register File (SRF) provides a file of $R = 2^A$ 
1080registers each of width $N = 2^K$ bits. 
1081IDISA instructions identified by the Instruction Fetch
1082Unit (IFU) are forwarded for decoding to the SIMD
1083Instruction Decode Unit (SIDU).  This unit decodes
1084the instruction to produce
1085signals identifying the source and destination
1086operand registers, the half-operand modifiers, the
1087field width specification and the SIMD operation
1088to be applied.
[227]1089
[228]1090The SIDU supplies the source register information and the half-operand
1091modifier information to the SIMD Operand Fetch Unit (SOFU).
1092For each source operand, the SIDU provides an $A$-bit register
1093address and two 1-bit signals $h$ and $l$ indicating the value
1094of the decoded half-operand modifiers for this operand.
1095Only one of these values may be 1; both are 0 if
1096no modifier is specified.
1097In addition, the SIDU supplies decoded field width information
1098to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).
1099The SIDU also supplies decoded SIMD opcode information to SIEU and
1100a decoded $A$-bit register address for the destination register to
1101the SIMD Result Write Back Unit (SRWBU).
[227]1102
[228]1103The SOFU is the key component of the IDISA model that
1104differs from that found in a traditional SWAR
1105processor.  For each of the two $A$-bit source
1106register addresses, SOFU is first responsible for
1107fetching the raw operand values from the SRF.
1108Then, before supplying operand values to the
1109SIEU, the SOFU applies the half-operand modification
1110logic as specified by the $h$, $l$, and field-width
1111signals.  The possibly modified operand values are then
1112provided to the SIEU for carrying out the SIMD operations.
1113A detailed model of SOFU logic is described in the following
1114subsection.
1115
1116The SIEU differs from similar execution units in
1117current commodity processors primarily by providing
1118SIMD operations at each field width
1119$n=2^k$ for $0 \leq k \leq K$.  This involves
1120additional circuitry for field widths not supported
1121in existing processors.  For inductive doubling
1122algorithms in support of parallel bit streams,
1123the principal need is for additional circuitry to
1124support 2-bit and 4-bit field widths.  This circuity
1125is generally less complicated than that for larger
1126fields.  Support for circuitry at these width
1127has other applications as well.   For example,
1128DNA sequences are frequently represented using
1129packed sequences of 2-bit codes for the four possible
1130nucleotides\cite{}, while the need for accurate financial
1131calculation has seen a resurgence of the 4-bit
1132packed BCD format for decimal floating point \cite{}.
1133
1134When execution of the SWAR instruction is
1135completed, the result value is then provided
1136to the SRWBU to update the value stored in the
1137SRF at the address specified by the $A$-bit
1138destination operand.
1139
1140\subsection{Operand Fetch Unit Logic}
1141
1142
1143
1144
[227]1145\section{Conclusions}
1146
1147This paper has considered the issue of architectural support for
1148SIMD text processing using the method of parallel bit streams and has
1149argued that this architectural support can best be provided
1150through a SIMD instruction set architecture that implements
1151features for direct support of inductive doubling algorithms.
1152Four key features of the inductive doubling architecture have
1153been identified include support for operations at all
1154power-of-2 field widths, half-operand modifiers and
1155pack and merge operations.  The principle innovation is the
1156notion of half-operand modifiers to support efficient
1157transition between successive power-of-two field widths.
1158
1159Several algorithms key to parallel bit stream methods
1160have been examined and shown to benefit from dramatic
1161reductions in instruction count compared to the best
1162known algorithms on existing architecture.  In the case
1163of transposition algorithms to and from parallel bit stream
1164form, the architecture has been shown to make possible
1165straightforward inductive doubling algorithms with the
1166lowest total number of operations that can be achieved by any
1167possible 3-register SIMD architecture.
1168
1169The inductive doubling architecture also has considerable
1170benefits beyond its role in supporting SIMD programming
1171with parallel bit streams.  Notable among these is
1172that the architecture provides a framework for systematic
1173support of horizontal SIMD operations.
1174
1175
1176
1177%\appendix
1178%\section{Appendix Title}
1179%
1180%This is the text of the appendix, if you need one.
1181
1182\acks
1183
1184This research was supported by a Discovery Grant from the
1185Natural Sciences and Engineering Research Council of Canada.
1186
1187%\bibliographystyle{plainnat}
1188\bibliographystyle{plain}
1189\bibliography{asplos094-cameron}
1190%\begin{thebibliography}{}
1191
1192%\bibitem{smith02}
1193%Smith, P. Q. reference text
1194
1195%\end{thebibliography}
1196
1197
1198\end{document}
Note: See TracBrowser for help on using the repository browser.