source: docs/ASPLOS09/asplos094-cameron.tex @ 227

Last change on this file since 227 was 227, checked in by lindanl, 11 years ago

figure adjustment

File size: 53.0 KB
Line 
1%-----------------------------------------------------------------------------
2%
3% asplos094-cameron.tex
4% Robert D. Cameron and Dan Lin
5%
6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
7%
8%-----------------------------------------------------------------------------
9\input epsf
10
11%\documentclass[preprint,natbib,10pt]{sigplanconf}
12%\documentclass[natbib,10pt]{sigplanconf}
13\documentclass[10pt]{sigplanconf}
14
15\usepackage{amsmath}
16\usepackage{graphicx}
17
18\begin{document}
19
20\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
21\CopyrightYear{2009}
22\copyrightdata{978-1-60558-215-3/09/03}
23
24
25\titlebanner{banner above paper title}        % These are ignored unless
26\preprintfooter{short description of paper}   % 'preprint' option specified.
27
28\title{Architectural Support for SWAR Text Processing with Parallel Bit
29Streams: The Inductive Doubling Principle}
30%\subtitle{Subtitle Text, if any}
31
32\authorinfo{Robert D. Cameron \and Dan Lin}
33           {School of Computing Science, Simon Fraser University}
34           {\{cameron, lindanl\}@cs.sfu.ca}
35
36
37\maketitle
38
39\begin{abstract}
40Parallel bit stream algorithms exploit the SWAR (SIMD within a
41register) capabilities of commodity
42processors in high-performance text processing applications such as
43UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
44matching.   Direct architectural support for these algorithms in future SIMD
45instruction sets could further increase performance as well as simplifying the
46programming task. A set of simple SWAR instruction set extensions are proposed
47for this purpose based on the principle of systematic support for  inductive
48doubling as an algorithmic technique.  These extensions are shown to
49significantly reduce instruction count in core parallel bit stream algorithms,
50often providing a 3X or better improvement.  The extensions are also shown to be useful
51for SIMD programming in other application areas, including providing a
52systematic treatment for horizontal operations.  An implementation model for
53these extensions
54involves relatively simple circuitry added to the operand fetch components
55in a pipelined processor.
56\end{abstract}
57
58\category{CR-number}{subcategory}{third-level}
59
60\terms
61term1, term2
62
63\keywords
64keyword1, keyword2
65
66
67
68\section{Introduction}
69
70In the landscape of parallel computing research, finding ways to
71exploit intrachip (multicore) and intraregister (SWAR) parallelism
72for text processing and other non-numeric applications is particularly
73challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
74study \cite{Landscape} identifies the finite-state machine algorithms associated
75with text processing to be the hardest of the thirteen ``dwarves''
76to parallelize, concluding that nothing seems to help.  Indeed,
77the study even speculates that applications in this area may simply be
78``embarrasingly sequential,'' easy to tackle for traditional
79sequential processing approaches suitable for uniprocessors,
80but perhaps fundamentally unsuited to parallel methods.
81
82One approach that shows some promise, however, is the
83method of parallel bit streams, recently applied to
84UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
85and amino acid sequencing\cite{Green}.
86In this method, byte-oriented character data is first transposed to eight
87parallel bit streams, one for each bit position within the character code
88units (bytes).  Loading bit stream data into 128-bit SIMD registers,
89then, allows data from 128 consecutive code units to be represented and
90processed at once.  Bitwise logic and shift operations, bit scans,
91population counts and other bit-based operations are then used to carry
92out the work.
93
94The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
95speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
96employing the SSE or Altivec instruction sets.  A full implementation
97for each of these platforms is documented using literate programming techniques
98and available as an open source code base \cite{u8u16}.
99Although this transcoding task represents but a single text processing kernel,
100it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
101of processing time\cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
102it illustrates a text processing task that can largely
103be carried out using SIMD operations even though
104UTF-8 character sequences are variable length. Indeed, one weakness of the
105actual implementation is that it reverts to byte-at-a-time processing
106at block boundaries, employing a block shortening strategy to
107reduce block lengths to as low as 125 bytes from 128.
108With more capable SIMD instruction set architectures providing
109better facilities for multiregister shifts and larger register
110files, a solution employing SIMD techniques for virtually all
111of the main work would maintain better data alignment, avoid
112problems of register pressure and be easier to parallelize across
113multiple cores.  It should also naturally scale to 256-bit SIMD technology
114such as the expected AVX technology of Intel.
115
116The report addressing the broader problem of XML parsing is perhaps more
117interesting, demonstrating the utility of parallel bit stream techniques
118in delivering performance benefits through a significant portion of the
119web technology stack.  In an XML statistics gathering application,
120including the implementation of XML well-formedness checking, an
121overall 3X to 10X performance improvement is achieved in using
122the parallel bit stream methods in comparison with a similarly
123coded application using such well known parsers as Expat and Xerces.
124Although still a work in progress, the parser has functional
125coverage of XML and related specifications comparable to, but somewhat
126beyond Expat.   The study also identifies promising further
127work in extending the parallel bit stream methods to parallel
128hash value computation and parallel regular expression matching
129for the purpose of validating XML datatype declarations in
130accord with XML Schema.
131
132Given these promising initial results in the application of
133parallel bit stream methods, what role might architectural support play in
134further enhancing this route to parallelization of text processing?
135This paper addresses this question through presentation and
136analysis of a constructive proposal: a set of SIMD instruction set
137features based on the principle of systematic support for
138inductive doubling algorithms.  Inductive doubling refers
139to a general property of certain kinds of algorithm that
140systematically double the values of field widths or other
141data attributes with each iteration.  In essence, the goal
142of the proposed features is to support such algorithms
143with specific facilities to  transition between successive power-of-2 field
144widths.   These transitions are quite frequent in several critical
145algorithms for parallel bit streams.   These transitions also
146occur in other applications as well.  In related work,
147efficient automatic interleaving based on power-of-2 strides has been
148reported quite useful for a number of SIMD kernels \cite{Nuzman}.
149The specific features presented herein will be referred to
150as IDISA: inductive doubling instruction set architecture.
151
152To assess the benefits of IDISA features in comparison to
153those found in existing SIMD support on commodity
154processors, we will focus on an idealized three-register
155model of SIMD instruction sets and evaluation of
156parallel bit stream and other computational kernels with
157respect to these features.  The three-register model is
158based on the general approach to binary SIMD operations
159as ones that operate on the contents of two source operand
160registers to produce a value to be written back to a
161single destination operand register.  This three-register
162model is directly used by the Altivec SIMD instructions
163of the Power PC, for example.  On the Intel SSE platform,
164the three-register model is used as a programmer's
165interface for the C-language intrinsics, while the
166underlying instructions use a two-register model with
167destructive updating.  The computational kernels we study consist
168of 100\% branch-free code operating on registers, without
169memory access   For such kernels, we use straight-line
170instruction count as the performance metric of interest,
171assuming that pipelined processors are well-designed
172to handle latencies. 
173
174
175The remainder of this paper is organized as follows.
176
177The second section of this paper introduces IDISA and the
178SIMD notation used throughout this paper.  A brief comparison of
179IDISA features with existing SIMD instruction
180sets of commodity processors such as the Intel SSE
181instruction set and the Power PC Altivec instruction set
182is also made.
183
184The third section provides a short first example of
185the inductive doubling principle in action through
186the case of population count.   Although this operation
187is not a strong determinant of performance for parallel bit
188stream applications, it is nevertheless an operation needed
189frequently enough in the general computing milieux to find
190its way into some instruction set architectures, typically
191at one particular field width.  By way of comparison, the
192inductive doubling architecture sacrifices some
193performance at the chosen field width, while offering a more
194general solution with frequently better performance at
195other field widths.
196
197The fourth section then moves on to consider the performance-critical
198and key task of conversion between serial byte streams and parallel
199bit streams.  A first algorithm that uses the existing SIMD
200operations common to SSE and Altivec is shown, requiring 72
201operations to transform 128 bytes of data using the three-register
202instruction form.  We then move on to consider how the task may
203be simplified using IDISA to
204require a mere 24 operations.  As well as providing a 3X speed-up,
205it is also argued that the version using the inductive doubling
206architecture is considerably simpler and easier to program.
207
208The fifth section then briefly considers the inverse transposition
209process, converting parallel bit stream data back into byte
210streams.  Again, an algorithm to carry out this task requires
21172 straight-line SIMD operations in the Altivec three-register
212form, but is reduced to a simpler 24 operations using IDISA.
213
214The sixth section then goes on to consider the problem of
215parallel bit deletion.  This operation is performance-critical
216to any applications that require filtering or
217editing operations on strings using the parallel bit stream
218algorithms.   For example, it is fundamental to the
219high-speed UTF-8 to UTF-16 transcoding algorithm that is
220often a critical component in XML parsing.  In this
221section, an inductive doubling algorithm based on the
222concept of a central deletion result is described and
223shown to have much better performance than a parallel-prefix
224alternative.
225
226The seventh section considers the issue of
227horizontal SIMD operations, that is, operations for combining
228sets of adjacent fields within individual SIMD registers rather than
229corresponding fields within sets of registers.  While existing
230SIMD instruction set architectures tend to only support a few
231ad hoc horizontal combinations, IDISA is shown to provide a systematic
232means for efficient horizontal combinations of any kind.
233
234An implementation model for IDISA is then considered
235in section 8 of the paper, focusing on a pipelined
236SIMD architecture featuring a modified operand fetch stage.
237A gate-count analysis of one feasible implementation is
238provided as well as a discussion of the implementation
239of required extensions to handle 2-bit and 4-bit fields not
240commonly supported on existing SIMD architectures.   Design
241tradeoffs are also considered focusing the potential removal of
242various {\em ad hoc} instructions on existing processors in favor of
243more general alternatives provided through IDISA.
244
245The ninth section concludes the paper with a summary of results
246and discussion of areas for future work.
247
248
249\section{Inductive Doubling Architecture}
250
251This section presents an idealized model for a single-instruction
252multiple-data (SIMD) instruction set architecture designed
253specifically to support inductive doubling algorithms in the
254domain of parallel bit stream programming.   The architecture is idealized
255in the sense that we concentrate on only the necessary features
256for our purpose, without enumerating the additional
257operations that would be required for
258SIMD applications in other domains.  The goal is to focus
259on the principles of inductive doubling support in a way
260that can accommodate a variety of realizations as other
261design constraints are brought to bear on the overall instruction set
262design.  First we introduce a simple model and notation for
263SIMD operations in general and then present the four key
264features of an idealized architecture in support of parallel
265bit stream programming.
266
267The idealized architecture supports typical SIMD integer
268operations common to existing commodity architectures such as SSE
269and Altivec.   The architecture is defined to support SIMD
270operations on registers of size $N=2^K$ bits, for some integer $K$.
271Typical values of $K$ for commodity processors include $K=6$ for
272the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
273the 128-bit registers of SSE and Altivec technology and $K=8$ for
274the upcoming Intel AVX technology.   The registers may be
275partitioned into $N/n$ fields of width $n=2^k$ bits for some values
276of $k \leq K$.   Typical values of $k$ used on commodity processors
277include $k = 3$ for SIMD operations on 8-bit fields (bytes),
278$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
279on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit
280fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$
281Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
282register $r$, using big-endian numbering.
283
284Let \verb:simd<n>: represent the class of SIMD operations defined
285on fields of size $n$ using C++ template syntax.  Given a
286binary function $F_n$ on $n$-bit fields, we denote the SIMD
287version of this operation as \verb#simd<n>::F#.  Given two
288SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
289respectively, the operation \verb#r=simd<n>::F(a,b)# stores
290the value $r$ in the register \verb:r: as determined by
291the simultaneous calculation of individual field values in
292accord with the following equation.
293\[ r_i = F_n(a_i, b_i) \]
294
295For example, addition(\verb:add:), subtraction (\verb:sub:) and shift left
296logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
297integers as follows.
298%\singlespace
299\begin{eqnarray}
300\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
301\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
302\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
303\end{eqnarray}
304%\doublespace
305The SSE and Altivec instruction sets support
306each of the addition and subtraction operations in SIMD form
307for 8, 16 and 32-bit fields, while the SSE set also includes
308the 64-bit forms.  For example, \verb#simd<8>::add# in our
309nomenclature is provided by the operation \verb:paddb:
310on SSE and the operation \verb:vaddubm: on Altivec.
311The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
312\verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are
313more constrained, requiring that all field shifts by the same amount.
314
315Given these definitions and notation, we now present
316the four key elements of an inductive doubling architecture.
317The first is a definition of a core set of binary functions
318on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
319The second is a set of {\em half-operand modifiers} that allow
320the inductive processing of fields of size $2n$ in terms of
321combinations of $n$-bit values selected from the fields.
322The third is the definition of packing operations that compress
323two consecutive registers of $n$-bit values into a single
324register of $n/2$-bit values.  The fourth is the definition
325of merging operations that produce a set of $2n$ bit fields
326by concatenating corresponding $n$-bit fields from two
327parallel registers.  Each of these features is described below.
328
329For the purpose of direct and efficient support for inductive
330doubling algorithms on bit streams, the provision of
331a core set of operations at field widths of 2 and 4 as
332well as the more traditional field witdhs of 8, 16 and 32
333is critical for elegant and efficient expression of the
334algorithms.  In essence, inductive doubling algorithms
335work by establishing some base property at either single
336or 2-bit fields.  Each iteration of the algorithm then
337goes on to establish the property for the power-of-2
338field width.  In order for this inductive step to be
339most conveniently and efficiently expressed, the
340core operations needed for the step should be available
341at each field width.  In the case of work with parallel
342bit streams, the operations \verb:add:, \verb:sub:,
343\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
344(rotate left) comprise the core.  In other domains,
345additional operations may be usefully included in the
346core depending on the work that needs to be performed
347at each inductive doubling level.
348
349Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
350also includes fields of width 1.  These are included for
351logical consistency, but are easily implemented by mapping
352directly to appropriate bitwise logic operations, which
353we assume are also available.  For example,
354\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
355bitwise exclusive-or operation on SIMD registers.
356
357The second key facility of the inductive doubling architecture
358is the potential application of half-operand modifiers to
359the fields of either or both of the operands of a SIMD
360operation.  These modifiers select either the
361low $n/2$
362bits of each $n$-bit field (modifier ``\verb:l:'') or the
363high $n/2$ bits (modifier ``\verb:h:'').  When required,
364the modifier ``\verb:x:'' means that the full $n$ bits
365should be used, unmodified.  The semantics of these
366modifiers are given by the following equations.
367%\singlespace
368\begin{eqnarray}
369l(r_n) & = & r_n \bmod 2^{n/2} \\
370h(r_n) & = & r_n / 2^{n/2} \\
371x(r_n) & = & r_n
372\end{eqnarray}
373%\doublespace
374In our notation, the half-operand modifiers are
375specified as optional template (compile-time) parameters
376for each of the binary functions.  Thus,
377\verb#simd<4>::add<l,h>(a,b)# is an operation which adds
378the 2-bit quantity found in the low 2-bits of each 4-bit field
379of its first operand (\verb:a:)
380together with the corresponding 2-bit quantity found in the
381high 2-bits of its second operand (\verb:b:).
382In general, the purpose of the half-operand modifiers
383in support of inductive doubling is to allow the processing
384of $n$-bit fields to easily expressed in terms of
385combination of the results determined by processing
386$n/2$ bit fields.
387
388The third facility of the inductive doubling architecture
389is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
390The field values of \verb#r=simd<n>::pack(a,b)# are
391defined by the following equations.
392%\singlespace
393\begin{eqnarray}
394r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
395r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
396\end{eqnarray}
397%\doublespace
398Here conv is a function which performs conversion of an $n$-bit
399value to an $n/2$ bit value by signed saturation (although
400conversion by unsigned saturation would also suit our purpose).
401
402Half-operand modifiers may also be used with the pack
403operations.  Thus packing with conversion by masking off all
404but the low $n/2$ bits of each field may be
405be performed using the operation \verb#simd<n>::pack<l,l>#
406
407
408The final facility of the inductive doubling architecture is
409a set of merging operations \verb#simd<n>::mergeh# and
410\verb#simd<n>::mergel# that produce $2n$-bit fields
411by concatenating corresponding $n$-bit fields from the
412operand registers.   The respective
413operations \verb#r=simd<n>::mergeh(a,b)# and
414\verb#s=simd<n>::mergel(a,b)#
415are defined by the following equations.
416%\singlespace
417\begin{eqnarray}
418r_{2n}[i] & = & a[i] \times 2^n + b[i] \\
419s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)]
420\end{eqnarray}
421%\doublespace
422Both SSE and Altivec provide versions of pack and merge operations
423for certain field widths.  The pack operations are provided
424with operands having 16-bit or 32-bit fields on each platform, although
425with some variation in how conversion is carried out.
426The merge operations are provided at 8-bit, 16-bit and 32-bit
427field widths on both architectures and also at the 64-bit level
428on SSE.
429
430This completes the description of the proposed inductive
431doubling architecture.  As described, many of the features
432are already available with the SIMD facilities of
433existing commodity processors.  The extensions enumerated
434here are relatively straightforward.  The innovation
435is to specifically tackle the design of facilities to
436offer systematic support for transitions between power-of-2 field widths.
437As we shall show in the remainder of this paper, these facilities
438can dramatically reduce instruction count in core parallel bit
439stream algorithms, with a factor of 3 reduction being typical.
440The next section goes on to illustrate a first simple example
441for the task of population count.
442
443\section{Population Count}
444
445\begin{figure}
446\begin{center}\small
447\begin{verbatim}
448c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
449c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
450c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
451c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
452c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
453\end{verbatim}
454\end{center}
455\caption{Population Count Reference Algorithm}
456\label{HD-pop}
457\end{figure}
458
459\begin{figure}
460\begin{center}\small
461\begin{verbatim}
462c = simd<2>::add<l,h>(x, x);
463c = simd<4>::add<l,h>(c, c);
464c = simd<8>::add<l,h>(c, c);
465c = simd<16>::add<l,h>(c, c);
466c = simd<32>::add<l,h>(c, c);
467\end{verbatim}
468\end{center}
469\caption{Ideal SIMD Population Count}
470\label{inductivepopcount}
471\end{figure}
472
473As an initial example to illustrate the principle of inductive doubling
474in practice, consider the problem of {\em population count}: determining
475the number of one bits within a particular bit field.  It is important
476enough for such operations as
477calculating Hamming distance to be included as a built-in instruction
478on some processors.  For example, the SPU of the Cell Broadband Engine
479has a SIMD population count instruction \verb:si_cntb: for simultaneously
480determining the
481number of 1 bits within each byte of a 16-byte register.
482In text processing with parallel bit streams, population count has direct
483application to keeping track of line numbers for error reporting, for example.
484Given a bit block identifying the positions of newline characters within
485a block of characters being processed, the population count of the
486bit block can be used to efficiently and conveniently be used to update
487the line number upon completion of block processing.
488
489Figure \ref{HD-pop} presents a traditional divide-and-conquer
490implementation for a 32-bit integer {\tt x} adapted from
491Warren \cite{HackersDelight}, while
492Figure \ref{inductivepopcount} shows the corresponding SWAR
493implementation for a vector of 32-bit fields using the inductive doubling
494instruction set architecture.  Each implementation employs
495five steps of inductive doubling to produce population counts
496within 32 bit fields.  The traditional implementation employs
497explicit masking and shifting operations, while these
498operations are implicit within the semantics of the inductive
499doubling instructions used in Figure \ref{inductivepopcount}.
500In each implementation, the first step determines the
501the population counts within 2-bit fields
502by adding the high bit of each such field to the low bit
503to produce a set of 2-bit counts in {\tt c}.
504In the second step, the counts within 4-bit fields of {\tt c} are determined
505by adding the counts of the corresponding high and low 2-bit subfields.
506Continuing in this fashion,
507the final population counts within 32-bit fields are determined in five steps.
508
509With the substitution of longer mask constants replicated for four
51032-bit fields, the implementation of Figure \ref{HD-pop} can be
511easily adapted to SWAR processing using 128-bit registers.
512Such an implementation requires 10 operations to load or generate
513mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
514total of 30 operations to complete the task in comparison
515to a mere   However, Warren further refines the
516implementation to an optimized version requiring only 20 operations,
5175 of which are required to generate mask constants.  At the cost
518of register pressure, it is possible that these constants
519could be kept preloaded.  In any case, the IDISA implementation
520offers a 3X to 4X improvement in instruction count as well as
521a reduction in register pressure.
522
523
524
525
526\section{Transposition to Parallel Bit Streams}
527
528In this section, we consider the first major
529application of the inductive doubling architecture:
530transposition of byte stream data to parallel bit stream
531form.   Of course, this operation is critical to the
532method of parallel bit streams and all applications
533of the method can benefit from a highly efficient
534transposition process.  Before considering how
535the inductive doubling architecture supports this
536transposition process, however, we first consider
537algorithms on existing architectures.  Two algorithms
538are presented; the best of these requires 72
539SIMD operations in the three-register model to perform
540transposition of eight serial registers of byte stream data into
541eight parallel registers of bit stream data.
542
543We then go on to show how the transposition problem
544can be solved using the inductive doubling architecture
545with a mere 24 three-register SIMD operations.  We also prove
546that this is optimal for any three-register instruction set model.
547
548
549\begin{figure}[tbh]
550\begin{center}
551\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
552\caption{Input/Output Model for Serial to Parallel Transposition}
553\label{s2p-spec}
554\end{center}
555
556\end{figure}
557Figure \ref{s2p-spec} illustrates the input-output requirements of
558the transposition problem.  We assume that inputs and
559outputs are each SIMD registers of size $N=2^K$ bits.
560The input consists of $N$ bytes of serial byte data,
561stored consecutively in eight SIMD registers each holding
562$N/8$ bytes.  The output consists of eight parallel
563registers, one each for the eight individual bit positions
564within a byte.  Upon completion of the transposition process,
565each output register is to hold the $N$ bits corresponding
566to the selected bit position in the sequence of $N$ input
567bytes.
568
569\subsection{Bit Gathering Algorithm}
570
571\begin{figure}[tbh]
572\begin{center}
573\includegraphics[width=90mm, trim= 50 100 0 0]{S2P.pdf}
574\caption{Serial to Parallel Transposition Using Bit-Gathering}
575\label{gather}
576\end{center}
577\end{figure}
578One straightforward algorithm for implementing the transposition process
579takes advantage of SIMD bit gathering operations that exist
580on some architectures.  This operation gathers one bit per byte
581from a particular position within each byte of a SIMD register.
582For example, the {\tt pmovmskb} operation of the Intel MMX and
583SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
584consisting of the high bit of each byte.  Similarly, the
585{\tt \verb:si_gbb:} operation of the synergistic processing units of the
586Cell Broadband Engine gathers together the low bit of each byte
587from the SIMD register.  Figure \ref{gather} illustrates the
588bit gathering process.
589
590For each bit stream of output, the bit gather algorithm requires
591one gather operation for each of the 8 input registers,
592giving 64 bit gather operations in all.  In addition, for seven
593of the eight bit positions, it is necessary to shift the bits
594of each input register into the conventional position for
595gathering.  A total of 56 shift operations are required for this
596task.  Finally, the result of each bit gather operation must
597be properly inserted into the output stream.  If the first
598gather operation for a stream can be used to also initialize
599the output register, there will then need to be 7 insert
600operations for the results of the remaining gather operations
601for this stream, making 56 insert operations in all.
602The insert step may be more complex than a single operation
603in some cases, but we use one operation per insert as a lower bound.
604Thus, the bit gather algorithm requires
605at least 176 operations to perform the transposition task.
606
607\subsection{BytePack Algorithm}
608
609A much more efficient transposition algorithm on commodity
610SIMD architectures (SSE and Altivec) involves three
611stages of binary division transformation.  This is similar
612to the three stage bit matrix inversion described by
613Warren  \cite{HackersDelight}, although modified to use SIMD operations.
614In each stage, input streams are divided into two half-length output streams.
615The first stage separates the bits at even numbered positions from those
616at odd number positions.  The two output streams from the first
617stage are then further divided in the second stage.
618The stream comprising even numbered bits from the original byte stream
619divides into one stream consisting of bits from positions 0 and 4 of each
620byte in the original stream and a second stream consisting of bits
621from positions 2 and 6 of each original byte.  The stream of bits from
622odd positions is similarly divided into streams for bits from Each of the
623positions 1 and 5 and bits from positions 2 and 6.
624Finally, each of the four streams resulting from the second stage are
625divided into the desired individual bit streams in the third stage.
626
627\begin{figure}[tbh]
628\begin{center}
629\begin{verbatim}
630t0 = simd<16>::pack<h,h>(s0, s1);
631t1 = simd<16>::pack<l,l>(s0, s1);
632p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1));
633p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1);
634\end{verbatim}
635\end{center}
636\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
637\label{s2pstep}
638\end{figure}
639
640The binary division transformations are accomplished in each stage
641using byte packing, shifting and masking.  In each stage, a
642transposition step combines each pair of serial input registers to
643produce a pair of parallel output registers.  In essence,
644doublebytes from the input registers are packed into bytes
645in the output registers, with the bits from even positions stored
646in the bytes of one output stream and the bits from odd positions
647stored in the bytes of the second output stream.
648Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
649producing two parallel registers \verb:p0: and \verb:p1: from
650two serial registers \verb:s0: and \verb:s1:.  This step is applied
651four times in stage 1; stages 2 and 3 also consist of four applications
652of a similar step with different shift and masking constants.
653
654Although we have used the idealized SIMD notation here, each of the
655operations maps to a single operation in the Altivec set and a small number
656of operations in the SSE set.  Using the Altivec set, there are
6576 operations per step for a total of 24 operations per stage. 
658The three stages combined required 72 operations to transpose 128 bytes
659to parallel bit stream form.  This is the best algorithm known to
660us for existing SIMD architectures.
661
662\subsection{Inductive Halving Algorithm}
663
664Using the inductive doubling architecture, it is possible to design
665a transposition algorithm that is both easier to understand and requires
666many fewer operations than the the bytepack algorithm described above.
667We call it the inductive halving algorithm for serial to parallel
668transposition, because it proceeds by reducing byte streams to
669two sets of nybble streams in a first stage, dividing the nybble
670streams into streams of bitpairs in a second stage and finally
671dividing the bitpair streams into bit streams in the third stage.
672
673
674\begin{figure}[tbh]
675\begin{verbatim}
676p0 = simd<8>::pack<h,h>(s0, s1);
677p1 = simd<8>::pack<l,l>(s0, s1);
678\end{verbatim}
679\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
680\label{halvingstep}
681\end{figure}
682
683Figure \ref{halvingstep} shows one step in stage 1 of the inductive
684halving algorithm, comprising just two SIMD operations.
685The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
686from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
687the low nybble of each byte.  As in the bytepack algorithm, this step is
688applied 4 times in stage 1, for a total of 8 operations.
689
690Stage 2 of the inductive halving algorithm reduces nybble streams
691to streams of bit pairs.  The basic step in this algorithm consists
692of one \verb#simd<4>::pack<h,h># operation to extract the high pair
693of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
694low pair of each nybble.  Four applications of this step complete stage 2.
695
696Stage 3 similarly uses four applications of a step that uses a
697\verb#simd<2>::pack<h,h># operation to extract the high bit of
698each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
699each pair.  The complete algorithm to transform eight serial
700byte registers s0 through s7 into the eight parallel bit stream
701registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
702
703\begin{figure}[tbh]
704\begin{verbatim}
705hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
706lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
707hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
708lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
709hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
710lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
711hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
712lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
713hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
714hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
715lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
716ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
717hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
718hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
719lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
720ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
721bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
722bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
723bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
724bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
725bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
726bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
727bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
728bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
729\end{verbatim}
730\caption{Complete Inductive Halving Algorithm}
731\label{halvingalgorithm}
732\end{figure}
733
734\subsection{Optimality of the Inductive Halving Algorithm}
735
736Here we show that the inductive halving algorithm presented in
737the previous subsection is optimal in the following sense:
738no other algorithm on any 3-register SIMD architecture can use
739fewer than 24 operations to transform eight serial registers
740of byte stream data into eight parallel registers of bit stream data.
741By 3-register SIMD architecture, we refer to any architecture
742that uses SIMD instructions consistent with our overall model of
743binary operations using two input register operands to produce
744one output register value.
745
746Observe that the $N$ data bits from each input register must be
747distributed $N/8$ each to the 8 output registers by virtue of
748the problem definition.  Each output register can effectively
749be given a 3-bit address; the partitioning problem can be viewed
750as moving data to the correct address.  However, each
751operation can move results into at most one register. 
752At most this can result in the assignment of one correct address
753bit for each of the $N$ input bits.  As all $8N$ input bits
754need to be moved to a register with a correct 3-bit address,
755a minimum of 24 operations is required.
756
757\section{Parallel to Serial Conversion}
758
759Parallel bit stream applications may apply string editing
760operations in bit space to substitute, delete or insert
761parallel sets of bits at particular positions.  In such cases,
762the inverse transform that converts a set of parallel bit
763streams back into byte space is needed.  In the example of
764UTF-8 to UTF-16 transcoding, the inverse transform is
765actually used twice for each application of the forward
766transform, to separately compute the high and low byte
767streams of each UTF-16 code unit.  Those two byte streams
768are subsequentely merged to form the final result.
769
770Algorithms for performing the inverse transform mirror those
771of the forward transform, employing SIMD merge operations
772in place of pack operations.  The best algorithm known
773to us on the commodity SIMD architectures takes advantage
774of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
775operations that are available with each of the SSE and Altivec instruction
776sets.  These algorithms take 72 operations to perform the
777inverse transposition of 8 parallel registers of bit stream
778data into 8 serial registers of byte stream data. 
779
780\begin{figure}[tbh]
781\begin{center}
782\begin{verbatim}
783bit01_r0 = simd<1>::mergeh(bit0, bit1);
784bit01_r1 = simd<1>::mergel(bit0, bit1);
785bit23_r0 = simd<1>::mergeh(bit2, bit3);
786bit23_r1 = simd<1>::mergel(bit2, bit3);
787bit45_r0 = simd<1>::mergeh(bit4, bit5);
788bit45_r1 = simd<1>::mergel(bit4, bit5);
789bit67_r0 = simd<1>::mergeh(bit6, bit7);
790bit67_r1 = simd<1>::mergel(bit6, bit7);
791bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
792bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
793bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
794bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
795bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
796bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
797bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
798bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
799s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
800s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
801s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
802s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
803s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
804s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
805s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
806s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
807\end{verbatim}
808\end{center}
809\label{p2s-inductive}
810\caption{Parallel to Serial Transposition by Inductive Doubling}
811\end{figure}
812
813An algorithm employing only 24 operations using the
814inductive doubling instruction set architecture is relatively
815straightforward.. In stage 1, parallel registers for individual bit streams
816are first merged with bit-level interleaving
817using \verb#simd<1>::mergeh# and \verb#simd<8>::mergel#
818operations.  For each of the four pairs of consecutive
819even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
820two consecutive registers of bitpair data are produced.
821In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
822are then applied to merge to bitpair streams to produce streams
823of nybbles for the high and low nybble of each byte.  Finally,
824the nybble streams are merged in stage 3 to produce the
825desired byte stream data.  The full inductive doubling
826algorithm for parallel to serial transposition is shown in Figure
827\ref{p2s-inductive}.
828
829This algorithm is again optimal, requiring the fewest operations
830of any possible algorithm using any 3-register instruction set
831model.  The proof directly follows that for the serial to parallel
832transposition.
833
834The existence of high-performance algorithms for transformation of
835character data between byte stream and parallel bit stream form
836in both directions makes it possible to consider applying these
837transformations multiple times during text processing applications.
838Just as the time domain and frequency domain each have their
839use in signal processing applications, the byte stream form and
840parallel bit stream form can then each be used at will in
841character stream applications.
842
843
844
845
846
847\section{Parallel Bit Deletion}
848
849Parallel bit deletion is an important operation that allows string
850editing operations to be carried out while in parallel bit stream
851form.  It is also fundamental to UTF-8 to UTF-16 transcoding
852using parallel bit streams, allowing the excess code unit
853positions for UTF-8 two-, three- and four-byte sequences to
854be deleted once the sixteen parallel bit streams of UTF-16 have
855been computed \cite{PPoPP08}.
856
857Parallel bit deletion is specified using a deletion mask.
858A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
859to be deleted and 0s at positions identifying bits to be retained.
860For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
861bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
862accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
863
864Bit deletion may be performed using
865the parallel-prefix compress algorithm documented by
866Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
867only logic and shifts with a constant parameter to carry
868out the deletion process.  However, it requires $k^2$
869preprocessing steps for a final field width parameter
870of size $2^k$, as well as 4 operations per deletion step
871per stream.  Using the inductive doubling instruction set architecture
872it is possible to carry out bit deletion much more efficiently.
873
874Deletion within fixed size fields or registers may produce results that are either
875left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
876eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
877care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
878left-justified result produces an important intermediate form known as a
879{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
880right-justified and left-justified results from the application of the
8814-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
882stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
883the central result \verb:xxbdefhx:, which may easily be converted to a either a
884left or a right justified 8-element result by an appropriate shift operation.
885
886\begin{figure}
887\begin{center}
888\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
889\hline
890\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
891\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
892\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
893\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
894\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
895\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
896\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
897\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
898\end{tabular}
899\end{center}
900\label{centraldelstep}
901\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
902\end{figure}
903
904The observation about how two $n$-bit central deletion results can
905combine to yield a $2n$ central deletion result provides the basis
906for an inductive doubling algorithm.  Figure \ref{centraldelstep}
907illustrates the inductive process for the transition from 8-bit central
908deletion results to 16-bit central deletion results.  The top row shows
909the original deletion mask, while the second row shows the original
910bit stream to which deletions are to be applied, with deleted bits zeroed out.
911The third row shows the central result for each 8-bit field as the
912result of the previous inductive step.
913
914To perform the $8 \rightarrow 16$ central deletion step, we first form
915the population counts of 4-bit fields of the original deletion mask as
916shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
917an 8-bit central result, we perform a right shift by the population count
918of the low half of the field.  Similarly,
919left-justification requires a left-shift by the population count in the
920high half of the field.
921
922The left and right shifts can be performed simultaneously using a rotate
923left instruction.  Right justification by shifting an $n$ bit field
924$i$ positions to the right is equivalent to a left rotate of $n-i$
925positions.  These rotation amounts are computed by the operation \newline
926\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
927except that don't care fields (which won't be subsequently used)
928are marked \verb:XX:. 
929
930The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
931as shown in row 6, and are combined with the right shift amounts
932by the selection operation \newline \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
933as shown in row 7.  Using these computed values, the inductive step
934is completed by application of the operation \newline \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
935as shown in row 8.
936
937At each inductive doubling level, it requires 4 operations to compute the
938required deletion infomation and one operation per bit stream to perform deletion.
939Note that, if deletion is to be applied to a set of eight parallel bit streams,
940the computed deletion information is used for each stream without recomputation,
941thus requiring 12 operations per inductive level.
942
943In comparison to the parallel-prefix compress method, the method of central
944deletion results using the inductive doubling architecture has far fewer operations.
945The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
946induction versus $4k^2$ for the parallel-prefix method.  Using the computed
947deletion operation, only a single SIMD rotate operation per bit stream
948per level is needed, in comparison with 4 operations per level for parallel-prefix
949compress.
950
951
952\section{Systematic Support for Horizontal SIMD Operations}
953
954In SIMD parlance, {\em horizontal} operations are
955operations which combine values from two or more fields
956of the same register, in contrast to the normal
957{\em vertical} operations which combine corresponding
958fields of different registers.  Horizontal operations
959can be found that combine two (e.g., haddpd on SSE3),
960four (e.g, \verb:si_orx: on SPU), eight (e.g, psadbw on SSE)
961or sixteen values (e.g., vcmpequb on Altivec).  Some
962horizontal operations have a vertical component as well.
963For example, psadbw first forms the absolute value of
964the difference of eight corresponding byte fields before
965performing horizontal add of the eight values, while
966vsum4ubs on Altivec performs horizontal add of sets of
967four unsigned 8-bit fields within one register
968and then combines the result horizontally with
969corresponding 32-bit fields of a second registers.
970
971The space of potential horizontal operations thus has
972many dimensions, including not only the particular
973combining operation and the operand field width, but
974also the number of fields being combined, whether a
975vertical combination is applied and whether it is applied
976before or after the horizontal operation and what the
977nature of the vertical combining operation is.
978Within this space, commodity SIMD architectures tend
979to support only a very few combinations, without any
980particular attempt at systematic support for horizontal
981operations in general.
982
983By making use of \verb:<l,h>: half-operand modifier
984combinations, the inductive doubling architecture
985offers systematic support for horizontal operations
986on pairs of adjacent fields.
987For example, \verb#simd<16>::add<l,h># adds values
988in adjacent 8 bit fields to produce 16 bit results,
989while \verb#simd<32>::min<l,h># can produce the
990minimum value of adjacent 16-bit fields.  In general,
991\newline \verb#simd<n>::F<l,h># denotes the horizontal
992binary combination of adjacent fields for any
993operator $F$ and field width $n$.
994
995Horizontal combinations of larger numbers of fields
996makes use of the inductive doubling property.
997For example, consider the or-across operation \verb:si_orx:
998of the SPU, that performs a logical or operation
999on four 32-bit fields.  This four field combination
1000involves two steps in the inductive doubling approach.
1001%\begin{singlespace}
1002\begin{verbatim}
1003t = simd<64>::or<l,h>(x, x)
1004t = simd<128>::or<l,h>(t, t)
1005\end{verbatim}
1006%\end{singlespace}
1007This example is also interesting in showing a potential
1008value for supporting bitwise logical operations at
1009different field widths, i.e., specifically for use with
1010half-operand modifiers.
1011
1012Similarly, to combine any eight fields simply requires
1013three inductive doubling steps using the desired
1014operator at successive power-of-two field widths, while
1015combining sixteen fields requires four such operations.
1016In this way, the inductive doubling architecture provides
1017systematic support for horizontal operations well beyond
1018the existing facilities of commodity architectures,
1019although lacking some of the special features found in
1020some cases.
1021
1022
1023\section{Implementation}
1024
1025We have constructed libraries that provide
1026simulated implementation of the inductive doubling architecture
1027on each of the MMX, SSE, Altivec, and SPU platforms and have
1028used these libraries in the implementation of each of the
1029parallel bit stream algorithms discussed herein.
1030This implementation work has been successful in validating
1031the basic concepts underlying the inductive doubling instruction
1032set architecture.
1033
1034Implementation of the architecture on chip is beyond the
1035scope of our present resources and capabilities.  However,
1036the principal requirements are implementation of the various
1037operations at all power-of-2 field widths and implementation
1038of half-operand modifiers.  Implementation of SIMD operations
1039at additional field widths involves design trade-offs
1040with respect to transistor counts, available opcode space,
1041and the potential value of the new operations to SIMD
1042programmers.  From the perspective of parallel bit
1043stream programming, the primary need is for SIMD integer,
1044shift, pack and merge operations at field widths of 2, 4
1045and 8, as well as the field width of 1, where it makes
1046sense (e.g. with merge operations).  In support of the
1047general concept of inductive doubling architecture,
1048SIMD operations at large field widths (64, 128) are also
1049called for, but these operations cannot be justified on
1050the basis of parallel bit stream programming.
1051
1052Implementation of half-operand modifiers can logically
1053be carried out with additional circuitry attached to the
1054register fetch units of a pipelined processor.  This
1055circuitry would require control signals from the
1056instruction decode unit to identify the field widths
1057of operands and the particular half-operand modifier to be applied,
1058if any.  The additional logic required for instruction
1059decode and that required for operand modification
1060as part of the operand fetch process is expected to be
1061reasonably modest.
1062
1063Full assessment of implementation issues is an important
1064area for future work.
1065
1066
1067\section{Conclusions}
1068
1069This paper has considered the issue of architectural support for
1070SIMD text processing using the method of parallel bit streams and has
1071argued that this architectural support can best be provided
1072through a SIMD instruction set architecture that implements
1073features for direct support of inductive doubling algorithms.
1074Four key features of the inductive doubling architecture have
1075been identified include support for operations at all
1076power-of-2 field widths, half-operand modifiers and
1077pack and merge operations.  The principle innovation is the
1078notion of half-operand modifiers to support efficient
1079transition between successive power-of-two field widths.
1080
1081Several algorithms key to parallel bit stream methods
1082have been examined and shown to benefit from dramatic
1083reductions in instruction count compared to the best
1084known algorithms on existing architecture.  In the case
1085of transposition algorithms to and from parallel bit stream
1086form, the architecture has been shown to make possible
1087straightforward inductive doubling algorithms with the
1088lowest total number of operations that can be achieved by any
1089possible 3-register SIMD architecture.
1090
1091The inductive doubling architecture also has considerable
1092benefits beyond its role in supporting SIMD programming
1093with parallel bit streams.  Notable among these is
1094that the architecture provides a framework for systematic
1095support of horizontal SIMD operations.
1096
1097
1098
1099%\appendix
1100%\section{Appendix Title}
1101%
1102%This is the text of the appendix, if you need one.
1103
1104\acks
1105
1106This research was supported by a Discovery Grant from the
1107Natural Sciences and Engineering Research Council of Canada.
1108
1109%\bibliographystyle{plainnat}
1110\bibliographystyle{plain}
1111\bibliography{asplos094-cameron}
1112%\begin{thebibliography}{}
1113
1114%\bibitem{smith02}
1115%Smith, P. Q. reference text
1116
1117%\end{thebibliography}
1118
1119
1120\end{document}
Note: See TracBrowser for help on using the repository browser.