# source:docs/ASPLOS09/asplos094-cameron.tex@234

Last change on this file since 234 was 234, checked in by cameron, 11 years ago

File size: 64.6 KB
Line
1%-----------------------------------------------------------------------------
2%
3% asplos094-cameron.tex
4% Robert D. Cameron and Dan Lin
5%
6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
7%
8%-----------------------------------------------------------------------------
9\input epsf
10
11%\documentclass[preprint,natbib,10pt]{sigplanconf}
12%\documentclass[natbib,10pt]{sigplanconf}
13\documentclass[10pt]{sigplanconf}
14
15\usepackage{amsmath}
16\usepackage{graphicx}
17
18\begin{document}
19
20\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
23
24
25\titlebanner{banner above paper title}        % These are ignored unless
26\preprintfooter{short description of paper}   % 'preprint' option specified.
27
28\title{Architectural Support for SWAR Text Processing with Parallel Bit
29Streams: The Inductive Doubling Principle}
30%\subtitle{Subtitle Text, if any}
31
32\authorinfo{Robert D. Cameron \and Dan Lin}
33           {School of Computing Science, Simon Fraser University}
34           {\{cameron, lindanl\}@cs.sfu.ca}
35
36
37\maketitle
38
39\begin{abstract}
40Parallel bit stream algorithms exploit the SWAR (SIMD within a
41register) capabilities of commodity
42processors in high-performance text processing applications such as
43UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
44matching.   Direct architectural support for these algorithms in future SIMD
45instruction sets could further increase performance as well as simplifying the
46programming task. A set of simple SWAR instruction set extensions are proposed
47for this purpose based on the principle of systematic support for  inductive
48doubling as an algorithmic technique.  These extensions are shown to
49significantly reduce instruction count in core parallel bit stream algorithms,
50often providing a 3X or better improvement.  The extensions are also shown to be useful
51for SIMD programming in other application areas, including providing a
52systematic treatment for horizontal operations.  An implementation model for
53these extensions
54involves relatively simple circuitry added to the operand fetch components
55in a pipelined processor.
56\end{abstract}
57
58\category{CR-number}{subcategory}{third-level}
59
60\terms
61term1, term2
62
63\keywords
64keyword1, keyword2
65
66
67
68\section{Introduction}
69
70In the landscape of parallel computing research, finding ways to
71exploit intrachip (multicore) and intraregister (SWAR) parallelism
72for text processing and other non-numeric applications is particularly
73challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
74study \cite{Landscape} identifies the finite-state machine algorithms associated
75with text processing to be the hardest of the thirteen dwarves''
76to parallelize, concluding that nothing seems to help.  Indeed,
77the study even speculates that applications in this area may simply be
78embarrasingly sequential,'' easy to tackle for traditional
79sequential processing approaches suitable for uniprocessors,
80but perhaps fundamentally unsuited to parallel methods.
81
82One approach that shows some promise, however, is the
83method of parallel bit streams, recently applied to
84UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
85and amino acid sequencing\cite{Green}.
86In this method, byte-oriented character data is first transposed to eight
87parallel bit streams, one for each bit position within the character code
89then, allows data from 128 consecutive code units to be represented and
90processed at once.  Bitwise logic and shift operations, bit scans,
91population counts and other bit-based operations are then used to carry
92out the work.
93
94The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
95speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
96employing the SSE or Altivec instruction sets.  A full implementation
97for each of these platforms is documented using literate programming techniques
98and available as an open source code base \cite{u8u16}.
99Although this transcoding task represents but a single text processing kernel,
100it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
101of processing time\cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
102it illustrates a text processing task that can largely
103be carried out using SIMD operations even though
104UTF-8 character sequences are variable length. Indeed, one weakness of the
105actual implementation is that it reverts to byte-at-a-time processing
106at block boundaries, employing a block shortening strategy to
107reduce block lengths to as low as 125 bytes from 128.
108With more capable SIMD instruction set architectures providing
109better facilities for multiregister shifts and larger register
110files, a solution employing SIMD techniques for virtually all
111of the main work would maintain better data alignment, avoid
112problems of register pressure and be easier to parallelize across
113multiple cores.  It should also naturally scale to 256-bit SIMD technology
114such as the expected AVX technology of Intel.
115
117interesting, demonstrating the utility of parallel bit stream techniques
118in delivering performance benefits through a significant portion of the
119web technology stack.  In an XML statistics gathering application,
120including the implementation of XML well-formedness checking, an
121overall 3X to 10X performance improvement is achieved in using
122the parallel bit stream methods in comparison with a similarly
123coded application using such well known parsers as Expat and Xerces.
124Although still a work in progress, the parser has functional
125coverage of XML and related specifications comparable to, but somewhat
126beyond Expat.   The study also identifies promising further
127work in extending the parallel bit stream methods to parallel
128hash value computation and parallel regular expression matching
129for the purpose of validating XML datatype declarations in
130accord with XML Schema.
131
132Given these promising initial results in the application of
133parallel bit stream methods, what role might architectural support play in
134further enhancing this route to parallelization of text processing?
135This paper addresses this question through presentation and
136analysis of a constructive proposal: a set of SIMD instruction set
137features based on the principle of systematic support for
138inductive doubling algorithms.  Inductive doubling refers
139to a general property of certain kinds of algorithm that
140systematically double the values of field widths or other
141data attributes with each iteration.  In essence, the goal
142of the proposed features is to support such algorithms
143with specific facilities to  transition between successive power-of-2 field
144widths.   These transitions are quite frequent in several critical
145algorithms for parallel bit streams.   These transitions also
146occur in other applications as well.  In related work,
147efficient automatic interleaving based on power-of-2 strides has been
148reported quite useful for a number of SIMD kernels \cite{Nuzman}.
149The specific features presented herein will be referred to
150as IDISA: inductive doubling instruction set architecture.
151
152To assess the benefits of IDISA features in comparison to
153those found in existing SIMD support on commodity
154processors, we will focus on an idealized three-register
155model of SIMD instruction sets and evaluation of
156parallel bit stream and other computational kernels with
157respect to these features.  The three-register model is
158based on the general approach to binary SIMD operations
159as ones that operate on the contents of two source operand
160registers to produce a value to be written back to a
161single destination operand register.  This three-register
162model is directly used by the Altivec SIMD instructions
163of the Power PC, for example.  On the Intel SSE platform,
164the three-register model is used as a programmer's
165interface for the C-language intrinsics, while the
166underlying instructions use a two-register model with
167destructive updating.  The computational kernels we study consist
168of 100\% branch-free code operating on registers, without
169memory access   For such kernels, we use straight-line
170instruction count as the performance metric of interest,
171assuming that pipelined processors are well-designed
172to handle latencies.
173
174
175The remainder of this paper is organized as follows.
176
177The second section of this paper introduces IDISA and the
178SIMD notation used throughout this paper.  A brief comparison of
179IDISA features with existing SIMD instruction
180sets of commodity processors such as the Intel SSE
181instruction set and the Power PC Altivec instruction set
183
184The third section provides a short first example of
185the inductive doubling principle in action through
186the case of population count.   Although this operation
187is not a strong determinant of performance for parallel bit
188stream applications, it is nevertheless an operation needed
189frequently enough in the general computing milieux to find
190its way into some instruction set architectures, typically
191at one particular field width.  By way of comparison, the
192inductive doubling architecture sacrifices some
193performance at the chosen field width, while offering a more
194general solution with frequently better performance at
195other field widths.
196
197The fourth section then moves on to consider the performance-critical
198and key task of conversion between serial byte streams and parallel
199bit streams.  A first algorithm that uses the existing SIMD
200operations common to SSE and Altivec is shown, requiring 72
201operations to transform 128 bytes of data using the three-register
202instruction form.  We then move on to consider how the task may
203be simplified using IDISA to
204require a mere 24 operations.  As well as providing a 3X speed-up,
205it is also argued that the version using the inductive doubling
206architecture is considerably simpler and easier to program.
207
208The fifth section then briefly considers the inverse transposition
209process, converting parallel bit stream data back into byte
210streams.  Again, an algorithm to carry out this task requires
21172 straight-line SIMD operations in the Altivec three-register
212form, but is reduced to a simpler 24 operations using IDISA.
213
214The sixth section then goes on to consider the problem of
215parallel bit deletion.  This operation is performance-critical
216to any applications that require filtering or
217editing operations on strings using the parallel bit stream
218algorithms.   For example, it is fundamental to the
219high-speed UTF-8 to UTF-16 transcoding algorithm that is
220often a critical component in XML parsing.  In this
221section, an inductive doubling algorithm based on the
222concept of a central deletion result is described and
223shown to have much better performance than a parallel-prefix
224alternative.
225
226The seventh section considers the issue of
227horizontal SIMD operations, that is, operations for combining
228sets of adjacent fields within individual SIMD registers rather than
229corresponding fields within sets of registers.  While existing
230SIMD instruction set architectures tend to only support a few
231ad hoc horizontal combinations, IDISA is shown to provide a systematic
232means for efficient horizontal combinations of any kind.
233
234An implementation model for IDISA is then considered
235in section 8 of the paper, focusing on a pipelined
236SIMD architecture featuring a modified operand fetch stage.
237A gate-count analysis of one feasible implementation is
238provided as well as a discussion of the implementation
239of required extensions to handle 2-bit and 4-bit fields not
240commonly supported on existing SIMD architectures.   Design
241tradeoffs are also considered focusing the potential removal of
242various {\em ad hoc} instructions on existing processors in favor of
243more general alternatives provided through IDISA.
244
245The ninth section concludes the paper with a summary of results
246and discussion of areas for future work.
247
248
249\section{Inductive Doubling Architecture}
250
251This section presents an idealized model for a single-instruction
252multiple-data (SIMD) instruction set architecture designed
253specifically to support inductive doubling algorithms in the
254domain of parallel bit stream programming.   The architecture is idealized
255in the sense that we concentrate on only the necessary features
256for our purpose, without enumerating the additional
257operations that would be required for
258SIMD applications in other domains.  The goal is to focus
259on the principles of inductive doubling support in a way
260that can accommodate a variety of realizations as other
261design constraints are brought to bear on the overall instruction set
262design.  First we introduce a simple model and notation for
263SIMD operations in general and then present the four key
264features of an idealized architecture in support of parallel
265bit stream programming.
266
267The idealized architecture supports typical SIMD integer
268operations common to existing commodity architectures such as SSE
269and Altivec.   The architecture is defined to support SIMD
270operations on registers of size $N=2^K$ bits, for some integer $K$.
271Typical values of $K$ for commodity processors include $K=6$ for
272the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
273the 128-bit registers of SSE and Altivec technology and $K=8$ for
274the upcoming Intel AVX technology.   The registers may be
275partitioned into $N/n$ fields of width $n=2^k$ bits for some values
276of $k \leq K$.   Typical values of $k$ used on commodity processors
277include $k = 3$ for SIMD operations on 8-bit fields (bytes),
278$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
279on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit
280fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$
281Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
282register $r$, using big-endian numbering.
283
284Let \verb:simd<n>: represent the class of SIMD operations defined
285on fields of size $n$ using C++ template syntax.  Given a
286binary function $F_n$ on $n$-bit fields, we denote the SIMD
287version of this operation as \verb#simd<n>::F#.  Given two
288SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
289respectively, the operation \verb#r=simd<n>::F(a,b)# stores
290the value $r$ in the register \verb:r: as determined by
291the simultaneous calculation of individual field values in
292accord with the following equation.
293$r_i = F_n(a_i, b_i)$
294
296logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
297integers as follows.
298%\singlespace
299\begin{eqnarray}
300\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
301\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
302\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
303\end{eqnarray}
304%\doublespace
305The SSE and Altivec instruction sets support
306each of the addition and subtraction operations in SIMD form
307for 8, 16 and 32-bit fields, while the SSE set also includes
308the 64-bit forms.  For example, \verb#simd<8>::add# in our
309nomenclature is provided by the operation \verb:paddb:
310on SSE and the operation \verb:vaddubm: on Altivec.
311The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
312\verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are
313more constrained, requiring that all field shifts by the same amount.
314
315Given these definitions and notation, we now present
316the four key elements of an inductive doubling architecture.
317The first is a definition of a core set of binary functions
318on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
319The second is a set of {\em half-operand modifiers} that allow
320the inductive processing of fields of size $2n$ in terms of
321combinations of $n$-bit values selected from the fields.
322The third is the definition of packing operations that compress
323two consecutive registers of $n$-bit values into a single
324register of $n/2$-bit values.  The fourth is the definition
325of merging operations that produce a set of $2n$ bit fields
326by concatenating corresponding $n$-bit fields from two
327parallel registers.  Each of these features is described below.
328
329For the purpose of direct and efficient support for inductive
330doubling algorithms on bit streams, the provision of
331a core set of operations at field widths of 2 and 4 as
332well as the more traditional field witdhs of 8, 16 and 32
333is critical for elegant and efficient expression of the
334algorithms.  In essence, inductive doubling algorithms
335work by establishing some base property at either single
336or 2-bit fields.  Each iteration of the algorithm then
337goes on to establish the property for the power-of-2
338field width.  In order for this inductive step to be
339most conveniently and efficiently expressed, the
340core operations needed for the step should be available
341at each field width.  In the case of work with parallel
342bit streams, the operations \verb:add:, \verb:sub:,
343\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
344(rotate left) comprise the core.  In other domains,
345additional operations may be usefully included in the
346core depending on the work that needs to be performed
347at each inductive doubling level.
348
349Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
350also includes fields of width 1.  These are included for
351logical consistency, but are easily implemented by mapping
352directly to appropriate bitwise logic operations, which
353we assume are also available.  For example,
354\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
355bitwise exclusive-or operation on SIMD registers.
356
357The second key facility of the inductive doubling architecture
358is the potential application of half-operand modifiers to
359the fields of either or both of the operands of a SIMD
360operation.  These modifiers select either the
361low $n/2$
362bits of each $n$-bit field (modifier \verb:l:'') or the
363high $n/2$ bits (modifier \verb:h:'').  When required,
364the modifier \verb:x:'' means that the full $n$ bits
365should be used, unmodified.  The semantics of these
366modifiers are given by the following equations.
367%\singlespace
368\begin{eqnarray}
369l(r_n) & = & r_n \bmod 2^{n/2} \\
370h(r_n) & = & r_n / 2^{n/2} \\
371x(r_n) & = & r_n
372\end{eqnarray}
373%\doublespace
374In our notation, the half-operand modifiers are
375specified as optional template (compile-time) parameters
376for each of the binary functions.  Thus,
378the 2-bit quantity found in the low 2-bits of each 4-bit field
379of its first operand (\verb:a:)
380together with the corresponding 2-bit quantity found in the
381high 2-bits of its second operand (\verb:b:).
382In general, the purpose of the half-operand modifiers
383in support of inductive doubling is to allow the processing
384of $n$-bit fields to easily expressed in terms of
385combination of the results determined by processing
386$n/2$ bit fields.
387
388The third facility of the inductive doubling architecture
389is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
390The field values of \verb#r=simd<n>::pack(a,b)# are
391defined by the following equations.
392%\singlespace
393\begin{eqnarray}
394r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
395r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
396\end{eqnarray}
397%\doublespace
398Here conv is a function which performs conversion of an $n$-bit
399value to an $n/2$ bit value by signed saturation (although
400conversion by unsigned saturation would also suit our purpose).
401
402Half-operand modifiers may also be used with the pack
403operations.  Thus packing with conversion by masking off all
404but the low $n/2$ bits of each field may be
405be performed using the operation \verb#simd<n>::pack<l,l>#
406
407
408The final facility of the inductive doubling architecture is
409a set of merging operations \verb#simd<n>::mergeh# and
410\verb#simd<n>::mergel# that produce $2n$-bit fields
411by concatenating corresponding $n$-bit fields from the
412operand registers.   The respective
413operations \verb#r=simd<n>::mergeh(a,b)# and
414\verb#s=simd<n>::mergel(a,b)#
415are defined by the following equations.
416%\singlespace
417\begin{eqnarray}
418r_{2n}[i] & = & a[i] \times 2^n + b[i] \\
419s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)]
420\end{eqnarray}
421%\doublespace
422Both SSE and Altivec provide versions of pack and merge operations
423for certain field widths.  The pack operations are provided
424with operands having 16-bit or 32-bit fields on each platform, although
425with some variation in how conversion is carried out.
426The merge operations are provided at 8-bit, 16-bit and 32-bit
427field widths on both architectures and also at the 64-bit level
428on SSE.
429
430This completes the description of the proposed inductive
431doubling architecture.  As described, many of the features
432are already available with the SIMD facilities of
433existing commodity processors.  The extensions enumerated
434here are relatively straightforward.  The innovation
435is to specifically tackle the design of facilities to
436offer systematic support for transitions between power-of-2 field widths.
437As we shall show in the remainder of this paper, these facilities
438can dramatically reduce instruction count in core parallel bit
439stream algorithms, with a factor of 3 reduction being typical.
440The next section goes on to illustrate a first simple example
441for the task of population count.
442
443\section{Population Count}
444
445\begin{figure}
446\begin{center}\small
447\begin{verbatim}
448c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
449c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
450c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
451c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
452c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
453\end{verbatim}
454\end{center}
455\caption{Population Count Reference Algorithm}
456\label{HD-pop}
457\end{figure}
458
459\begin{figure}
460\begin{center}\small
461\begin{verbatim}
467\end{verbatim}
468\end{center}
469\caption{Ideal SIMD Population Count}
470\label{inductivepopcount}
471\end{figure}
472
473As an initial example to illustrate the principle of inductive doubling
474in practice, consider the problem of {\em population count}: determining
475the number of one bits within a particular bit field.  It is important
476enough for such operations as
477calculating Hamming distance to be included as a built-in instruction
478on some processors.  For example, the SPU of the Cell Broadband Engine
479has a SIMD population count instruction \verb:si_cntb: for simultaneously
480determining the
481number of 1 bits within each byte of a 16-byte register.
482In text processing with parallel bit streams, population count has direct
483application to keeping track of line numbers for error reporting, for example.
484Given a bit block identifying the positions of newline characters within
485a block of characters being processed, the population count of the
486bit block can be used to efficiently and conveniently be used to update
487the line number upon completion of block processing.
488
489Figure \ref{HD-pop} presents a traditional divide-and-conquer
490implementation for a 32-bit integer {\tt x} adapted from
491Warren \cite{HackersDelight}, while
492Figure \ref{inductivepopcount} shows the corresponding IDISA
493implementation for a vector of 32-bit fields.  Each implementation employs
494five steps of inductive doubling to produce population counts
495within 32 bit fields.  The traditional implementation employs
496explicit masking and shifting operations, while these
497operations are implicit within the semantics of the inductive
498doubling instructions shown in Figure \ref{inductivepopcount}.
499In each implementation, the first step determines the
500the population counts within 2-bit fields
501by adding the high bit of each such field to the low bit
502to produce a set of 2-bit counts in {\tt c}.
503In the second step, the counts within 4-bit fields of {\tt c} are determined
504by adding the counts of the corresponding high and low 2-bit subfields.
505Continuing in this fashion,
506the final population counts within 32-bit fields are determined in five steps.
507
508With the substitution of longer mask constants replicated for four
50932-bit fields, the implementation of Figure \ref{HD-pop} can be
510directly adapted to SWAR processing using 128-bit registers.
511Each binary operator is replaced by a corresponding binary
512SIMD operation.   Without the IDISA features, a
513straightforward SWAR implementation of population count for
51432-bit fields thus employs 10 operations to load or generate
515mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
516total of 30 operations to complete the task.   Employing
517optimization identified by Warren, this can be reduced to
51820 operations, 5 of which are required to generate mask constants.
519At the cost of register pressure, it is possible that these constants
520could be kept preloaded in long vector processing.
521In any case, the IDISA implementation requires only 5 operations
522to carry out the work requiring 15 to 20 operations with the
523optimized version using standard SWAR operations.   IDISA thus offers
524offers a 3X to 4X improvement in instruction count as well as
525a reduction in register pressure.
526
527The pattern illustrated by population count is typical.  An
528inductive doubling algorithm of $n$ steps typically applies
529two mask or shift operations at each step for each of the
530two operands being combined in the step.   If the combination
531can be implemented in a single SWAR operation, then a total
532of $3n$ operations are the minimum required for the SWAR
533algorithm with IDISA features.   An IDISA implementation
534typically eliminates the explicit mask and shift operations
535through appropriate half-operand modifiers, reducing the
536total instruction count to $n$.   Thus a 3X improvement
537is typical.
538
539\section{Transposition to Parallel Bit Streams}
540
541In this section, we consider the first major
542application of the inductive doubling architecture:
543transposition of byte stream data to parallel bit stream
544form.   Of course, this operation is critical to the
545method of parallel bit streams and all applications
546of the method can benefit from a highly efficient
547transposition process.  Before considering how
548the inductive doubling architecture supports this
549transposition process, however, we first consider
550algorithms on existing architectures.  Two algorithms
551are presented; the best of these requires 72
552SIMD operations in the three-register model to perform
553transposition of eight serial registers of byte stream data into
554eight parallel registers of bit stream data.
555
556We then go on to show how the transposition problem
557can be solved using the inductive doubling architecture
558with a mere 24 three-register SIMD operations.  We also prove
559that this is optimal for any three-register instruction set model.
560
561
562\begin{figure}[tbh]
563\begin{center}
564\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
565\caption{Input/Output Model for Serial to Parallel Transposition}
566\label{s2p-spec}
567\end{center}
568
569\end{figure}
570Figure \ref{s2p-spec} illustrates the input-output requirements of
571the transposition problem.  We assume that inputs and
572outputs are each SIMD registers of size $N=2^K$ bits.
573The input consists of $N$ bytes of serial byte data,
574stored consecutively in eight SIMD registers each holding
575$N/8$ bytes.  The output consists of eight parallel
576registers, one each for the eight individual bit positions
577within a byte.  Upon completion of the transposition process,
578each output register is to hold the $N$ bits corresponding
579to the selected bit position in the sequence of $N$ input
580bytes.
581
582\subsection{Bit Gathering Algorithm}
583
584\begin{figure}[tbh]
585\begin{center}
586\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
587\caption{Serial to Parallel Transposition Using Bit-Gathering}
588\label{gather}
589\end{center}
590\end{figure}
591One straightforward algorithm for implementing the transposition process
592takes advantage of SIMD bit gathering operations that exist
593on some architectures.  This operation gathers one bit per byte
594from a particular position within each byte of a SIMD register.
595For example, the {\tt pmovmskb} operation of the Intel MMX and
596SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
597consisting of the high bit of each byte.  Similarly, the
598{\tt \verb:si_gbb:} operation of the synergistic processing units of the
599Cell Broadband Engine gathers together the low bit of each byte
600from the SIMD register.  Figure \ref{gather} illustrates the
601bit gathering process.
602
603For each bit stream of output, the bit gather algorithm requires
604one gather operation for each of the 8 input registers,
605giving 64 bit gather operations in all.  In addition, for seven
606of the eight bit positions, it is necessary to shift the bits
607of each input register into the conventional position for
608gathering.  A total of 56 shift operations are required for this
609task.  Finally, the result of each bit gather operation must
610be properly inserted into the output stream.  If the first
611gather operation for a stream can be used to also initialize
612the output register, there will then need to be 7 insert
613operations for the results of the remaining gather operations
614for this stream, making 56 insert operations in all.
615The insert step may be more complex than a single operation
616in some cases, but we use one operation per insert as a lower bound.
617Thus, the bit gather algorithm requires
618at least 176 operations to perform the transposition task.
619
620\subsection{BytePack Algorithm}
621
622A much more efficient transposition algorithm on commodity
623SIMD architectures (SSE and Altivec) involves three
624stages of binary division transformation.  This is similar
625to the three stage bit matrix inversion described by
626Warren  \cite{HackersDelight}, although modified to use SIMD operations.
627In each stage, input streams are divided into two half-length output streams.
628The first stage separates the bits at even numbered positions from those
629at odd number positions.  The two output streams from the first
630stage are then further divided in the second stage.
631The stream comprising even numbered bits from the original byte stream
632divides into one stream consisting of bits from positions 0 and 4 of each
633byte in the original stream and a second stream consisting of bits
634from positions 2 and 6 of each original byte.  The stream of bits from
635odd positions is similarly divided into streams for bits from Each of the
636positions 1 and 5 and bits from positions 2 and 6.
637Finally, each of the four streams resulting from the second stage are
638divided into the desired individual bit streams in the third stage.
639
640\begin{figure}[tbh]
641\begin{center}\small
642\begin{verbatim}
643t0 = simd<16>::pack<h,h>(s0, s1);
644t1 = simd<16>::pack<l,l>(s0, s1);
645p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1));
646p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1);
647\end{verbatim}
648\end{center}
649\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
650\label{s2pstep}
651\end{figure}
652
653The binary division transformations are accomplished in each stage
654using byte packing, shifting and masking.  In each stage, a
655transposition step combines each pair of serial input registers to
656produce a pair of parallel output registers.  In essence,
657doublebytes from the input registers are packed into bytes
658in the output registers, with the bits from even positions stored
659in the bytes of one output stream and the bits from odd positions
660stored in the bytes of the second output stream.
661Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
662producing two parallel registers \verb:p0: and \verb:p1: from
663two serial registers \verb:s0: and \verb:s1:.  This step is applied
664four times in stage 1; stages 2 and 3 also consist of four applications
665of a similar step with different shift and masking constants.
666
667Although we have used the idealized SIMD notation here, each of the
668operations maps to a single operation in the Altivec set and a small number
669of operations in the SSE set.  Using the Altivec set, there are
6706 operations per step for a total of 24 operations per stage.
671The three stages combined required 72 operations to transpose 128 bytes
672to parallel bit stream form.  This is the best algorithm known to
673us for existing SIMD architectures.
674
675\subsection{Inductive Halving Algorithm}
676
677Using the inductive doubling architecture, it is possible to design
678a transposition algorithm that is both easier to understand and requires
679many fewer operations than the the bytepack algorithm described above.
680We call it the inductive halving algorithm for serial to parallel
681transposition, because it proceeds by reducing byte streams to
682two sets of nybble streams in a first stage, dividing the nybble
683streams into streams of bitpairs in a second stage and finally
684dividing the bitpair streams into bit streams in the third stage.
685
686
687\begin{figure}[tbh]
688\small
689\begin{verbatim}
690p0 = simd<8>::pack<h,h>(s0, s1);
691p1 = simd<8>::pack<l,l>(s0, s1);
692\end{verbatim}
693\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
694\label{halvingstep}
695\end{figure}
696
697Figure \ref{halvingstep} shows one step in stage 1 of the inductive
698halving algorithm, comprising just two SIMD operations.
699The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
700from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
701the low nybble of each byte.  As in the bytepack algorithm, this step is
702applied 4 times in stage 1, for a total of 8 operations.
703
704Stage 2 of the inductive halving algorithm reduces nybble streams
705to streams of bit pairs.  The basic step in this algorithm consists
706of one \verb#simd<4>::pack<h,h># operation to extract the high pair
707of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
708low pair of each nybble.  Four applications of this step complete stage 2.
709
710Stage 3 similarly uses four applications of a step that uses a
711\verb#simd<2>::pack<h,h># operation to extract the high bit of
712each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
713each pair.  The complete algorithm to transform eight serial
714byte registers s0 through s7 into the eight parallel bit stream
715registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
716
717\begin{figure}[tbh]
718\small
719\begin{verbatim}
720hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
721lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
722hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
723lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
724hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
725lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
726hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
727lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
728hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
729hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
730lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
731ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
732hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
733hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
734lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
735ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
736bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
737bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
738bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
739bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
740bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
741bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
742bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
743bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
744\end{verbatim}
745\caption{Complete Inductive Halving Algorithm}
746\label{halvingalgorithm}
747\end{figure}
748
749\subsection{Optimality of the Inductive Halving Algorithm}
750
751Here we show that the inductive halving algorithm presented in
752the previous subsection is optimal in the following sense:
753no other algorithm on any 3-register SIMD architecture can use
754fewer than 24 operations to transform eight serial registers
755of byte stream data into eight parallel registers of bit stream data.
756By 3-register SIMD architecture, we refer to any architecture
757that uses SIMD instructions consistent with our overall model of
758binary operations using two input register operands to produce
759one output register value.
760
761Observe that the $N$ data bits from each input register must be
762distributed $N/8$ each to the 8 output registers by virtue of
763the problem definition.  Each output register can effectively
764be given a 3-bit address; the partitioning problem can be viewed
765as moving data to the correct address.  However, each
766operation can move results into at most one register.
767At most this can result in the assignment of one correct address
768bit for each of the $N$ input bits.  As all $8N$ input bits
769need to be moved to a register with a correct 3-bit address,
770a minimum of 24 operations is required.
771
772\section{Parallel to Serial Conversion}
773
774Parallel bit stream applications may apply string editing
775operations in bit space to substitute, delete or insert
776parallel sets of bits at particular positions.  In such cases,
777the inverse transform that converts a set of parallel bit
778streams back into byte space is needed.  In the example of
779UTF-8 to UTF-16 transcoding, the inverse transform is
780actually used twice for each application of the forward
781transform, to separately compute the high and low byte
782streams of each UTF-16 code unit.  Those two byte streams
783are subsequentely merged to form the final result.
784
785Algorithms for performing the inverse transform mirror those
786of the forward transform, employing SIMD merge operations
787in place of pack operations.  The best algorithm known
788to us on the commodity SIMD architectures takes advantage
789of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
790operations that are available with each of the SSE and Altivec instruction
791sets.  These algorithms take 72 operations to perform the
792inverse transposition of 8 parallel registers of bit stream
793data into 8 serial registers of byte stream data.
794
795\begin{figure}[tbh]
796\begin{center}\small
797\begin{verbatim}
798bit01_r0 = simd<1>::mergeh(bit0, bit1);
799bit01_r1 = simd<1>::mergel(bit0, bit1);
800bit23_r0 = simd<1>::mergeh(bit2, bit3);
801bit23_r1 = simd<1>::mergel(bit2, bit3);
802bit45_r0 = simd<1>::mergeh(bit4, bit5);
803bit45_r1 = simd<1>::mergel(bit4, bit5);
804bit67_r0 = simd<1>::mergeh(bit6, bit7);
805bit67_r1 = simd<1>::mergel(bit6, bit7);
806bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
807bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
808bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
809bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
810bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
811bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
812bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
813bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
814s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
815s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
816s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
817s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
818s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
819s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
820s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
821s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
822\end{verbatim}
823\end{center}
824\label{p2s-inductive}
825\caption{Parallel to Serial Transposition by Inductive Doubling}
826\end{figure}
827
828An algorithm employing only 24 operations using the
829inductive doubling instruction set architecture is relatively
830straightforward.. In stage 1, parallel registers for individual bit streams
831are first merged with bit-level interleaving
832using \verb#simd<1>::mergeh# and \verb#simd<8>::mergel#
833operations.  For each of the four pairs of consecutive
834even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
835two consecutive registers of bitpair data are produced.
836In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
837are then applied to merge to bitpair streams to produce streams
838of nybbles for the high and low nybble of each byte.  Finally,
839the nybble streams are merged in stage 3 to produce the
840desired byte stream data.  The full inductive doubling
841algorithm for parallel to serial transposition is shown in Figure
842\ref{p2s-inductive}.
843
844This algorithm is again optimal, requiring the fewest operations
845of any possible algorithm using any 3-register instruction set
846model.  The proof directly follows that for the serial to parallel
847transposition.
848
849The existence of high-performance algorithms for transformation of
850character data between byte stream and parallel bit stream form
851in both directions makes it possible to consider applying these
852transformations multiple times during text processing applications.
853Just as the time domain and frequency domain each have their
854use in signal processing applications, the byte stream form and
855parallel bit stream form can then each be used at will in
856character stream applications.
857
858
859
860
861
862\section{Parallel Bit Deletion}
863
864Parallel bit deletion is an important operation that allows string
865editing operations to be carried out while in parallel bit stream
866form.  It is also fundamental to UTF-8 to UTF-16 transcoding
867using parallel bit streams, allowing the excess code unit
868positions for UTF-8 two-, three- and four-byte sequences to
869be deleted once the sixteen parallel bit streams of UTF-16 have
870been computed \cite{PPoPP08}.
871
872Parallel bit deletion is specified using a deletion mask.
873A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
874to be deleted and 0s at positions identifying bits to be retained.
875For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
876bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
877accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
878
879Bit deletion may be performed using
880the parallel-prefix compress algorithm documented by
881Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
882only logic and shifts with a constant parameter to carry
883out the deletion process.  However, it requires $k^2$
884preprocessing steps for a final field width parameter
885of size $2^k$, as well as 4 operations per deletion step
886per stream.  Using the inductive doubling instruction set architecture
887it is possible to carry out bit deletion much more efficiently.
888
889Deletion within fixed size fields or registers may produce results that are either
890left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
891eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
892care positions marked \verb:x:'.  Concatenating an adjacent right-justified result with a
893left-justified result produces an important intermediate form known as a
894{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
895right-justified and left-justified results from the application of the
8964-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
897stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
898the central result \verb:xxbdefhx:, which may easily be converted to a either a
899left or a right justified 8-element result by an appropriate shift operation.
900
901\begin{figure*}
902\begin{center}
903\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
904\hline
905\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
906\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
907\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
908\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
909\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
910\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
911\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
912\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
913\end{tabular}
914\end{center}
915\label{centraldelstep}
916\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
917\end{figure*}
918
919The observation about how two $n$-bit central deletion results can
920combine to yield a $2n$ central deletion result provides the basis
921for an inductive doubling algorithm.  Figure \ref{centraldelstep}
922illustrates the inductive process for the transition from 8-bit central
923deletion results to 16-bit central deletion results.  The top row shows
924the original deletion mask, while the second row shows the original
925bit stream to which deletions are to be applied, with deleted bits zeroed out.
926The third row shows the central result for each 8-bit field as the
927result of the previous inductive step.
928
929To perform the $8 \rightarrow 16$ central deletion step, we first form
930the population counts of 4-bit fields of the original deletion mask as
931shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
932an 8-bit central result, we perform a right shift by the population count
933of the low half of the field.  Similarly,
934left-justification requires a left-shift by the population count in the
935high half of the field.
936
937The left and right shifts can be performed simultaneously using a rotate
938left instruction.  Right justification by shifting an $n$ bit field
939$i$ positions to the right is equivalent to a left rotate of $n-i$
940positions.  These rotation amounts are computed by the operation
941\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
942except that don't care fields (which won't be subsequently used)
943are marked \verb:XX:.
944
945The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
946as shown in row 6, and are combined with the right shift amounts
947by the selection operation  \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
948as shown in row 7.  Using these computed values, the inductive step
949is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
950as shown in row 8.
951
952At each inductive doubling level, it requires 4 operations to compute the
953required deletion infomation and one operation per bit stream to perform deletion.
954Note that, if deletion is to be applied to a set of eight parallel bit streams,
955the computed deletion information is used for each stream without recomputation,
956thus requiring 12 operations per inductive level.
957
958In comparison to the parallel-prefix compress method, the method of central
959deletion results using the inductive doubling architecture has far fewer operations.
960The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
961induction versus $4k^2$ for the parallel-prefix method.  Using the computed
962deletion operation, only a single SIMD rotate operation per bit stream
963per level is needed, in comparison with 4 operations per level for parallel-prefix
964compress.
965
966
967
968\section{Beyond Parallel Bit Streams}
969
970It can be argued that text processing with parallel bit
971streams by itself is not sufficiently important to justify
972the circuit complexity of IDISA.  In this section, then
973we move on to consider further applications that may
974benefit from IDISA features.  These include additional
975basic inductive doubling
976algorithms such as parity, and least power-of-2 ceiling.
977Further applications for narrow field widths are discussed
978as well, such as packed DNA representations using 2-bit
979field widths and packed decimal representations using 4-bit
980field widths. Most significantly, however, we show that IDISA offers
981a fully general solution to the problem of supporting
982{\em horizontal} SWAR operations.
983
984\subsection{Inductive Doubling with Logic Operations}
985
986\begin{figure}
987\begin{center}\small
988\begin{verbatim}
989y = x ^ (x >> 1);
990y = y ^ (y >> 2);
991y = y ^ (y >> 4);
992y = y ^ (y >> 8);
993y = y ^ (y >>16);
994y = y & 1;
995\end{verbatim}
996\end{center}
997\caption{Parity Reference Algorithm}
998\label{HD-parity}
999\end{figure}
1000
1001\begin{figure}
1002\begin{center}\small
1003\begin{verbatim}
1004x = x - 1;
1005
1006x = x | (x >> 1);
1007
1008x = x | (x >> 2);
1009
1010x = x | (x >> 4);
1011
1012x = x | (x >> 8);
1013
1014x = x | (x >>16);
1015
1016x = x + 1;
1017\end{verbatim}
1018\end{center}
1019\caption{Power-of-2 Ceiling Reference Algorithm}
1020\label{HD-clp2}
1021\end{figure}
1022
1023\begin{figure}
1024\begin{center}\small
1025\begin{verbatim}
1026y = simd<2>::xor<l,h>(x, x);
1027y = simd<4>::xor<l,h>(y, y);
1028y = simd<8>::xor<l,h>(y, y);
1029y = simd<16>::xor<l,h>(y, y);
1030y = simd<32>::xor<l,h>(y, y);
1031\end{verbatim}
1032\end{center}
1033\caption{IDISA Parity Implementation}
1034\label{ID-parity}
1035\end{figure}
1036
1037\begin{figure}
1038\begin{center}\small
1039\begin{verbatim}
1040x = simd<32>::sub(x, simd<32>::const(1));
1041x = simd<2>::or<x, h>(x, x);
1042x = simd<4>::or<x, h>(x, x);
1043x = simd<8>::or<x, h>(x, x);
1044x = simd<16>::or<x, h>(x, x);
1045x = simd<32>::or<x, h>(x, x);
1047\end{verbatim}
1048\end{center}
1049\caption{IDISA Power-of-2 Ceiling Implementation}
1050\label{ID-clp2}
1051\end{figure}
1052
1053Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
1054code for two further inductive doubling examples using
1055logical operators as the combining feature.  In the
1056first of these, the exclusive or'' operation is applied
1057to all bits in a 32-bit value to determine parity.
1058Parity has important applications for error-correcting
1059codes such as the various Hamming codes for detecting
1060and correcting numbers of bit errors dependent on the
1061number of parity bits added.  Warren's version uses
106211 operations for parity of a single 32-bit value;
1063a straightforward SWAR adaptation also uses 11 operations
1064for the parity of a set of 32-bit fields.
1065
1066Figure \ref{ID-parity} shows the IDISA parity implementation
1067with only 5 operations required.  This represents slightly
1068more than a 2X improvement.  The improvement is less than
10693X seen in other cases because one of the operands need
1070not be modified before applying the exclusive-or operation.
1071
1072Using the `or'' logical operation, Figure \ref{HD-clp2} shows Warren's
1073code for the least power-of-2 ceiling of a 32-bit value.  The
1074technique is to employ inductive doubling to fill in one bits at
1075all bit positions after the first one bit in the original value to
1076first produce a value of the form $2^n-1$.  This is then incremented
1077to determine the power of 2 ceiling.  This function and
1078its straightforward adaptation for SWAR application on a
1079set of 32-bit fields requires 12 operations.   The
1080corresponding IDISA implementation of Figure \ref{ID-clp2}
1081requires 7 operations, just under a 2X improvement.
1082
1083\subsection{Packed DNA Representation}
1084
1085DNA sequences are often represented as strings consisting
1086of the four nucleotide codes A, C, G and T.  Internally,
1087these sequences are frequently represented in internal
1088form as packed sequences of 2-bit values.  The IDISA
1089\verb#simd<8>:pack# and \verb#simd<4>:pack# operations
1090allow these packed representations to be quickly computed
1091from byte-oriented string values by two steps of inductive
1092halving.   Similarly, conversion back to string form
1093can use two steps of inductive merging.  Without direct
1094support for these pack and merge operations, the SWAR
1095implementations of these conversions require the cost
1096of explicit masking and shifting in combination with
1097the 16-bit to 8-bit packing and 8-bit to 16-bit
1098merging operations supported by existing SWAR facilities
1099on commodity processors.
1100
1101\subsection{Bit Reverse}
1102
1103Include or omit?
1104
1105\subsection{String/Decimal/Integer Conversion}
1106\begin{figure}
1107\begin{center}\small
1108\begin{verbatim}
1109b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
1110b = (d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF);
1111b = (d & 0x0000FFFF) + 10000 * (d >> 16);
1112\end{verbatim}
1113\end{center}
1114\caption{BCD to Integer Reference Algorithm}
1115\label{BCD2int}
1116\end{figure}
1117
1118\begin{figure}
1119\begin{center}\small
1120\begin{verbatim}
1121c10 = simd<16>:const(10);
1122c100 = simd<16>:const(100);
1123c10000 = simd<32>:const(10000);
1127\end{verbatim}
1128\end{center}
1129\caption{IDISA Parity Implementation}
1130\label{ID-BCD2int}
1131\end{figure}
1132
1133Just as DNA sequences represent an important use case for
1134SWAR operations on 2-bit fields, packed sequences of
1135decimal or hexadecimal digits represent a common use case
1136for 4-bit fields.  These representations can be used
1137both as an intermediate form in numeric string to integer
1138conversion and as a direct representation for
1139packed binary coded decimal.
1140
1141Figure \ref{BCD2int} shows a three-step inductive
1142doubling implementation for conversion of 32-bit packed BCD
1143values to integer form.  The 32-bit value consists
1144of 8 4-bit decimal digits.  Pairs of digits are
1145first combined by multiplying the higher digit
1146of the pair by 10 and adding.  Pairs of these
1147two-digit results are then further combined by
1148multiplying the value of the higher of the two-digit
1149results by 100 and adding.  The final step is
1150to combine four-digit results by multiplying the
1151higher one by 10000 and adding.  Overall, 20
1152operations are required for this implementation
1153as well as the corresponding SWAR implementation
1155into registers for repeated use can reduce the number of
1156operations to 14 at the cost of significant register
1157pressure.
1158
1159The IDISA implementation of this algorithm is shown
1160in Figure \ref{ID-BCD2int}.  This implementation
1161shows an interesting variation in the use of
1162half-operand modifiers, with only one operand
1163of each of the addition and multiplication operations
1164modified at each level.  Overall, this implementation
1165requires 9 operations, or 6 operations with 3
1166preloaded constants.  This represents more than a 2X
1167reduction in instruction count as well as a 2X reduction
1168in register pressure.
1169
1170
1171
1172
1173\subsection{Systematic Support for Horizontal SIMD Operations}
1174
1175In SIMD parlance, {\em horizontal} operations are
1176operations which combine values from two or more fields
1177of the same register, in contrast to the normal
1178{\em vertical} operations which combine corresponding
1179fields of different registers.  Horizontal operations
1180can be found that combine two (e.g., \verb:haddpd: on SSE3),
1181four (e.g, \verb:si_orx: on SPU), eight (e.g, \verb:psadbw: on SSE)
1182or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
1183horizontal operations have a vertical component as well.
1184For example, \verb:psadbw: first forms the absolute value of
1185the difference of eight corresponding byte fields before
1186performing horizontal add of the eight values, while
1187\verb:vsum4ubs: on Altivec performs horizontal add of sets of
1188four unsigned 8-bit fields within one register
1189and then combines the result horizontally with
1190corresponding 32-bit fields of a second registers.
1191
1192The space of potential horizontal operations thus has
1193many dimensions, including not only the particular
1194combining operation and the operand field width, but
1195also the number of fields being combined, whether a
1196vertical combination is applied and whether it is applied
1197before or after the horizontal operation and what the
1198nature of the vertical combining operation is.
1199Within this space, commodity SIMD architectures tend
1200to support only a very few combinations, without any
1201particular attempt at systematic support for horizontal
1202operations in general.
1203
1204In contrast to this {\em ad hoc} support on commodity
1205processors, IDISA offers a completely systematic treatment
1206of horizontal operations without any special features beyond
1207the inductive doubling features already described.
1208In the simplest case, any vertical operation
1209\verb#simd<n>::F# on $n$-bit fields gives rise to
1210an immediate horizontal operation
1212pairs of $n/2$ bit fields.
1214in adjacent 8 bit fields to produce 16 bit results,
1215while \verb#simd<32>::min<l,h># can produce the
1216minimum value of adjacent 16-bit fields.
1217Thus any binary horizontal operation can be implemented
1218in a single IDISA instruction making use of the \verb:<h, l>:
1219operand modifier combination.
1220
1221Horizontal combinations of four adjacent fields can also be
1222realized in a general way through two steps of inductive
1223doubling.  For example, consider the or-across operation \verb:si_orx:
1224of the SPU, that performs a logical or operation
1225on four 32-bit fields.  This four field combination
1226can easily be implemented with the following two operations.
1227%\begin{singlespace}
1228\begin{verbatim}
1229t = simd<64>::or<l,h>(x, x)
1230t = simd<128>::or<l,h>(t, t)
1231\end{verbatim}
1232%\end{singlespace}
1233
1234In general, systematic support for horizontal
1235combinations of sets of $2^h$ adjacent fields may
1236be realized through $h$ inductive double steps
1237in a similar fashion.
1238Thus, IDISA esssentially offers systematic support
1239for horizontal operations entirely through the
1240use of \verb:<h, l>: half-operand modifier
1241combinations.
1242
1243Systematic support for general horizontal operations
1244under IDISA also creates opportunity for a design tradeoff:
1245offsetting the circuit complexity of half-operand
1246modifiers with potential elimination of dedicated
1247logic for some {/ad hoc} horizontal SIMD operations.
1248Even if legacy support for these operations is required,
1249it may be possible to provide that support through
1250software or firmware rather than a full hardware
1251implementation.  Evaluation of these possibilities
1252in the context of particular architectures is a potential
1253area for further work.
1254
1255
1256\begin{figure*}
1257\begin{center}
1258\begin{tabular}{|c||c|c|c|}
1259\hline
1260Kernel & Altivec ops & IDISA ops & Speed-up  \\ \hline
1261pop\_count<32> &  & 5n & 3X  \\ \hline
1262
1263bit\_reverse<32> & & & \\ \hline
1264Gray2binary<32> & 10n & 5n &  2X\\ \hline
1265\end{tabular}
1266\end{center}
1267\label{perftable}
1268\caption{Performance Results}
1269\end{figure*}
1270
1271
1272
1273\section{Implementation}
1274
1275We have carried implementation work for IDISA in three
1276ways.  First, we have constructed libraries that
1277implement the IDISA instructions by template and/or macro
1278expansion for each of MMX, SSE, Altivec, and SPU platforms.
1279Second, we have developed a model implementation
1280involving a modified operand fetch component
1281of a pipelined SIMD processor.  Third, we have written
1282and evaluated Verilog HDL description of this model
1283implementation.
1284
1285\subsection{IDISA Libraries}
1286
1287Implementation of IDISA instructions using template
1288and macro libraries has been useful in developing
1289and assessing the correctness of many of the algorithms
1290presented here.  Although these implementations do not
1291deliver the performance benefits associated with
1292direct hardware implementation of IDISA, they
1293have been quite useful in providing a practical means
1294for portable implementation of parallel bit stream
1295algorithms on multiple SWAR architectures.  However,
1296one additional facility has also proven necessary for
1297portability of parallel bit stream algorithms across
1298big-endian and little-endian architectures: the
1299notion of shift-forward and shift-back operations.
1300In essence, shift forward means shift to the left
1301on little-endian systems and shift to the right on
1302big-endian systems, while shift back has the reverse
1303interpretation.  Although this concept is unrelated to
1304inductive doubling, its inclusion with the IDISA
1305libraries has provided a suitable basis for portable
1306SIMD implementations of parallel bit stream algorithms.
1307Beyond this, the IDISA libraries have the additional
1308benefit of allowing the implementation
1309of inductive doubling algorithms at a higher level
1310abstraction, without need for programmer coding of
1311the underlying shift and mask operations.
1312
1313\subsection{IDISA Model}
1314\begin{figure}[tbh]
1315\begin{center}
1316\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
1317\caption{IDISA Block Diagram}
1318\label{pipeline-model}
1319\end{center}
1320\end{figure}
1321
1322Figure \ref{pipeline-model} shows a block diagram
1323for a pipelined SIMD processor implementing IDISA.
1324The SIMD Register File (SRF) provides a file of $R = 2^A$
1325registers each of width $N = 2^K$ bits.
1326IDISA instructions identified by the Instruction Fetch
1327Unit (IFU) are forwarded for decoding to the SIMD
1328Instruction Decode Unit (SIDU).  This unit decodes
1329the instruction to produce
1330signals identifying the source and destination
1331operand registers, the half-operand modifiers, the
1332field width specification and the SIMD operation
1333to be applied.
1334
1335The SIDU supplies the source register information and the half-operand
1336modifier information to the SIMD Operand Fetch Unit (SOFU).
1337For each source operand, the SIDU provides an $A$-bit register
1338address and two 1-bit signals $h$ and $l$ indicating the value
1339of the decoded half-operand modifiers for this operand.
1340Only one of these values may be 1; both are 0 if
1341no modifier is specified.
1342In addition, the SIDU supplies decoded field width information
1343to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).
1344The SIDU also supplies decoded SIMD opcode information to SIEU and
1345a decoded $A$-bit register address for the destination register to
1346the SIMD Result Write Back Unit (SRWBU).
1347
1348The SOFU is the key component of the IDISA model that
1349differs from that found in a traditional SWAR
1350processor.  For each of the two $A$-bit source
1351register addresses, SOFU is first responsible for
1352fetching the raw operand values from the SRF.
1353Then, before supplying operand values to the
1354SIEU, the SOFU applies the half-operand modification
1355logic as specified by the $h$, $l$, and field-width
1356signals.  The possibly modified operand values are then
1357provided to the SIEU for carrying out the SIMD operations.
1358A detailed model of SOFU logic is described in the following
1359subsection.
1360
1361The SIEU differs from similar execution units in
1362current commodity processors primarily by providing
1363SIMD operations at each field width
1364$n=2^k$ for $0 \leq k \leq K$.  This involves
1365additional circuitry for field widths not supported
1366in existing processors.  For inductive doubling
1367algorithms in support of parallel bit streams,
1368the principal need is for additional circuitry to
1369support 2-bit and 4-bit field widths.  This circuity
1370is generally less complicated than that for larger
1371fields.  Support for circuitry at these width
1372has other applications as well.   For example,
1373DNA sequences are frequently represented using
1374packed sequences of 2-bit codes for the four possible
1375nucleotides\cite{}, while the need for accurate financial
1376calculation has seen a resurgence of the 4-bit
1377packed BCD format for decimal floating point \cite{}.
1378
1379When execution of the SWAR instruction is
1380completed, the result value is then provided
1381to the SRWBU to update the value stored in the
1382SRF at the address specified by the $A$-bit
1383destination operand.
1384
1385\subsection{Operand Fetch Unit Logic}
1386
1387
1388
1389
1390
1391
1392
1393\section{Conclusions}
1394
1395This paper has considered the issue of architectural support for
1396SIMD text processing using the method of parallel bit streams and has
1397argued that this architectural support can best be provided
1398through a SIMD instruction set architecture that implements
1399features for direct support of inductive doubling algorithms.
1400Four key features of the inductive doubling architecture have
1401been identified include support for operations at all
1402power-of-2 field widths, half-operand modifiers and
1403pack and merge operations.  The principle innovation is the
1404notion of half-operand modifiers to support efficient
1405transition between successive power-of-two field widths.
1406
1407Several algorithms key to parallel bit stream methods
1408have been examined and shown to benefit from dramatic
1409reductions in instruction count compared to the best
1410known algorithms on existing architecture.  In the case
1411of transposition algorithms to and from parallel bit stream
1412form, the architecture has been shown to make possible
1413straightforward inductive doubling algorithms with the
1414lowest total number of operations that can be achieved by any
1415possible 3-register SIMD architecture.
1416
1417The inductive doubling architecture also has considerable
1418benefits beyond its role in supporting SIMD programming
1419with parallel bit streams.  Notable among these is
1420that the architecture provides a framework for systematic
1421support of horizontal SIMD operations.
1422
1423
1424
1425%\appendix
1426%\section{Appendix Title}
1427%
1428%This is the text of the appendix, if you need one.
1429
1430\acks
1431
1432This research was supported by a Discovery Grant from the
1433Natural Sciences and Engineering Research Council of Canada.
1434
1435%\bibliographystyle{plainnat}
1436\bibliographystyle{plain}
1437\bibliography{asplos094-cameron}
1438%\begin{thebibliography}{}
1439
1440%\bibitem{smith02}
1441%Smith, P. Q. reference text
1442
1443%\end{thebibliography}
1444
1445
1446\end{document}
Note: See TracBrowser for help on using the repository browser.