source: docs/ASPLOS09/asplos094-cameron.tex @ 251

Last change on this file since 251 was 251, checked in by cameron, 11 years ago

simd<n>::const => constant

File size: 70.8 KB
Line 
1%-----------------------------------------------------------------------------
2%
3% asplos094-cameron.tex
4% Robert D. Cameron and Dan Lin
5%
6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
7%
8%-----------------------------------------------------------------------------
9\input epsf
10
11%\documentclass[preprint,natbib,10pt]{sigplanconf}
12%\documentclass[natbib,10pt]{sigplanconf}
13\documentclass[10pt]{sigplanconf}
14
15\usepackage{amsmath}
16\usepackage{graphicx}
17
18\begin{document}
19
20\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
21\CopyrightYear{2009}
22\copyrightdata{978-1-60558-215-3/09/03}
23
24
25\titlebanner{banner above paper title}        % These are ignored unless
26\preprintfooter{short description of paper}   % 'preprint' option specified.
27
28\title{Architectural Support for SWAR Text Processing with Parallel Bit
29Streams: The Inductive Doubling Principle}
30%\subtitle{Subtitle Text, if any}
31
32\authorinfo{Robert D. Cameron \and Dan Lin}
33           {School of Computing Science, Simon Fraser University}
34           {\{cameron, lindanl\}@cs.sfu.ca}
35
36
37\maketitle
38
39\begin{abstract}
40Parallel bit stream algorithms exploit the SWAR (SIMD within a
41register) capabilities of commodity
42processors in high-performance text processing applications such as
43UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
44matching.   Direct architectural support for these algorithms in future SIMD
45instruction sets could further increase performance as well as simplifying the
46programming task. A set of simple SWAR instruction set extensions are proposed
47for this purpose based on the principle of systematic support for  inductive
48doubling as an algorithmic technique.  These extensions are shown to
49significantly reduce instruction count in core parallel bit stream algorithms,
50often providing a 3X or better improvement.  The extensions are also shown to be useful
51for SIMD programming in other application areas, including providing a
52systematic treatment for horizontal operations.  An implementation model for
53these extensions
54involves relatively simple circuitry added to the operand fetch components
55in a pipelined processor.
56\end{abstract}
57
58\category{CR-number}{subcategory}{third-level}
59
60\terms
61term1, term2
62
63\keywords
64keyword1, keyword2
65
66\sloppy
67
68\section{Introduction}
69
70In the landscape of parallel computing research, finding ways to
71exploit intrachip (multicore) and intraregister (SWAR) parallelism
72for text processing and other non-numeric applications is particularly
73challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
74study \cite{Landscape} identifies the finite-state machine algorithms associated
75with text processing to be the hardest of the thirteen ``dwarves''
76to parallelize, concluding that nothing seems to help.  Indeed,
77the study even speculates that applications in this area may simply be
78``embarrasingly sequential,'' easy to tackle for traditional
79sequential processing approaches suitable for uniprocessors,
80but perhaps fundamentally unsuited to parallel methods.
81
82One approach that shows some promise, however, is the
83method of parallel bit streams, recently applied to
84UTF-8 to UTF-16 transcoding \cite{u8u16, PPoPP08}, XML parsing \cite{CASCON08, Herdy}
85and amino acid sequencing\cite{Green}.
86In this method, byte-oriented character data is first transposed to eight
87parallel bit streams, one for each bit position within the character code
88units (bytes).  Loading bit stream data into 128-bit SIMD registers,
89then, allows data from 128 consecutive code units to be represented and
90processed at once.  Bitwise logic and shift operations, bit scans,
91population counts and other bit-based operations are then used to carry
92out the work.
93
94In application to UTF-8 to UTF-16 transcoding, a 3X to 25X speed-up
95is achieved in using parallel bit stream techniques on SIMD-capable uniprocessors
96employing the SSE or Altivec instruction sets\cite{PPoPP08}.
97In the broader context of XML parsing, further applications of these
98techniques demonstrate the utility of parallel bit stream techniques
99in delivering performance benefits through a significant portion of the
100web technology stack.  In an XML statistics gathering application,
101including the implementation of XML well-formedness checking, an
102overall 3X to 10X performance improvement is achieved in using
103the parallel bit stream methods in comparison with a similarly
104coded application using such well known parsers as Expat and Xerces \cite{CASCON08}.
105In an application involving transformation between different XML formats
106(GML and SVG), an implementation using parallel bit stream technology
107required a mere 15 cycles per byte, while a range of other technologies
108required from 25 to 200 cycles per byte \cite{Herdy}.
109Ongoing work is further applying the parallel bit stream methds to
110parallel hash value computation and parallel regular expression matching
111for the purpose of validating XML datatype declarations in
112accord with XML Schema \cite{CASCON08}.
113
114Given these promising initial results in the application of
115parallel bit stream methods, what role might architectural support play in
116further enhancing this route to parallelization of text processing?
117This paper addresses this question through presentation and
118analysis of a constructive proposal: a set of SIMD instruction set
119features based on the principle of systematic support for
120inductive doubling algorithms.  Inductive doubling refers
121to a general property of certain kinds of algorithm that
122systematically double the values of field widths or other
123data attributes with each iteration.  In essence, the goal
124of the proposed features is to support such algorithms
125with specific facilities to  transition between successive power-of-2 field
126widths.   These transitions are quite frequent in several critical
127algorithms for parallel bit streams.   These transitions also
128occur in other applications as well.  In related work,
129efficient automatic interleaving based on power-of-2 strides has been
130reported quite useful for a number of SIMD kernels \cite{Nuzman}.
131The specific features presented herein will be referred to
132as IDISA: inductive doubling instruction set architecture.
133
134The remainder of this paper is organized as follows.
135The second section of this paper introduces IDISA and the
136SIMD notation used throughout this paper.  The third
137section moves on to discuss an evaluation methodology
138for IDISA in comparison to two reference architectures
139motivated by the SSE and Altivec instruction sets.
140The fourth section provides a short first example of
141the inductive doubling principle in action through
142the case of population count.  Sections 5 through 7
143then address the application of IDISA to core
144algorithms in text processing with parallel bit
145streams. 
146The eighth section then considers the potential role of
147IDISA in supporting applications beyond parallel bit streams.
148Section 9 addresses IDISA implementation while
149Section 10 concludes the paper with a summary of
150results and directions for further work.
151
152
153\section{Inductive Doubling Architecture}
154
155This section presents an idealized model for a single-instruction
156multiple-data (SIMD) instruction set architecture designed
157specifically to support inductive doubling algorithms in the
158domain of parallel bit stream programming.   The architecture is idealized
159in the sense that we concentrate on only the necessary features
160for our purpose, without enumerating the additional
161operations that would be required for
162SIMD applications in other domains.  The goal is to focus
163on the principles of inductive doubling support in a way
164that can accommodate a variety of realizations as other
165design constraints are brought to bear on the overall instruction set
166design.  First we introduce a simple model and notation for
167SIMD operations in general and then present the four key
168features of an idealized architecture in support of parallel
169bit stream programming.
170
171The idealized architecture supports typical SIMD integer
172operations common to existing commodity architectures such as SSE
173and Altivec.   The architecture is defined to support SIMD
174operations on registers of size $N=2^K$ bits, for some integer $K$.
175Typical values of $K$ for commodity processors include $K=6$ for
176the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
177the 128-bit registers of SSE and Altivec technology and $K=8$ for
178the upcoming Intel AVX technology.   The registers may be
179partitioned into $N/n$ fields of width $n=2^k$ bits for some values
180of $k \leq K$.   Typical values of $k$ used on commodity processors
181include $k = 3$ for SIMD operations on 8-bit fields (bytes),
182$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
183on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit
184fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$
185Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
186register $r$, using big-endian numbering.
187
188Throughout this paper, we focus on
189SWAR implementations using the following {\em
190three-register model}, although it is also possible to define
191IDISA extensions for a two-register model of the SSE instruction set.
192Let \verb:simd<n>: represent the class of SIMD operations defined
193on fields of size $n$ using C++ template syntax.  Given a
194binary function $F_n$ on $n$-bit fields, we denote the SIMD
195version of this operation as \verb#simd<n>::F#.  Given two
196SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
197respectively, the operation \verb#r=simd<n>::F(a,b)# stores
198the value $r$ in the register \verb:r: as determined by
199the simultaneous calculation of individual field values in
200accord with the following equation.
201\[ r_i = F_n(a_i, b_i) \]
202
203For example, addition(\verb:add:), subtraction (\verb:sub:) and shift left
204logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
205integers as follows.
206%\singlespace
207\begin{eqnarray}
208\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
209\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
210\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
211\end{eqnarray}
212%\doublespace
213The SSE and Altivec instruction sets support
214each of the addition and subtraction operations in SIMD form
215for 8, 16 and 32-bit fields, while the SSE set also includes
216the 64-bit forms.  For example, \verb#simd<8>::add# in our
217nomenclature is provided by the operation \verb:paddb:
218on SSE and the operation \verb:vaddubm: on Altivec.
219The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
220\verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are
221more constrained, requiring that all field shifts by the same amount.
222
223Given these definitions and notation, we now present
224the four key elements of an inductive doubling architecture.
225The first is a definition of a core set of binary functions
226on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
227The second is a set of {\em half-operand modifiers} that allow
228the inductive processing of fields of size $2n$ in terms of
229combinations of $n$-bit values selected from the fields.
230The third is the definition of packing operations that compress
231two consecutive registers of $n$-bit values into a single
232register of $n/2$-bit values.  The fourth is the definition
233of merging operations that produce a set of $2n$ bit fields
234by concatenating corresponding $n$-bit fields from two
235parallel registers.  Each of these features is described below.
236
237For the purpose of direct and efficient support for inductive
238doubling algorithms on bit streams, the provision of
239a core set of operations at field widths of 2 and 4 as
240well as the more traditional field widths of 8, 16 and 32
241is critical for elegant and efficient expression of the
242algorithms.  In essence, inductive doubling algorithms
243work by establishing some base property at either single
244or 2-bit fields.  Each iteration of the algorithm then
245goes on to establish the property for the power-of-2
246field width.  In order for this inductive step to be
247most conveniently and efficiently expressed, the
248core operations needed for the step should be available
249at each field width.  In the case of work with parallel
250bit streams, the operations \verb:add:, \verb:sub:,
251\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
252(rotate left) comprise the core.  In other domains,
253additional operations may be usefully included in the
254core depending on the work that needs to be performed
255at each inductive doubling level.
256
257Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
258also includes fields of width 1.  These are included for
259logical consistency, but are easily implemented by mapping
260directly to appropriate bitwise logic operations, which
261we assume are also available.  For example,
262\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
263bitwise exclusive-or operation on SIMD registers.
264
265The second key facility of the inductive doubling architecture
266is the potential application of half-operand modifiers to
267the fields of either or both of the operands of a SIMD
268operation.  These modifiers select either the
269low $n/2$
270bits of each $n$-bit field (modifier ``\verb:l:'') or the
271high $n/2$ bits (modifier ``\verb:h:'').  When required,
272the modifier ``\verb:x:'' means that the full $n$ bits
273should be used, unmodified.  The semantics of these
274modifiers are given by the following equations.
275%\singlespace
276\begin{eqnarray}
277l(r_n) & = & r_n \bmod 2^{n/2} \\
278h(r_n) & = & r_n / 2^{n/2} \\
279x(r_n) & = & r_n
280\end{eqnarray}
281%\doublespace
282In our notation, the half-operand modifiers are
283specified as optional template (compile-time) parameters
284for each of the binary functions.  Thus,
285\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
286the 2-bit quantity found in the high 2-bits of each 4-bit field
287of its first operand (\verb:a:)
288together with the corresponding 2-bit quantity found in the
289low 2-bits of its second operand (\verb:b:).
290In general, the purpose of the half-operand modifiers
291in support of inductive doubling is to allow the processing
292of $n$-bit fields to easily expressed in terms of
293combination of the results determined by processing
294$n/2$ bit fields.
295
296The third facility of the inductive doubling architecture
297is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
298The field values of \verb#r=simd<n>::pack(a,b)# are
299defined by the following equations.
300%\singlespace
301\begin{eqnarray}
302r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
303r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
304\end{eqnarray}
305%\doublespace
306Here conv is a function which performs conversion of an $n$-bit
307value to an $n/2$ bit value by signed saturation (although
308conversion by unsigned saturation would also suit our purpose).
309
310Half-operand modifiers may also be used with the pack
311operations.  Thus packing with conversion by masking off all
312but the low $n/2$ bits of each field may be
313be performed using the operation \verb#simd<n>::pack<l,l>#
314
315
316The final facility of the inductive doubling architecture is
317a set of merging operations \verb#simd<n>::mergeh# and
318\verb#simd<n>::mergel# that produce $2n$-bit fields
319by concatenating corresponding $n$-bit fields from the
320operand registers.   The respective
321operations \verb#r=simd<n>::mergeh(a,b)# and
322\verb#s=simd<n>::mergel(a,b)#
323are defined by the following equations.
324%\singlespace
325\begin{eqnarray}
326r_{2n}[i] & = & a[i] \times 2^n + b[i] \\
327s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)]
328\end{eqnarray}
329%\doublespace
330Both SSE and Altivec provide versions of pack and merge operations
331for certain field widths.  The pack operations are provided
332with operands having 16-bit or 32-bit fields on each platform, although
333with some variation in how conversion is carried out.
334The merge operations are provided at 8-bit, 16-bit and 32-bit
335field widths on both architectures and also at the 64-bit level
336on SSE.
337
338This completes the description of IDISA.  As described, many of the
339features are already available with the SIMD facilities of
340existing commodity processors.  The extensions enumerated
341here are relatively straightforward.  The innovation
342is to specifically tackle the design of facilities to
343offer systematic support for transitions between power-of-2 field widths.
344As we shall show in the remainder of this paper, these facilities
345can dramatically reduce instruction count in core parallel bit
346stream algorithms, with a factor of 3 reduction being typical.
347
348\section{Evaluation Methodology}
349
350IDISA represents a set of instruction set features that
351could potentially be added to any SWAR processor.  The goal
352in this paper is to evaluate these features independent
353of artifacts that may be due to any particular realization,
354while still considering realistic models based on existing
355commodity instruction set architectures.  For the purpose
356of IDISA evaluation, then, we define two reference architectures.
357For concreteness, IDISA and the two reference architectures will
358each be considered as 128-bit processors employing the
359three-register SWAR model defined in the previous
360section.
361
362Reference architecture A (RefA) consists of a limited register
363processor providing a set of core binary operations defined
364for 8, 16, 32 and 64 bit fields.  The core binary operations
365will be assumed to be those defined by the SSE instruction
366set for 16-bit fields.  In addition, we assume that
367shift immediate operations for each field width exist,
368e.g., \verb#simd<8>::srli<1>(x)# for a right logical
369shift of each 8-bit field by 1.  We also assume that
370a constant load operation \verb#simd::constant<n>(c)#
371loads the constant value $c$ into each $n$ bit field.
372
373Reference architecture B (RefB) consists of a register-rich
374processor incorporating all the operations of reference
375architecture A as well as the following additional facilities
376inspired by the Altivec instruction set.
377For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
378logical instruction \verb#simd<n>::rotl(a,b)#  rotates each field
379of $a$ by the rotation count in the corresponding field of $b$.
380A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
381operator implements the logic $a \wedge b \vee \neg a \wedge c$, patterned
382after the Altivec \verb:vec_sel: operation.  Finally,
383a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
384permutation of bytes from the concatenation of $a$ and $b$ based
385on the set of indices in $c$.
386
387Two versions of IDISA are assessed against these reference
388architectures as follows.  IDISA-A has all the facilities
389of RefA extended with half-operand modifiers and all core
390operations at field widths of 2, 4 and 128.  IDISA-B is
391similarly defined and extended based on RefB.  Algorithms
392for both RefA and IDISA-A are assessed assuming that
393any required constants must be loaded as needed; this
394reflects the limited register assumption.  On the other,
395assessment for both RefB and IDISA-B will make the
396assumption that sufficiently many registers exist that
397constants can be kept preloaded.
398
399In each case, the processors are assumed to be
400pipelined processors with a throughput of one SWAR instruction
401each processor cycle for straight-line code free of memory
402access.  Such processors are certainly feasible with a
403suitable investment in circuitry, although practical designs
404may make sacrifices to achieve close to optimal throughput
405with less circuitry.  However, the assumption makes for
406straightforward performance evaluation based on instruction
407count for straight-line computational kernels.  Furthermore,
408the assumption also eliminates artifacts due to possibly different
409latencies in reference and IDISA architectures.  Because
410the same assumption is made for reference and IDISA
411architectures, determination of the additional circuit
412complexity due to IDISA features is unaffected by the
413assumption.
414
415In the remainder of this paper, then, IDISA-A and IDISA-B
416models are evaluated against their respective reference
417architectures on straight-line computational kernels
418used in parallel bit stream processing and other applications.
419As XML and other sequential text processing applications
420tend to use memory in an efficient streaming model, the
421applications tend to be compute-bound rather than IO-bound.
422Thus, the focus on computational kernels addresses the
423primary concern for performance improvement of these applications.
424
425The additional circuity complexity to realize IDISA-A and
426IDISA-B designs over their reference models will be
427addressed in the penultimate section.  That discussion
428will focus primarily on the complexity of implementing
429half-operand modifier logic, but will also address
430the extension of the core operations to operate on
4312-bit, 4-bit and 128-bit fields, as well.
432
433\section{Population Count}
434
435\begin{figure}[h]
436\begin{center}\small
437\begin{verbatim}
438c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
439c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
440c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
441c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
442c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
443\end{verbatim}
444\end{center}
445\caption{Population Count Reference Algorithm}
446\label{HD-pop}
447\end{figure}
448
449\begin{figure}
450\begin{center}\small
451\begin{verbatim}
452c = simd<2>::add<h,l>(x, x);
453c = simd<4>::add<h,l>(c, c);
454c = simd<8>::add<h,l>(c, c);
455c = simd<16>::add<h,l>(c, c);
456c = simd<32>::add<h,l>(c, c);
457\end{verbatim}
458\end{center}
459\caption{IDISA Population Count}
460\label{inductivepopcount}
461\end{figure}
462
463As an initial example to illustrate the principle of inductive doubling
464in practice, consider the problem of {\em population count}: determining
465the number of one bits within a particular bit field.  It is important
466enough for such operations as calculating Hamming distance to be included
467as a built-in instruction
468on some processors.  For example, the SPU of the Cell Broadband Engine
469has a SIMD population count instruction \verb:si_cntb: for simultaneously
470determining the
471number of 1 bits within each byte of a 16-byte register.
472In text processing with parallel bit streams, population count has direct
473application to keeping track of line numbers for error reporting, for example.
474Given a bit block identifying the positions of newline characters within
475a block of characters being processed, the population count of the
476bit block can be used to efficiently and conveniently be used to update
477the line number upon completion of block processing.
478
479Figure \ref{HD-pop} presents a traditional divide-and-conquer
480implementation for a 32-bit integer {\tt x} adapted from
481Warren \cite{HackersDelight}, while
482Figure \ref{inductivepopcount} shows the corresponding IDISA
483implementation for a vector of 32-bit fields.  Each implementation employs
484five steps of inductive doubling to produce population counts
485within 32 bit fields.  The traditional implementation employs
486explicit masking and shifting operations, while these
487operations are implicit within the semantics of the inductive
488doubling instructions shown in Figure \ref{inductivepopcount}.
489In each implementation, the first step determines the
490the population counts within 2-bit fields
491by adding the high bit of each such field to the low bit
492to produce a set of 2-bit counts in {\tt c}.
493In the second step, the counts within 4-bit fields of {\tt c} are determined
494by adding the counts of the corresponding high and low 2-bit subfields.
495Continuing in this fashion,
496the final population counts within 32-bit fields are determined in five steps.
497
498With the substitution of longer mask constants replicated for four
49932-bit fields, the implementation of Figure \ref{HD-pop} can be
500directly adapted to SWAR processing using 128-bit registers.
501Each binary operator is replaced by a corresponding binary
502SWAR operation.   Without the IDISA features, a
503straightforward RefA implementation of population count for
50432-bit fields thus employs 10 operations to load or generate
505mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
506total of 30 operations to complete the task.   Employing
507optimization identified by Warren, this can be reduced to
50820 operations, 5 of which are required to generate mask constants.
509At the cost of register pressure, it is possible that these constants
510could be kept preloaded in long vector processing.  In accord
511with our evaluation model, the RefB cost is thus 15 operations.
512As the IDISA implementation requires no constants at all,
513both the IDISA-A and IDISA-B cost is 5 operations.
514At our assumed one CPU cycle per instruction model, IDISA-A
515offers a 4X improvement over RefA, while IDISA-B offers a 3X
516improvement over its comparator.
517
518The pattern illustrated by population count is typical.
519An inductive doubling algorithm of $n$ steps typically applies
520mask or shift operations at each step for each of the
521two operands being combined in the step.  If there is
522one such operation for each operand, and the combination
523can be implemented in a single SWAR operation, then a total
524of $3n$ operations are the required in a RefB implementation,
525and possibly $4n$ for a RefA implementation including the
526cost of loading masks.  IDISA-A and IDISA-B implementations
527typically eliminate the explicit mask and shift operations
528through appropriate half-operand modifiers, reducing the
529total instruction count to $n$.   Thus a 3X to 4X improvement
530obtains in these cases.
531
532\section{Transposition to Parallel Bit Streams}
533
534In this section, we consider the first major
535application of IDISA: transposition of byte stream data to parallel bit stream
536form.   Of course, this operation is critical to the
537method of parallel bit streams and all applications
538of the method can benefit from a highly efficient
539transposition process.  Before considering how
540the IDISA supports this
541transposition process, however, we first consider
542algorithms on existing architectures.  Two algorithms
543are presented; the best of these requires 72
544SWAR operations under the RefB model to perform
545transposition of eight serial registers of byte stream data into
546eight parallel registers of bit stream data.
547
548We then go on to show how the transposition problem
549can be solved using IDISA-A or IDISA-B
550with a mere 24 three-register SIMD operations.  We also show
551that this is optimal for any three-register instruction set model.
552
553\begin{figure}[tbh]
554\begin{center}
555\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
556\caption{Input/Output Model for Serial to Parallel Transposition}
557\label{s2p-spec}
558\end{center}
559
560\end{figure}
561Figure \ref{s2p-spec} illustrates the input-output requirements of
562the transposition problem.  We assume that inputs and
563outputs are each SWAR registers of size $N=2^K$ bits.
564The input consists of $N$ bytes of serial byte data,
565stored consecutively in eight SIMD registers each holding
566$N/8$ bytes.  The output consists of eight parallel
567registers, one each for the eight individual bit positions
568within a byte.  Upon completion of the transposition process,
569each output register is to hold the $N$ bits corresponding
570to the selected bit position in the sequence of $N$ input
571bytes.
572
573\subsection{Bit Gathering Algorithm}
574
575% \begin{figure}[tbh]
576% \begin{center}
577% \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
578% \caption{Serial to Parallel Transposition Using Bit-Gathering}
579% \label{gather}
580% \end{center}
581% \end{figure}
582One straightforward algorithm for implementing the transposition process
583takes advantage of SWAR bit gathering operations that exist
584on some architectures.  This operation gathers one bit per byte
585from a particular position within each byte of a SIMD register.
586For example, the {\tt pmovmskb} operation of the Intel
587SSE instruction set forms a 16-bit mask
588consisting of the high bit of each byte.  Similarly, the
589{\tt \verb:si_gbb:} operation of the synergistic processing units of the
590Cell Broadband Engine gathers together the low bit of each byte
591from the SIMD register. 
592% Figure \ref{gather} illustrates the
593% bit gathering process.
594
595Using bit gathering, each bit stream of output is assembled 16 positions
596at a time.  Bits from the input register must be shifted into
597position, the gather operation performed and the result inserted
598into position in the output register.  For the 8 streams, this
599requires at least 22 operations for 16 positions, or 176 operations
600to complete the transposition task.
601
602\subsection{BytePack Algorithm}
603
604A more efficient transposition algorithm on commodity
605SWAR architectures involves three
606stages of binary division transformation.  This is similar
607to the three stage bit matrix inversion described by
608Warren  \cite{HackersDelight}, although modified to use SWAR operations.
609In each stage, input streams are divided into two half-length output streams.
610The first stage separates the bits at even numbered positions from those
611at odd number positions.  The two output streams from the first
612stage are then further divided in the second stage.
613The stream comprising even numbered bits from the original byte stream
614divides into one stream consisting of bits from positions 0 and 4 of each
615byte in the original stream and a second stream consisting of bits
616from positions 2 and 6 of each original byte.  The stream of bits from
617odd positions is similarly divided into streams for bits from each of the
618positions 1 and 5 and bits from positions 2 and 6.
619Finally, each of the four streams resulting from the second stage are
620divided into the desired individual bit streams in the third stage.
621
622% \begin{figure}[tbh]
623% \begin{center}\small
624% \begin{verbatim}
625% s0h = simd<16>::srli<8>(s0);
626% s0l = simd_and(s0, simd<16>::constant(0x00FF));
627% s1h = simd<16>::srli<8>(s1);
628% s1l = simd_and(s1, simd<16>::constant(0x00FF));
629% t0 = simd<16>::pack(s0h, s1h);
630% t1 = simd<16>::pack(s0l, s1l);
631% t0_l1 = simd<16>::slli<1>(t0);
632% t0_r1 = simd<16>::srli<1>(t1);
633% mask = simd<8>::constant(0xAA);
634% p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask));
635% p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask));
636% \end{verbatim}
637% \end{center}
638% \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
639% \label{s2pstep}
640% \end{figure}
641%
642
643\begin{figure}[tbh]
644\begin{center}\small
645\begin{verbatim}
646even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
647odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
648mask = simd<8>::constant(0xAA);
649t0 = simd<8>::permute(s0, s1, even);
650t1 = simd<8>::permute(s0, s1, odd);
651p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
652p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
653\end{verbatim}
654\end{center}
655\caption{RefB Transposition Step in BytePack Stage 1}
656\label{s2pstep}
657\end{figure}
658
659The binary division transformations are accomplished in each stage
660using byte packing, shifting and masking.  In each stage, a
661transposition step combines each pair of serial input registers to
662produce a pair of parallel output registers. 
663Figure \ref{s2pstep} shows a stage 1 transposition step in a
664Ref B implementation.  Using the permute facility, the even
665and odd bytes, respectively, from two serial input registers
666\verb:s0: and \verb:s1: are packed into temporary registers
667\verb:t0: and \verb:t1:.  The even and odd bits are then
668separated into two parallel output registers \verb:p0: and \verb:p1:
669by selecting alternating bits using a mask.  This step is applied
670four times in stage 1; stages 2 and 3 also consist of four applications
671of a similar step with different shift and mask constants.
672Overall, 6 operations per step are required, yielding a total
673of 72 operations to transpose 128 bytes to parallel bit stream
674form in the RefB implementation.
675
676In a RefA implementation, byte packing may also be achieved
677by the \verb#simd<16>::pack# with 4 additional operations to
678prepare operands.  Essentially, the RefB implementation
679uses single permuite instructions to implement the equivalent of
680\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
681The RefA implementation also requires 3 logic operations to implement
682each \verb#simd_if#.
683Assuming that mask loads are only need once per 128 bytes,
684a total of 148 operations are required in the RefB implementation.
685
686\subsection{Inductive Halving Algorithm}
687
688Using IDISA, it is possible to design
689a transposition algorithm that is both easier to understand and requires
690many fewer operations than the the bytepack algorithm described above.
691We call it the inductive halving algorithm for serial to parallel
692transposition, because it proceeds by reducing byte streams to
693two sets of nybble streams in a first stage, dividing the nybble
694streams into streams of bitpairs in a second stage and finally
695dividing the bitpair streams into bit streams in the third stage.
696
697\begin{figure}[tbh]
698\small
699\begin{verbatim}
700p0 = simd<8>::pack<h,h>(s0, s1);
701p1 = simd<8>::pack<l,l>(s0, s1);
702\end{verbatim}
703\caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
704\label{halvingstep}
705\end{figure}
706
707Figure \ref{halvingstep} shows one step in stage 1 of the inductive
708halving algorithm, comprising just two IDISA-A operations.
709The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
710from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
711the low nybble of each byte.  As in the bytepack algorithm, this step is
712applied 4 times in stage 1, for a total of 8 operations.
713
714Stage 2 of the inductive halving algorithm reduces nybble streams
715to streams of bit pairs.  The basic step in this algorithm consists
716of one \verb#simd<4>::pack<h,h># operation to extract the high pair
717of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
718low pair of each nybble.  Four applications of this step complete stage 2.
719
720Stage 3 similarly uses four applications of a step that uses a
721\verb#simd<2>::pack<h,h># operation to extract the high bit of
722each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
723each pair.  The complete algorithm to transform eight serial
724byte registers s0 through s7 into the eight parallel bit stream
725registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
726Under either IDISA-A or IDISA-B models, a mere 24 operations per 128
727input bytes is required.
728
729\begin{figure}[tbh]
730\small
731\begin{verbatim}
732hnybble0 = simd<8>::pack<h,h>(s0, s1);
733lnybble0 = simd<8>::pack<l,l>(s0, s1);
734hnybble1 = simd<8>::pack<h,h>(s2, s3);
735lnybble1 = simd<8>::pack<l,l>(s2, s3);
736hnybble2 = simd<8>::pack<h,h>(s4, s5);
737lnybble2 = simd<8>::pack<l,l>(s4, s5);
738hnybble3 = simd<8>::pack<h,h>(s6, s7);
739lnybble3 = simd<8>::pack<l,l>(s6, s7);
740hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
741hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
742lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
743ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
744hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
745hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
746lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
747ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
748bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
749bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
750bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
751bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
752bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
753bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
754bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
755bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
756\end{verbatim}
757\caption{Complete Inductive Halving Algorithm}
758\label{halvingalgorithm}
759\end{figure}
760
761\subsection{Optimality of the Inductive Halving Algorithm}
762
763Here we show that the inductive halving algorithm presented in
764the previous subsection is optimal in the following sense:
765no other algorithm on any 3-register SIMD architecture can use
766fewer than 24 operations to transform eight serial registers
767of byte stream data into eight parallel registers of bit stream data.
768By 3-register SIMD architecture, we refer to any architecture
769that uses SIMD instructions consistent with our overall model of
770binary operations using two input register operands to produce
771one output register value.
772
773Observe that the $N$ data bits from each input register must be
774distributed $N/8$ each to the 8 output registers by virtue of
775the problem definition.  Each output register can effectively
776be given a 3-bit address; the partitioning problem can be viewed
777as moving data to the correct address.  However, each
778operation can move results into at most one register. 
779At most this can result in the assignment of one correct address
780bit for each of the $N$ input bits.  As all $8N$ input bits
781need to be moved to a register with a correct 3-bit address,
782a minimum of 24 operations is required.
783
784\subsection{End-to-End Significance}
785
786In a study of several XML technologies applied to
787the problem of GML to SVG transformation, the parabix
788implementation (parallel bit streams for XML) was
789found to the fastest with a cost of approximately
79015 CPU cycles per input byte \cite{Herdy}.  Within parabix,
791transposition to parallel bit stream form requires
792approximately 1.1 cycles per byte \cite{CASCON08}.
793All other things being equal, a 3X speed-up of transposition
794alone would improve end-to-end performance in a
795complete XML processing application by more than 4\%.
796
797
798\section{Parallel to Serial Conversion}
799
800Parallel bit stream applications may apply string editing
801operations in bit space to substitute, delete or insert
802parallel sets of bits at particular positions.  In such cases,
803the inverse transform that converts a set of parallel bit
804streams back into byte space is needed.  In the example of
805UTF-8 to UTF-16 transcoding, the inverse transform is
806actually used twice for each application of the forward
807transform, to separately compute the high and low byte
808streams of each UTF-16 code unit.  Those two byte streams
809are subsequentely merged to form the final result.
810
811Algorithms for performing the inverse transform mirror those
812of the forward transform, employing SIMD merge operations
813in place of pack operations.  The best algorithm known
814to us on the commodity SIMD architectures takes advantage
815of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
816operations that are available with each of the SSE and Altivec instruction
817sets.  To perform the full inverse transform of 8 parallel
818registers of bit stream data into 8 serial registers of byte stream data,
819a RefA implementation requires 120 operations, while a RefB
820implementation reduces this to 72.
821
822\begin{figure}[tbh]
823\begin{center}\small
824\begin{verbatim}
825bit01_r0 = simd<1>::mergeh(bit0, bit1);
826bit01_r1 = simd<1>::mergel(bit0, bit1);
827bit23_r0 = simd<1>::mergeh(bit2, bit3);
828bit23_r1 = simd<1>::mergel(bit2, bit3);
829bit45_r0 = simd<1>::mergeh(bit4, bit5);
830bit45_r1 = simd<1>::mergel(bit4, bit5);
831bit67_r0 = simd<1>::mergeh(bit6, bit7);
832bit67_r1 = simd<1>::mergel(bit6, bit7);
833bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
834bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
835bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
836bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
837bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
838bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
839bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
840bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
841s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
842s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
843s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
844s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
845s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
846s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
847s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
848s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
849\end{verbatim}
850\end{center}
851\label{p2s-inductive}
852\caption{Parallel to Serial Transposition by Inductive Doubling}
853\end{figure}
854
855An algorithm employing only 24 operations using IDISA-A/B is relatively
856straightforward.. In stage 1, parallel registers for individual bit streams
857are first merged with bit-level interleaving
858using \verb#simd<1>::mergeh# and \verb#simd<8>::mergel#
859operations.  For each of the four pairs of consecutive
860even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
861two consecutive registers of bitpair data are produced.
862In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
863are then applied to merge to bitpair streams to produce streams
864of nybbles for the high and low nybble of each byte.  Finally,
865the nybble streams are merged in stage 3 to produce the
866desired byte stream data.  The full inductive doubling
867algorithm for parallel to serial transposition is shown in Figure
868\ref{p2s-inductive}.
869
870This algorithm is again optimal, requiring the fewest operations
871of any possible algorithm using any 3-register instruction set
872model.  The proof directly follows that for the serial to parallel
873transposition.
874
875The existence of high-performance algorithms for transformation of
876character data between byte stream and parallel bit stream form
877in both directions makes it possible to consider applying these
878transformations multiple times during text processing applications.
879Just as the time domain and frequency domain each have their
880use in signal processing applications, the byte stream form and
881parallel bit stream form can then each be used at will in
882character stream applications.
883
884
885
886\section{Parallel Bit Deletion}
887
888\begin{figure*}[tbh]
889\begin{center}
890\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
891\hline
892\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
893\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
894\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
895\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
896\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
897\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
898\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
899\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
900\end{tabular}
901\end{center}
902\label{centraldelstep}
903\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
904\end{figure*}
905
906Parallel bit deletion is an important operation that allows string
907editing operations to be carried out while in parallel bit stream
908form.  It is also fundamental to UTF-8 to UTF-16 transcoding
909using parallel bit streams, allowing the excess code unit
910positions for UTF-8 two-, three- and four-byte sequences to
911be deleted once the sixteen parallel bit streams of UTF-16 have
912been computed \cite{PPoPP08}.
913
914Parallel bit deletion is specified using a deletion mask.
915A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
916to be deleted and 0s at positions identifying bits to be retained.
917For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
918bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
919accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
920
921Bit deletion may be performed using
922the parallel-prefix compress algorithm documented by
923Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
924only logic and shifts with a constant parameter to carry
925out the deletion process.  However, it requires $k^2$
926preprocessing steps for a final field width parameter
927of size $2^k$, as well as 4 operations per deletion step
928per stream.  Using the inductive doubling instruction set architecture
929it is possible to carry out bit deletion much more efficiently.
930
931Deletion within fixed size fields or registers may produce results that are either
932left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
933eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
934care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
935left-justified result produces an important intermediate form known as a
936{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
937right-justified and left-justified results from the application of the
9384-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
939stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
940the central result \verb:xxbdefhx:, which may easily be converted to a either a
941left or a right justified 8-element result by an appropriate shift operation.
942
943
944The observation about how two $n$-bit central deletion results can
945combine to yield a $2n$ central deletion result provides the basis
946for an inductive doubling algorithm.  Figure \ref{centraldelstep}
947illustrates the inductive process for the transition from 8-bit central
948deletion results to 16-bit central deletion results.  The top row shows
949the original deletion mask, while the second row shows the original
950bit stream to which deletions are to be applied, with deleted bits zeroed out.
951The third row shows the central result for each 8-bit field as the
952result of the previous inductive step.
953
954To perform the $8 \rightarrow 16$ central deletion step, we first form
955the population counts of 4-bit fields of the original deletion mask as
956shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
957an 8-bit central result, we perform a right shift by the population count
958of the low half of the field.  Similarly,
959left-justification requires a left-shift by the population count in the
960high half of the field.
961
962The left and right shifts can be performed simultaneously using a rotate
963left instruction.  Right justification by shifting an $n$ bit field
964$i$ positions to the right is equivalent to a left rotate of $n-i$
965positions.  These rotation amounts are computed by the operation
966\verb#rj=simd<8>::sub<x,l>(simd<8>::constant(8), cts_4)# as shown in row 5,
967except that don't care fields (which won't be subsequently used)
968are marked \verb:XX:. 
969
970The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
971as shown in row 6, and are combined with the right shift amounts
972by the selection operation  \verb#rot_8=simd_if(simd<16>::constant(0xFF00), rj, lj)#
973as shown in row 7.  Using these computed values, the inductive step
974is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
975as shown in row 8.
976
977At each inductive doubling level, it requires 4 operations to compute the
978required deletion infomation and one operation per bit stream to perform deletion.
979Note that, if deletion is to be applied to a set of eight parallel bit streams,
980the computed deletion information is used for each stream without recomputation,
981thus requiring 12 operations per inductive level.
982
983In comparison to the parallel-prefix compress method, the method of central
984deletion results using the inductive doubling architecture has far fewer operations.
985The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
986induction versus $4k^2$ for the parallel-prefix method.  Using the computed
987deletion operation, only a single SIMD rotate operation per bit stream
988per level is needed, in comparison with 4 operations per level for parallel-prefix
989compress.
990
991
992
993\section{Beyond Parallel Bit Streams}
994
995IDISA has a variety of applications in domains beyond
996text processing with parallel bit streams.  We present
997a number of examples in this section, including,
998most significantly, a full general solution to the problem of supporting
999{\em horizontal} SWAR operations.
1000
1001\subsection{Parity}
1002
1003\begin{figure}
1004\begin{center}\small
1005\begin{verbatim}
1006y = x ^ (x >> 1);
1007y = y ^ (y >> 2);
1008y = y ^ (y >> 4);
1009y = y ^ (y >> 8);
1010y = y ^ (y >>16);
1011y = y & 1;
1012\end{verbatim}
1013\end{center}
1014\caption{Parity Reference Algorithm}
1015\label{HD-parity}
1016\end{figure}
1017
1018\begin{figure}
1019\begin{center}\small
1020\begin{verbatim}
1021y = simd<2>::xor<h,l>(x, x);
1022y = simd<4>::xor<h,l>(y, y);
1023y = simd<8>::xor<h,l>(y, y);
1024y = simd<16>::xor<h,l>(y, y);
1025y = simd<32>::xor<h,l>(y, y);
1026\end{verbatim}
1027\end{center}
1028\caption{IDISA Parity Implementation}
1029\label{ID-parity}
1030\end{figure}
1031
1032Figures \ref{HD-parity} shows Warren's for parity
1033employing 11 operations, giving rise to a straightforward
1034RefA implementation also using 11 operations. 
1035Parity has important applications for error-correcting
1036codes such as the various Hamming codes for detecting
1037and correcting numbers of bit errors dependent on the
1038number of parity bits added. 
1039
1040Figure \ref{ID-parity} shows the IDISA parity implementation
1041with only 5 operations required.  This represents slightly
1042more than a 2X improvement.  The improvement is less than
10433X seen in other cases because one of the operands need
1044not be modified before applying the exclusive-or operation.
1045
1046\subsection{Packed DNA Representation}
1047
1048DNA sequences are often represented as strings consisting
1049of the four nucleotide codes A, C, G and T.  Internally,
1050these sequences are frequently represented in internal
1051form as packed sequences of 2-bit values.  The IDISA
1052\verb#simd<8>:pack# and \verb#simd<4>:pack# operations
1053allow these packed representations to be quickly computed
1054from byte-oriented string values by two steps of inductive
1055halving.   Similarly, conversion back to string form
1056can use two steps of inductive merging.  Without direct
1057support for these pack and merge operations, the SWAR
1058implementations of these conversions require the cost
1059of explicit masking and shifting in combination with
1060the 16-bit to 8-bit packing and 8-bit to 16-bit
1061merging operations supported by existing SWAR facilities
1062on commodity processors.
1063
1064\subsection{Bit Reverse}
1065
1066Bit reverse is an important operation needed in a number
1067of low level codecs.  Following Warren's inductive
1068doubling implementation using masks and shifts \cite{HackersDelight},
1069a RefA implementation on 32-bit fields requires 28
1070operations, while a straightforward IDISA-A implementation
1071using \verb#simd<n>::rotli# at each inductive doubling
1072level requires only 5 operations.
1073
1074\subsection{String/Decimal/Integer Conversion}
1075\begin{figure}
1076\begin{center}\small
1077\begin{verbatim}
1078b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
1079b=(d & 0x00FF00FF) + 100 * ((d >> 8) & 0x00FF00FF);
1080b=(d & 0x0000FFFF) + 10000 * (d >> 16);
1081\end{verbatim}
1082\end{center}
1083\caption{BCD to Integer Reference Algorithm}
1084\label{BCD2int}
1085\end{figure}
1086
1087\begin{figure}
1088\begin{center}\small
1089\begin{verbatim}
1090t1=simd<8>:const(10);
1091t2=simd<16>:const(100);
1092t3=simd<32>:const(10000);
1093b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d);
1094b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b);
1095b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b);
1096\end{verbatim}
1097\end{center}
1098\caption{IDISA BCD to Integer}
1099\label{ID-BCD2int}
1100\end{figure}
1101
1102Just as DNA sequences represent an important use case for
1103SWAR operations on 2-bit fields, packed sequences of
1104decimal or hexadecimal digits represent a common use case
1105for 4-bit fields.  These representations can be used
1106both as an intermediate form in numeric string to integer
1107conversion and as a direct representation for
1108packed binary coded decimal.
1109
1110Figure \ref{BCD2int} shows a three-step inductive
1111doubling implementation for conversion of 32-bit packed BCD
1112values to integer form.  The 32-bit value consists
1113of 8 4-bit decimal digits.  Pairs of digits are
1114first combined by multiplying the higher digit
1115of the pair by 10 and adding.  Pairs of these
1116two-digit results are then further combined by
1117multiplying the value of the higher of the two-digit
1118results by 100 and adding.  The final step is
1119to combine four-digit results by multiplying the
1120higher one by 10000 and adding.  Overall, 20
1121operations are required for this implementation
1122as well as the corresponding SWAR implementation
1123for sets of 32-bit fields.  Preloading of 6 constants
1124into registers for repeated use can reduce the number of
1125operations to 14 at the cost of significant register
1126pressure.
1127
1128The IDISA implementation of this algorithm is shown
1129in Figure \ref{ID-BCD2int}.  This implementation
1130shows an interesting variation in the use of
1131half-operand modifiers, with only one operand
1132of each of the addition and multiplication operations
1133modified at each level.  Overall, this implementation
1134requires 9 operations, or 6 operations with 3
1135preloaded constants.  This represents more than a 2X
1136reduction in instruction count as well as a 2X reduction
1137in register pressure.
1138
1139
1140\subsection{Further Applications}
1141
1142Further applications of inductive doubling can be
1143found in integer contraction and dilation for quadtrees and
1144octtrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
1145Pixel packing from 32 bit fields into a 5:5:5 representation
1146is a further application of parallel bit deletion.
1147
1148\subsection{Systematic Support for Horizontal SIMD Operations}
1149
1150In SIMD parlance, {\em horizontal} operations are
1151operations which combine values from two or more fields
1152of the same register, in contrast to the normal
1153{\em vertical} operations which combine corresponding
1154fields of different registers.  Horizontal operations
1155can be found that combine two (e.g., \verb:haddpd: on SSE3),
1156four (e.g, \verb:si_orx: on SPU), eight (e.g, \verb:psadbw: on SSE)
1157or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
1158horizontal operations have a vertical component as well.
1159For example, \verb:psadbw: first forms the absolute value of
1160the difference of eight corresponding byte fields before
1161performing horizontal add of the eight values, while
1162\verb:vsum4ubs: on Altivec performs horizontal add of sets of
1163four unsigned 8-bit fields within one register
1164and then combines the result horizontally with
1165corresponding 32-bit fields of a second registers.
1166
1167The space of potential horizontal operations thus has
1168many dimensions, including not only the particular
1169combining operation and the operand field width, but
1170also the number of fields being combined, whether a
1171vertical combination is applied and whether it is applied
1172before or after the horizontal operation and what the
1173nature of the vertical combining operation is.
1174Within this space, commodity SIMD architectures tend
1175to support only a very few combinations, without any
1176particular attempt at systematic support for horizontal
1177operations in general.
1178
1179In contrast to this {\em ad hoc} support on commodity
1180processors, IDISA offers a completely systematic treatment
1181of horizontal operations without any special features beyond
1182the inductive doubling features already described.
1183In the simplest case, any vertical operation
1184\verb#simd<n>::F# on $n$-bit fields gives rise to
1185an immediate horizontal operation
1186\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
1187pairs of $n/2$ bit fields.
1188For example, \verb#simd<16>::add<h,l># adds values
1189in adjacent 8 bit fields to produce 16 bit results,
1190while \verb#simd<32>::min<h,l># can produce the
1191minimum value of adjacent 16-bit fields.
1192Thus any binary horizontal operation can be implemented
1193in a single IDISA instruction making use of the \verb:<h,l>:
1194operand modifier combination.
1195
1196Horizontal combinations of four adjacent fields can also be
1197realized in a general way through two steps of inductive
1198doubling.  For example, consider the or-across operation \verb:si_orx:
1199of the SPU, that performs a logical or operation
1200on four 32-bit fields.  This four field combination
1201can easily be implemented with the following two operations.
1202%\begin{singlespace}
1203\begin{verbatim}
1204t = simd<64>::or<h,l>(x, x)
1205t = simd<128>::or<h,l>(t, t)
1206\end{verbatim}
1207%\end{singlespace}
1208
1209In general, systematic support for horizontal
1210combinations of sets of $2^h$ adjacent fields may
1211be realized through $h$ inductive double steps
1212in a similar fashion.
1213Thus, IDISA esssentially offers systematic support
1214for horizontal operations entirely through the
1215use of \verb:<h,l>: half-operand modifier
1216combinations.
1217
1218Systematic support for general horizontal operations
1219under IDISA also creates opportunity for a design tradeoff:
1220offsetting the circuit complexity of half-operand
1221modifiers with potential elimination of dedicated
1222logic for some {/ad hoc} horizontal SIMD operations.
1223Even if legacy support for these operations is required,
1224it may be possible to provide that support through
1225software or firmware rather than a full hardware
1226implementation.  Evaluation of these possibilities
1227in the context of particular architectures is a potential
1228area for further work.
1229
1230
1231\section{Implementation}
1232
1233IDISA may be implemented as a software
1234abstraction on top of existing SWAR architectures or
1235directly in hardware.  In this section, we briefly
1236discuss implementation of IDISA libraries before
1237moving on to consider hardware design.  Although
1238a full realization of IDISA in hardware is beyond our
1239current capabilities, our goal is to develop a sufficiently
1240detailed design to assess the costs of IDISA implementation
1241in terms of the incremental complexity over the RefA
1242and RefB architectures.  The cost factors we consider, then,
1243are the implementation of the half-operand modifiers, and
1244the extension of core operations to the 2-bit, 4-bit and
1245128-bit field widths.  In each case, we also discuss
1246design tradeoffs.
1247
1248\subsection{IDISA Libraries}
1249
1250Implementation of IDISA instructions using template
1251and macro libraries has been useful in developing
1252and assessing the correctness of many of the algorithms
1253presented here.  Although these implementations do not
1254deliver the performance benefits associated with
1255direct hardware implementation of IDISA, they
1256have been quite useful in providing a practical means
1257for portable implementation of parallel bit stream
1258algorithms on multiple SWAR architectures.  However,
1259one additional facility has also proven necessary for
1260portability of parallel bit stream algorithms across
1261big-endian and little-endian architectures: the
1262notion of shift-forward and shift-back operations.
1263In essence, shift forward means shift to the left
1264on little-endian systems and shift to the right on
1265big-endian systems, while shift back has the reverse
1266interpretation.  Although this concept is unrelated to
1267inductive doubling, its inclusion with the IDISA
1268libraries has provided a suitable basis for portable
1269SIMD implementations of parallel bit stream algorithms.
1270Beyond this, the IDISA libraries have the additional
1271benefit of allowing the implementation
1272of inductive doubling algorithms at a higher level
1273abstraction, without need for programmer coding of
1274the underlying shift and mask operations.
1275
1276\subsection{IDISA Model}
1277\begin{figure}[tbh]
1278\begin{center}
1279\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
1280\caption{IDISA Block Diagram}
1281\label{pipeline-model}
1282\end{center}
1283\end{figure}
1284
1285Figure \ref{pipeline-model} shows a block diagram
1286for a pipelined SIMD processor implementing IDISA.
1287The SIMD Register File (SRF) provides a file of $R = 2^A$ 
1288registers each of width $N = 2^K$ bits. 
1289IDISA instructions identified by the Instruction Fetch
1290Unit (IFU) are forwarded for decoding to the SIMD
1291Instruction Decode Unit (SIDU).  This unit decodes
1292the instruction to produce
1293signals identifying the source and destination
1294operand registers, the half-operand modifiers, the
1295field width specification and the SIMD operation
1296to be applied.
1297
1298The SIDU supplies the source register information and the half-operand
1299modifier information to the SIMD Operand Fetch Unit (SOFU).
1300For each source operand, the SIDU provides an $A$-bit register
1301address and two 1-bit signals $h$ and $l$ indicating the value
1302of the decoded half-operand modifiers for this operand.
1303Only one of these values may be 1; both are 0 if
1304no modifier is specified.
1305The SIDU also supplies decoded field width signals $w_k$
1306for each field width $2^k$ to both the SOFU and to the
1307SIMD Instruction Execute Unit (SIEU).  Only one of the
1308field width signals has the value 1.
1309The SIDU also supplies decoded SIMD opcode information to SIEU and
1310a decoded $A$-bit register address for the destination register to
1311the SIMD Result Write Back Unit (SRWBU).
1312
1313The SOFU is the key component of the IDISA model that
1314differs from that found in a traditional SWAR
1315processor.  For each of the two $A$-bit source
1316register addresses, SOFU is first responsible for
1317fetching the raw operand values from the SRF.
1318Then, before supplying operand values to the
1319SIEU, the SOFU applies the half-operand modification
1320logic as specified by the $h$, $l$, and field-width
1321signals.  The possibly modified operand values are then
1322provided to the SIEU for carrying out the SIMD operations.
1323A detailed model of SOFU logic is described in the following
1324subsection.
1325
1326The SIEU differs from similar execution units in
1327current commodity processors primarily by providing
1328SIMD operations at each field width
1329$n=2^k$ for $0 \leq k \leq K$.  This involves
1330additional circuitry for field widths not supported
1331in existing processors.  In our evaluation model,
1332IDISA-A adds support for 2-bit, 4-bit and 128-bit
1333field widths in comparison with the RefA architecture,
1334while IDISA-B similarly extends RefB.
1335
1336When execution of the SWAR instruction is
1337completed, the result value is then provided
1338to the SRWBU to update the value stored in the
1339SRF at the address specified by the $A$-bit
1340destination operand.
1341
1342\subsection{Operand Fetch Unit Logic}
1343
1344The SOFU is responsible for implementing the half-operand
1345modification logic for each of up to two input operands fetched
1346from SRF.  For each operand, this logic is implemented
1347using the decoded half-operand modifiers signals $h$ and $l$,
1348the decoded field width signals $w_k$ and the 128-bit operand
1349value $r$ fetched from SRF to produce a modified 128-bit operand
1350value $s$ following the requirements of equations (4), (5) and
1351(6) above.  Those equations must be applied for each possible
1352modifier and each field width to determine the possible values $s[i]$
1353for each bit position $i$.  For example, consider bit
1354position 41, whose binary 7-bit address is $0101001$.
1355Considering the address bits left to right, each 1 bit
1356corresponds to a field width for which this bit lies in the
1357lower $n/2$ bits (widths 2, 16, 64), while each 0 bit corresponds to a field
1358width for which this bit lies in the high $n/2$ bits.
1359In response to the half-operand modifier signal $h$,
1360bit $s[41]$ may receive a value from the corresponding high bit in the field
1361of width 2, 16 or 64, namely $r[40]$,
1362$r[33]$ or $r[9]$.   Otherwise, this bit receives the value $r[41]$,
1363in the case of no half-operand modifier, or a low half-operand modifier
1364in conjunction with a field width signal $w_2$, $w_{16}$ or $w_{64}$.
1365The overall logic for determining this bit value is thus given as follows.
1366\begin{eqnarray*}
1367s[41] & = & h \wedge (w_2 \wedge r[40] \vee w_{16} \wedge r[33] \vee w_{64} \wedge r[9]) \\
1368& & \vee \neg h \wedge (\neg l \vee w_2 \vee w_{16} \vee w_{64}) \wedge r[41]
1369\end{eqnarray*}
1370
1371Similar logic is determined for each of the 128 bit positions.
1372For each of the 7 field widths, 64 bits are in the low $n/2$ bits,
1373resulting in 448 2-input and gates for the $w_k \wedge r[i]$ terms.
1374For 120 of the bit positions, or gates are needed to combine these
1375terms; $441 -120 = 321$ 2-input or gates are required.  Another
1376127 2-input and gates combine these values with the $h$ signal.
1377In the case of a low-half-operand modifier, the or-gates combining $w_k$
1378signals can share circuitry.  For each bit position $i=2^k+j$ one
1379additional or gate is required beyond that for position $j$.
1380Thus 127 2-input or gates are required.  Another 256 2-input and gates
1381are required for combination with the $\neg h$  and $r[i]$ terms.  The terms for
1382the low and high half-operand modifiers are then combined with an
1383additional 127 2-input or gates.   Thus, the circuity complexity
1384for the combinational logic implementation of half-operand
1385modifiers within the SOFU is 1279 2-input gates per operand,
1386or 2558 gates in total.
1387
1388The gate-level complexity of half-operand modifiers as described is nontrivial,
1389but modest.   However, one possible design tradeoff is to
1390differentiate the two operands, permitting a high half-operand
1391modifier to be used only with the first operand and a low-modifier
1392to be used only with the second operand.  This would exclude
1393\verb#<h,h># and \verb#<l,l># modifier combinations and also
1394certain combinations for noncommutative core operations. 
1395The principal
1396consequence for the applications considered here would be with
1397respect to the pack operations in forward transposition, but it
1398may be possible to address this through SIEU circuitry.
1399If this approach were to be taken, the gate complexity of
1400half-operand modification would be reduced by slightly more than half.
1401
1402\subsection{2-Bit and 4-Bit Field Widths}
1403
1404Beyond the half-operand modifiers, extension of core SWAR
1405operations to 2-bit and 4-bit field widths is critical to
1406inductive doubling support.  The principal operations
1407that need to be supported in this way are addition, pack, merge
1408merge, and rotate. 
1409
1410Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
1411as a modification to that for 8-bit fields by incorporating logic to
1412disable carry propagation at the 16 midfield boundaries.  For 2-bit
1413fields, disabling carry propagation at 32 additional boundaries suffices,
1414although it may be simpler to directly implement the simple logic
1415of 2-bit adders.
1416
1417Pack and merge require bit selection logic for each field width.
1418A straightforward implementation model for each operation
1419uses 128 2-input and gates to select the desired bits from the
1420operand registers and another 128 2-input or gates to feed
1421these results into the destination register.
1422
1423Rotation for 2-bit fields involves simple logic for selecting
1424between the 2 bits of each field of the operand being
1425rotated on the basis of the low bit of each field of the
1426rotation count.  Rotation for 4-bit fields is more complex,
1427but can also be based on 1-of-4 selection circuitry
1428involving the low 2 bits of the rotation count fields.
1429
1430\subsection{128-Bit Field Widths}
1431
1432For completeness, the IDISA model requires core operations
1433to be implemented at the full register width, as well
1434as power-of-2 partitions.  This may be problematic for
1435operations such as addition due to the inherent delays
1436in 128-bit carry propagation.  However, the primary
1437role of 128 bit operations in inductive doubling
1438is to combine two 64-bit fields using \verb#<h,l>#
1439operand combinations.  In view of this, it may be
1440reasonable to define hardware support for such
1441combinations to be based on 64-bit logic, with support
1442for 128-bit logic implemented through firmware or software.
1443
1444\subsection{Final Notes and Further Tradeoffs}
1445
1446In order to present IDISA as a concept for design extension of
1447any SWAR architecture, our discussion of gate-level implementation
1448is necessarily abstract.  Additional circuitry is sure to be
1449required, for example, in implementation of SIDU.  However,
1450in context of the 128-bit reference architectures studied,
1451our analysis suggests realistic IDISA implementation well
1452within a 10,000 gate budget.
1453
1454However, the additional circuitry required may be offset
1455by elimination of special-purpose instructions found in
1456existing processors that could instead be implemented through efficient
1457IDISA sequences.  These include examples such as population
1458count, count leading and/or trailing zeroes and parity.
1459They also include specialized horizontal SIMD operations.
1460Thus, design tradeoffs can be made with the potential of
1461reducing the chip area devoted to special purpose instructions
1462in favor of more general IDISA features.
1463
1464\section{Conclusions}
1465
1466In considering the architectural support for
1467SIMD text processing using the method of parallel bit streams,
1468this paper has presented the principle of inductive doubling
1469and a realization of that principle in the form of
1470IDISA, a modified SWAR instruction set architecture.
1471IDISA offers support for SWAR operations at all power-of-2
1472field widths, including 2-bit and 4-bit widths, in particular,
1473as well as half-operand modifiers and pack and merge operations
1474to support efficient transition between successive power-of-two
1475field widths.  The principle innovation is the notion of
1476half-operand modifiers that eliminate the cost associated
1477with the explicit mask and shift operations required for
1478such transitions.
1479
1480Several algorithms key to parallel bit stream methods
1481have been examined and shown to benefit from dramatic
1482reductions in instruction count compared to the best
1483known algorithms on reference architectures.  This
1484includes both a reference architecture modeled on
1485the SWAR capabilities of the SSE family as well as
1486an architecture incorporating the powerful permute
1487or shuffle capabilities found in Altivec or Cell BE processors.
1488In the case of transposition algorithms to and from parallel bit stream
1489form, the architecture has been shown to make possible
1490straightforward inductive doubling algorithms with a 3X
1491speedup over the best known versions on permute-capable
1492reference architectures, achieving the lowest total number
1493of operations of any possible 3-register SWAR architecture.
1494
1495Applications of IDISA in other areas have also been
1496examined.  The support for 2-bit and 4-bit field widths
1497in SWAR processing is beneficial for packed DNA representations
1498and packed decimal representations respectively.  Additional
1499inductive doubling examples are presented and the phenomenon
1500of power-of-2 transitions discussed more broadly.
1501Most significantly, IDISA supports a fully general approach
1502to horizontal SWAR operations, offering a considerable
1503improvement over the {\em ad hoc} sets of special-purpose
1504horizontal operations found in existing SWAR instruction sets.
1505
1506An IDISA implementation model has been presented employing
1507a customized operand fetch unit to implement the half-operand
1508modifier logic.  Gate-level implementation of this unit
1509and operations at the 2-bit and 4-bit field widths have
1510been analyzed and found to be quite reasonable within
1511a 10,000 gate bound.  Design tradeoffs to reduce the cost
1512have also been discussed, possibly even leading to a
1513net complexity reduction through elimination of instructions
1514that implement special-case versions of inductive doubling.
1515
1516Future research may consider the extension of inductive doubling
1517support in further ways.   For example, it may be possible
1518to develop a pipelined architecture supporting two or three
1519steps of inductive doubling in a single operation. 
1520
1521%\appendix
1522%\section{Appendix Title}
1523%
1524%This is the text of the appendix, if you need one.
1525
1526\acks
1527
1528This research was supported by a Discovery Grant from the
1529Natural Sciences and Engineering Research Council of Canada.
1530
1531%\bibliographystyle{plainnat}
1532\bibliographystyle{plain}
1533\bibliography{asplos094-cameron}
1534%\begin{thebibliography}{}
1535
1536%\bibitem{smith02}
1537%Smith, P. Q. reference text
1538
1539%\end{thebibliography}
1540
1541
1542\end{document}
Note: See TracBrowser for help on using the repository browser.