%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
           {School of Computing Science, Simon Fraser University}
           {\{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching.  Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task.  A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique.  These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement.  The extensions are also shown to be useful
for SIMD programming in other application areas, including providing a
systematic treatment for horizontal operations.  An implementation model for
these extensions
involves relatively simple circuitry added to the operand fetch components
in a pipelined processor.
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2

\sloppy

\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing as the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  The study even
speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle with traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{u8u16, PPoPP08}, XML parsing \cite{CASCON08, Herdy}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes).  Loading bit stream data into 128-bit SIMD registers,
then, allows data from 128 consecutive code units to be represented and
processed at once.  Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

In application to UTF-8 to UTF-16 transcoding, a 3X to 25X speed-up
is achieved using parallel bit stream techniques on SIMD-capable uniprocessors
employing the SSE or Altivec instruction sets \cite{PPoPP08}.
In the broader context of XML parsing, further applications of these
techniques demonstrate their utility
in delivering performance benefits through a significant portion of the
web technology stack.  In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with a similarly
coded application using such well-known parsers as Expat and Xerces \cite{CASCON08}.
In an application involving transformation between different XML formats
(GML and SVG), an implementation using parallel bit stream technology
required a mere 15 cycles per byte, while a range of other technologies
required from 25 to 200 cycles per byte \cite{Herdy}.
Ongoing work is further applying the parallel bit stream methods to
parallel hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema \cite{CASCON08}.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SIMD instruction set
features based on the principle of systematic support for
inductive doubling algorithms.  Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration.  In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths.   These transitions are quite frequent in several critical
algorithms for parallel bit streams, and they
occur in other applications as well.  In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported quite useful for a number of SIMD kernels \cite{Nuzman}.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.
The second section introduces IDISA and the
SIMD notation used throughout this paper.  The third
section moves on to discuss an evaluation methodology
for IDISA in comparison to two reference architectures
motivated by the SSE and Altivec instruction sets.
The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count.  Sections 5 through 7
then address the application of IDISA to core
algorithms in text processing with parallel bit
streams.
The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Section 9 addresses IDISA implementation while
Section 10 concludes the paper with a summary of
results and directions for further work.


\section{Inductive Doubling Architecture}

This section presents an idealized model for a single-instruction
multiple-data (SIMD) instruction set architecture designed
specifically to support inductive doubling algorithms in the
domain of parallel bit stream programming.   The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SIMD applications in other domains.  The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design.  First we introduce a simple model and notation for
SIMD operations in general and then present the four key
features of an idealized architecture in support of parallel
bit stream programming.

The idealized architecture supports typical SIMD integer
operations common to existing commodity architectures such as SSE
and Altivec.   The architecture is defined to support SIMD
operations on registers of size $N=2^K$ bits, for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.   The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$.   Typical values of $k$ used on commodity processors
include $k = 3$ for SIMD operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n - 1$ of
register $r$, using big-endian numbering.
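For example, with $N=128$ and $n=8$, field $r_8[0]$ comprises bits 0
through 7, the most significant byte of the register under this numbering.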

Let \verb:simd<n>: represent the class of SIMD operations defined
on fields of size $n$ using C++ template syntax.  Given a
binary function $F_n$ on $n$-bit fields, we denote the SIMD
version of this operation as \verb#simd<n>::F#.  Given two
SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\[ r_i = F_n(a_i, b_i) \]

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
%\doublespace
The SSE and Altivec instruction sets support
each of the addition and subtraction operations in SIMD form
for 8, 16 and 32-bit fields, while the SSE set also includes
the 64-bit forms.  For example, \verb#simd<8>::add# in our
nomenclature is provided by the operation \verb:paddb:
on SSE and the operation \verb:vaddubm: on Altivec.
The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields shift by the same amount.

Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values.  The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers.  Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms on bit streams, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms.  In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width.  In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width.  In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core.  In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1.  These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available.  For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation on SIMD registers.
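
The remaining core operations on 1-bit fields reduce to bitwise logic in
the same way; a brief sketch, derived directly from the definitions above:
\begin{verbatim}
r = simd<1>::add(a, b);  // (a+b) mod 2: equivalent to simd_xor(a, b)
r = simd<1>::sub(a, b);  // (a-b+2) mod 2: also simd_xor(a, b)
r = simd<1>::sll(a, b);  // shift by b mod 1 = 0: simply a copy of a
\end{verbatim}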

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SIMD
operation.  These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:'').  When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified.  The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
together with the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.
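
To see the savings this entails, consider a sketch of how the effect of
\verb#simd<4>::add<h,l>(a,b)# might be obtained using only conventional
SWAR operations (the shift-immediate and constant-load operations assumed
for the reference architectures of the next section):
\begin{verbatim}
ah = simd<4>::srli<2>(a);               // high 2 bits of each 4-bit field
bl = simd_and(b, simd<8>::const(0x33)); // low 2 bits of each 4-bit field
r  = simd<4>::add(ah, bl);              // three operations instead of one
\end{verbatim}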

The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations.  Thus, packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.

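For illustration, a sketch of an equivalent computation for
\verb#simd<8>::pack<l,l>(a,b)# using an unmodified pack preceded by
explicit masking (assuming the unsigned saturation variant of conv, so
that the masked values pass through unchanged):
\begin{verbatim}
t0 = simd_and(a, simd<8>::const(0x0F)); // keep low nybble of each byte
t1 = simd_and(b, simd<8>::const(0x0F));
r  = simd<8>::pack(t0, t1);             // = simd<8>::pack<l,l>(a, b)
\end{verbatim}
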
The final facility of the inductive doubling architecture is
a set of merging operations \verb#simd<n>::mergeh# and
\verb#simd<n>::mergel# that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers.   The respective
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths.  The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.
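
As a concrete reading of these equations for $N=128$ and $n=8$, the
merge operations simply interleave the bytes of their two operands:
\begin{verbatim}
r = simd<8>::mergeh(a, b); // bytes a[0],b[0],a[1],b[1],...,a[7],b[7]
s = simd<8>::mergel(a, b); // bytes a[8],b[8],a[9],b[9],...,a[15],b[15]
\end{verbatim}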

This completes the description of IDISA.  As described, many of the
features are already available with the SIMD facilities of
existing commodity processors.  The extensions enumerated
here are relatively straightforward.  The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor.  The goal
in this paper is to evaluate these features independently
of artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures.  For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields.  The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields.  In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1.  We also assume that
a constant load operation \verb#simd<n>::const(c)#
loads the constant value $c$ into each $n$-bit field.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)# rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
operator (abbreviated \verb:simd_if: in code examples)
implements the logic $a \wedge b \vee \neg a \wedge c$, patterned
after the Altivec \verb:vec_sel: operation.  Finally,
a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

Two versions of IDISA are assessed against these reference
architectures as follows.  IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128.  IDISA-B is
similarly defined and extended based on RefB.  Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption.  On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be
pipelined processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access.  Such processors are certainly feasible with a
suitable investment in circuitry, although practical designs
may make sacrifices to achieve close to optimal throughput
with less circuitry.  However, the assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.  Furthermore,
the assumption also eliminates artifacts due to possibly different
latencies in reference and IDISA architectures.  Because
the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, they
tend to be compute-bound rather than IO-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity needed to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section.  That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}[h]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  Population count
is important enough, for applications such as calculating Hamming distance,
to be included as a built-in instruction
on some processors.  For example, the SPU of the Cell Broadband Engine
has a SIMD population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields.  Each implementation employs
five steps of inductive doubling to produce population counts
within 32-bit fields.  The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SWAR operation.   Without the IDISA features, a
straightforward RefA implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, it is possible that these constants
could be kept preloaded in long vector processing.  In accord
with our evaluation model, the RefB cost is thus 15 operations.
As the IDISA implementation requires no constants at all,
both the IDISA-A and IDISA-B cost is 5 operations.
Under our assumed one CPU cycle per instruction model, IDISA-A
offers a 4X improvement over RefA, while IDISA-B offers a 3X
improvement over its comparator.

The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step.  If there is
one such operation for each operand, and the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations is required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks.  IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$.   Thus a 3X to 4X improvement
is obtained in these cases.
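
To make the comparison concrete, here is a sketch of the second step of
Figure \ref{HD-pop} in its 128-bit SWAR adaptation, with the mask load
shown separately (it is elided under the RefB preloading assumption):
\begin{verbatim}
m = simd<32>::const(0x33333333);
c = simd<32>::add(simd_and(c, m),
                  simd_and(simd<32>::srli<2>(c), m));
\end{verbatim}
The corresponding IDISA step is the single operation
\verb#c = simd<4>::add<h,l>(c, c);#.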

\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of IDISA: transposition of byte stream data to parallel bit stream
form.   Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process.  Before considering how
IDISA supports this
transposition process, however, we first consider
algorithms on existing architectures.  Two algorithms
are presented; the best of these requires 72
SWAR operations under the RefB model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using IDISA-A or IDISA-B
with a mere 24 three-register SIMD operations.  We also show
that this is optimal for any three-register instruction set model.

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
\caption{Input/Output Model for Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}
\end{figure}

Figure \ref{s2p-spec} illustrates the input-output requirements of
the transposition problem.  We assume that inputs and
outputs are each SWAR registers of size $N=2^K$ bits.
The input consists of $N$ bytes of serial byte data,
stored consecutively in eight SIMD registers each holding
$N/8$ bytes.  The output consists of eight parallel
registers, one each for the eight individual bit positions
within a byte.  Upon completion of the transposition process,
each output register is to hold the $N$ bits corresponding
to the selected bit position in the sequence of $N$ input
bytes.

\subsection{Bit Gathering Algorithm}

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
\caption{Serial to Parallel Transposition Using Bit-Gathering}
\label{gather}
\end{center}
\end{figure}

One straightforward algorithm for implementing the transposition process
takes advantage of SWAR bit gathering operations that exist
on some architectures.  This operation gathers one bit per byte
from a particular position within each byte of a SIMD register.
For example, the {\tt pmovmskb} operation of the Intel MMX and
SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
consisting of the high bit of each byte.  Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte
from the SIMD register.  Figure \ref{gather} illustrates the
bit gathering process.

For each bit stream of output, the bit gather algorithm requires
one gather operation for each of the 8 input registers,
giving 64 bit gather operations in all.  In addition, for seven
of the eight bit positions, it is necessary to shift the bits
of each input register into the conventional position for
gathering.  A total of 56 shift operations are required for this
task.  Finally, the result of each bit gather operation must
be properly inserted into the output stream.  If the first
gather operation for a stream can be used to also initialize
the output register, there will then need to be 7 insert
operations for the results of the remaining gather operations
for this stream, making 56 insert operations in all.
The insert step may be more complex than a single operation
in some cases, but we use one operation per insert as a lower bound.
Thus, the bit gather algorithm requires
at least 176 operations to perform the transposition task.

\subsection{BytePack Algorithm}

A much more efficient transposition algorithm on commodity
SWAR architectures involves three
stages of binary division transformation.  This is similar
to the three-stage bit matrix inversion described by
Warren \cite{HackersDelight}, although modified to use SWAR operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions.  The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even-numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from each of the
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage is
divided into the desired individual bit streams in the third stage.

% \begin{figure}[tbh]
% \begin{center}\small
% \begin{verbatim}
% s0h = simd<16>::srli<8>(s0);
% s0l = simd_and(s0, simd<16>::const(0x00FF));
% s1h = simd<16>::srli<8>(s1);
% s1l = simd_and(s1, simd<16>::const(0x00FF));
% t0 = simd<16>::pack(s0h, s1h);
% t1 = simd<16>::pack(s0l, s1l);
% t0_l1 = simd<16>::slli<1>(t0);
% t0_r1 = simd<16>::srli<1>(t1);
% mask = simd<8>::const(0xAA);
% p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask));
% p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask));
% \end{verbatim}
% \end{center}
% \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
% \label{s2pstep}
% \end{figure}
%

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::const(0xAA);
t0 = simd<8>::permute(s0, s1, even);
t1 = simd<8>::permute(s0, s1, odd);
p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{RefB Transposition Step in BytePack Stage 1}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking.  In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.
Figure \ref{s2pstep} shows a stage 1 transposition step in a
RefB implementation.  Using the permute facility, the even
and odd bytes, respectively, from two serial input registers
\verb:s0: and \verb:s1: are packed into temporary registers
\verb:t0: and \verb:t1:.  The even and odd bits are then
separated into two parallel output registers \verb:p0: and \verb:p1:
by selecting alternating bits using a mask.  This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and mask constants.
Overall, 6 operations per step are required, yielding a total
of 72 operations to transpose 128 bytes to parallel bit stream
form in the RefB implementation.

In a RefA implementation, byte packing may also be achieved
by \verb#simd<16>::pack# operations, with 4 additional operations to
prepare operands.  Essentially, the RefB implementation
uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
each \verb#simd_if#.
Assuming that mask loads are only needed once per 128 bytes,
a total of 148 operations is required in the RefA implementation.
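
For reference, a sketch of the RefA equivalent of the two packs, preparing
operands by shifting and masking before applying \verb#simd<16>::pack#
(again assuming the unsigned saturation variant of conv):
\begin{verbatim}
s0h = simd<16>::srli<8>(s0);                 // high byte of each field
s1h = simd<16>::srli<8>(s1);
t0  = simd<16>::pack(s0h, s1h);              // simd<16>::pack<h,h>(s0,s1)
s0l = simd_and(s0, simd<16>::const(0x00FF)); // low byte of each field
s1l = simd_and(s1, simd<16>::const(0x00FF));
t1  = simd<16>::pack(s0l, s1l);              // simd<16>::pack<l,l>(s0,s1)
\end{verbatim}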

\subsection{Inductive Halving Algorithm}

Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the bytepack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bitpairs in a second stage and finally
dividing the bitpair streams into bit streams in the third stage.

\begin{figure}[tbh]
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two IDISA-A operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte.  As in the bytepack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs.  The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble.  Four applications of this step complete stage 2.

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair.  The complete algorithm to transform eight serial
byte registers s0 through s7 into the eight parallel bit stream
registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
Under either the IDISA-A or IDISA-B model, a mere 24 operations per 128
input bytes are required.

\begin{figure}[tbh]
\small
\begin{verbatim}
hnybble0 = simd<8>::pack<h,h>(s0, s1);
lnybble0 = simd<8>::pack<l,l>(s0, s1);
hnybble1 = simd<8>::pack<h,h>(s2, s3);
lnybble1 = simd<8>::pack<l,l>(s2, s3);
hnybble2 = simd<8>::pack<h,h>(s4, s5);
lnybble2 = simd<8>::pack<l,l>(s4, s5);
hnybble3 = simd<8>::pack<h,h>(s6, s7);
lnybble3 = simd<8>::pack<l,l>(s6, s7);
hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
\end{verbatim}
\caption{Complete Inductive Halving Algorithm}
\label{halvingalgorithm}
\end{figure}

\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SIMD architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SIMD architecture, we refer to any architecture
that uses SIMD instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition.  Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address.  However, each
operation can move results into at most one register.
At most, this can result in the assignment of one correct address
bit for each of the $N$ input bits.  As all $8N$ input bits
need to be moved to a register with a correct 3-bit address,
a minimum of 24 operations is required.

\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}.  Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone would improve end-to-end performance in a
complete XML processing application by more than 4\%.
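To spell out the arithmetic: a 3X speed-up would reduce transposition from
1.1 to roughly 0.37 cycles per byte, a saving of about 0.73 of the 15
cycles per byte overall, i.e., $0.73/15 \approx 4.9\%$.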


\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions.  In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed.  In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit.  Those two byte streams
are subsequently merged to form the final result.

Algorithms for performing the inverse transform mirror those
of the forward transform, employing SIMD merge operations
in place of pack operations.  The best algorithm known
to us on the commodity SIMD architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets.  To perform the full inverse transform of 8 parallel
registers of bit stream data into 8 serial registers of byte stream data,
a RefA implementation requires 120 operations, while a RefB
implementation reduces this to 72.

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
bit01_r0 = simd<1>::mergeh(bit0, bit1);
bit01_r1 = simd<1>::mergel(bit0, bit1);
bit23_r0 = simd<1>::mergeh(bit2, bit3);
bit23_r1 = simd<1>::mergel(bit2, bit3);
bit45_r0 = simd<1>::mergeh(bit4, bit5);
bit45_r1 = simd<1>::mergel(bit4, bit5);
bit67_r0 = simd<1>::mergeh(bit6, bit7);
bit67_r1 = simd<1>::mergel(bit6, bit7);
bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
\end{verbatim}
\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
\end{figure}

An algorithm employing only 24 operations using IDISA-A/B is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bitpair streams to produce streams
of nybbles for the high and low nybble of each byte.  Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data.  The full inductive doubling
algorithm for parallel to serial transposition is shown in Figure
\ref{p2s-inductive}.

This algorithm is again optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model.  The proof directly follows that for the serial to parallel
transposition.

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.


\section{Parallel Bit Deletion}

\begin{figure*}[tbh]
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form.  It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.

Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process.  However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream.  Using the inductive doubling instruction set architecture,
it is possible to carry out bit deletion much more efficiently.

Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.


The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm.  Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results.  The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}.  Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field.  Similarly,
left-justification requires a left shift by the population count in the
high half of the field.

The left and right shifts can be performed simultaneously using a rotate
left instruction.  Right justification by shifting an $n$-bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions.  These rotation amounts are computed by the operation
\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
except that don't care fields (which won't be subsequently used)
are marked \verb:XX:.

The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
as shown in row 6, and are combined with the right shift amounts
by the selection operation \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
as shown in row 7.  Using these computed values, the inductive step
is completed by application of the operation \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.
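
Gathered together, the deletion information computation of rows 5 through 7
and the per-stream application of row 8 are simply the following, assuming
the constants involved are preloaded:
\begin{verbatim}
rj      = simd<8>::sub<x,l>(simd<8>::const(8), cts_4);
lj      = simd<8>::srli<4>(cts_4);
rot_8   = simd_if(simd<16>::const(0xFF00), rj, lj);
rslt_16 = simd<8>::rotl(rslt_8, rot_8);
\end{verbatim}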

At each inductive doubling level, it requires 4 operations to compute the
required deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture has far fewer operations.
The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
induction versus $4k^2$ for the parallel-prefix method.  Using the computed
deletion information, only a single SIMD rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.



\section{Beyond Parallel Bit Streams}

It can be argued that text processing with parallel bit
streams by itself is not sufficiently important to justify
the circuit complexity of IDISA.  In this section, then,
we move on to consider further applications that may
benefit from IDISA features.  These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
Further applications for narrow field widths are discussed
as well, such as packed DNA representations using 2-bit
field widths and packed decimal representations using 4-bit
field widths.  Most significantly, however, we show that IDISA offers
a fully general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Inductive Doubling with Logic Operations}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >>16);
y = y & 1;
\end{verbatim}
\end{center}
\caption{Parity Reference Algorithm}
\label{HD-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = x - 1;
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
x = x + 1;
\end{verbatim}
\end{center}
\caption{Power-of-2 Ceiling Reference Algorithm}
\label{HD-clp2}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x, h>(x, x);
x = simd<4>::or<x, h>(x, x);
x = simd<8>::or<x, h>(x, x);
x = simd<16>::or<x, h>(x, x);
x = simd<32>::or<x, h>(x, x);
x = simd<32>::add(x, simd<32>::const(1));
\end{verbatim}
\end{center}
\caption{IDISA Power-of-2 Ceiling Implementation}
\label{ID-clp2}
\end{figure}

Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
code for two further inductive doubling examples using
logical operators as the combining feature.  In the
first of these, the ``exclusive or'' operation is applied
to all bits in a 32-bit value to determine parity.
Parity has important applications for error-correcting
codes such as the various Hamming codes for detecting
and correcting numbers of bit errors dependent on the
number of parity bits added.  Warren's version uses
11 operations for parity of a single 32-bit value;
a straightforward SWAR adaptation also uses 11 operations
for the parity of a set of 32-bit fields.

Figure \ref{ID-parity} shows the IDISA parity implementation
with only 5 operations required.  This represents slightly
more than a 2X improvement.  The improvement is less than
the 3X seen in other cases because one of the operands need
not be modified before applying the exclusive-or operation.

Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
code for the least power-of-2 ceiling of a 32-bit value.  The
technique is to employ inductive doubling to fill in one bits at
all bit positions after the first one bit in the original value to
first produce a value of the form $2^n-1$.  This is then incremented
to determine the power of 2 ceiling.  This function and
its straightforward adaptation for SWAR application on a
set of 32-bit fields requires 12 operations.   The
corresponding IDISA implementation of Figure \ref{ID-clp2}
requires 7 operations, just under a 2X improvement.

\subsection{Packed DNA Representation}

DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented as packed
sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
allow these packed representations to be quickly computed
from byte-oriented string values by two steps of inductive
halving.   Similarly, conversion back to string form
can use two steps of inductive merging.  Without direct
support for these pack and merge operations, the SWAR
implementations of these conversions require the cost
of explicit masking and shifting in combination with
the 16-bit to 8-bit packing and 8-bit to 16-bit
merging operations supported by existing SWAR facilities
on commodity processors.
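
As a sketch of the packing direction, assuming each input byte has already
been mapped so that its low 2 bits hold the nucleotide code (with
\verb:s0: through \verb:s3: as hypothetical byte stream input registers):
\begin{verbatim}
q0  = simd<8>::pack<l,l>(s0, s1);  // bytes to packed 4-bit fields
q1  = simd<8>::pack<l,l>(s2, s3);
dna = simd<4>::pack<l,l>(q0, q1);  // 4-bit fields to packed 2-bit codes
\end{verbatim}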

\subsection{Bit Reverse}

Include or omit?

\subsection{String/Decimal/Integer Conversion}
\begin{figure}
\begin{center}\small
\begin{verbatim}
b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b=(b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b=(b & 0x0000FFFF) + 10000 * (b >> 16);
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
t1=simd<8>::const(10);
t2=simd<16>::const(100);
t3=simd<32>::const(10000);
b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d);
b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b);
b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b);
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer Implementation}
\label{ID-BCD2int}
\end{figure}
1156
Just as DNA sequences represent an important use case for
SWAR operations on 2-bit fields, packed sequences of
decimal or hexadecimal digits represent a common use case
for 4-bit fields.  These representations can be used
both as an intermediate form in numeric string to integer
conversion and as a direct representation for
packed binary coded decimal.

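As an illustration of the former use, a pair of registers holding
ASCII decimal digits can be reduced to packed 4-bit form with one
subtraction and one packing step (a sketch under the same assumptions
as the DNA example, and additionally assuming an 8-bit subtract among
the core vertical operations; 48 is the ASCII code for ``0''):
\begin{verbatim}
z0 = simd<8>::sub(s0, simd<8>::const(48))
z1 = simd<8>::sub(s1, simd<8>::const(48))
d  = simd<8>::pack<l,l>(z0, z1)
\end{verbatim}
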
Figure \ref{BCD2int} shows a three-step inductive
doubling implementation for conversion of 32-bit packed BCD
values to integer form.  The 32-bit value consists
of 8 4-bit decimal digits.  Pairs of digits are
first combined by multiplying the higher digit
of the pair by 10 and adding.  Pairs of these
two-digit results are then further combined by
multiplying the value of the higher of the two-digit
results by 100 and adding.  The final step is
to combine four-digit results by multiplying the
higher one by 10000 and adding.  Overall, 20
operations are required for this implementation,
as well as for the corresponding SWAR implementation
on sets of 32-bit fields.  Preloading of 6 constants
into registers for repeated use can reduce the number of
operations to 14 at the cost of significant register
pressure.

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}.  This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each of the addition and multiplication operations
modified at each level.  Overall, this implementation
requires 9 operations, or 6 operations with 3
preloaded constants.  This represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.

\subsection{Systematic Support for Horizontal SIMD Operations}

In SIMD parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers.  Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing a horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs a horizontal add of sets of
four unsigned 8-bit fields within one register
and then combines the results vertically with the
corresponding 32-bit fields of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is applied, whether it is applied
before or after the horizontal operation, and the
nature of the vertical combining operation.
Within this space, commodity SIMD architectures tend
to support only a very few combinations, without any
particular attempt at systematic support for horizontal
operations in general.

In contrast to this {\em ad hoc} support on commodity
processors, IDISA offers a completely systematic treatment
of horizontal operations without any special features beyond
the inductive doubling features already described.
In the simplest case, any vertical operation
\verb#simd<n>::F# on $n$-bit fields gives rise to
an immediate horizontal operation
\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
while \verb#simd<32>::min<h,l># can produce the
minimum value of adjacent 16-bit fields.
Thus any binary horizontal operation can be implemented
in a single IDISA instruction making use of the \verb:<h,l>:
operand modifier combination.

Horizontal combinations of four adjacent fields can also be
realized in a general way through two steps of inductive
doubling.  For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation
on four 32-bit fields.  This four-field combination
can easily be implemented with the following two operations.
%\begin{singlespace}
\begin{verbatim}
t = simd<64>::or<h,l>(x, x)
t = simd<128>::or<h,l>(t, t)
\end{verbatim}
%\end{singlespace}

In general, systematic support for horizontal
combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
in a similar fashion.
Thus, IDISA essentially offers systematic support
for horizontal operations entirely through the
use of \verb:<h,l>: half-operand modifier
combinations.

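For example, a horizontal sum across all eight 16-bit fields of a
register (an illustrative three-step composition in the same notation,
not a dedicated instruction) may be written as follows.
\begin{verbatim}
t = simd<32>::add<h,l>(x, x)
t = simd<64>::add<h,l>(t, t)
t = simd<128>::add<h,l>(t, t)
\end{verbatim}
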
Systematic support for general horizontal operations
under IDISA also creates an opportunity for a design tradeoff:
offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SIMD operations.
Even if legacy support for these operations is required,
it may be possible to provide that support through
software or firmware rather than a full hardware
implementation.  Evaluation of these possibilities
in the context of particular architectures is a potential
area for further work.

\section{Implementation}

IDISA may be implemented as a software
abstraction on top of existing SWAR architectures or
directly in hardware.  In this section, we briefly
discuss implementation of IDISA libraries before
moving on to consider hardware design.  Although
a full realization of IDISA in hardware is beyond our
current capabilities, our goal is to develop a sufficiently
detailed design to assess the costs of IDISA implementation
in terms of the incremental complexity over the RefA
and RefB architectures.  The cost factors we consider, then,
are the implementation of the half-operand modifiers, and
the extension of core operations to the 2-bit, 4-bit and
128-bit field widths.  In each case, we also discuss
design tradeoffs.

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template
and macro libraries has been useful in developing
and assessing the correctness of many of the algorithms
presented here.  Although these implementations do not
deliver the performance benefits associated with
direct hardware implementation of IDISA, they
have been quite useful in providing a practical means
for portable implementation of parallel bit stream
algorithms on multiple SWAR architectures.  However,
one additional facility has also proven necessary for
portability of parallel bit stream algorithms across
big-endian and little-endian architectures: the
notion of shift-forward and shift-back operations.
In essence, shift forward means shift to the left
on little-endian systems and shift to the right on
big-endian systems, while shift back has the reverse
interpretation.  Although this concept is unrelated to
inductive doubling, its inclusion with the IDISA
libraries has provided a suitable basis for portable
SIMD implementations of parallel bit stream algorithms.
Beyond this, the IDISA libraries have the additional
benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without need for programmer coding of
the underlying shift and mask operations.

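As an illustration of how such a library can realize IDISA operations
on an existing SWAR target, the following sketch (our own, not taken
from the library itself) emulates \verb#simd<16>::add<h,l># using SSE2
intrinsics; the function name is hypothetical:
\begin{verbatim}
#include <emmintrin.h>

/* Emulate simd<16>::add<h,l>(a, b): add the high
   half of each 16-bit field of a to the low half
   of the corresponding field of b. */
static inline __m128i simd16_add_h_l(__m128i a,
                                     __m128i b) {
  __m128i mask = _mm_set1_epi16(0x00FF);
  __m128i ah = _mm_srli_epi16(a, 8);   /* high halves */
  __m128i bl = _mm_and_si128(b, mask); /* low halves */
  return _mm_add_epi16(ah, bl);
}
\end{verbatim}
The explicit shift and mask operations in this emulation are precisely
the overhead that direct hardware support for half-operand modifiers
would eliminate.
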
\subsection{IDISA Model}
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}

Figure \ref{pipeline-model} shows a block diagram
for a pipelined SIMD processor implementing IDISA.
The SIMD Register File (SRF) provides a file of $R = 2^A$
registers, each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch
Unit (IFU) are forwarded for decoding to the SIMD
Instruction Decode Unit (SIDU).  This unit decodes
the instruction to produce
signals identifying the source and destination
operand registers, the half-operand modifiers, the
field width specification and the SIMD operation
to be applied.

The SIDU supplies the source register information and the half-operand
modifier information to the SIMD Operand Fetch Unit (SOFU).
For each source operand, the SIDU provides an $A$-bit register
address and two 1-bit signals $h$ and $l$ indicating the value
of the decoded half-operand modifiers for this operand.
At most one of these values may be 1; both are 0 if
no modifier is specified.
The SIDU also supplies decoded field width signals $w_k$
for each field width $2^k$ to both the SOFU and to the
SIMD Instruction Execute Unit (SIEU).  Only one of the
field width signals has the value 1.
The SIDU also supplies decoded SIMD opcode information to the SIEU and
a decoded $A$-bit register address for the destination register to
the SIMD Result Write Back Unit (SRWBU).

The SOFU is the key component of the IDISA model that
differs from that found in a traditional SWAR
processor.  For each of the two $A$-bit source
register addresses, the SOFU is first responsible for
fetching the raw operand values from the SRF.
Then, before supplying operand values to the
SIEU, the SOFU applies the half-operand modification
logic as specified by the $h$, $l$, and field-width
signals.  The possibly modified operand values are then
provided to the SIEU for carrying out the SIMD operations.
A detailed model of SOFU logic is described in the following
subsection.

The SIEU differs from similar execution units in
current commodity processors primarily by providing
SIMD operations at each field width
$n=2^k$ for $0 \leq k \leq K$.  This involves
additional circuitry for field widths not supported
in existing processors.  In our evaluation model,
IDISA-A adds support for 2-bit, 4-bit and 128-bit
field widths in comparison with the RefA architecture,
while IDISA-B similarly extends RefB.

When execution of the SWAR instruction is
completed, the result value is then provided
to the SRWBU to update the value stored in the
SRF at the address specified by the $A$-bit
destination operand.

\subsection{Operand Fetch Unit Logic}

The SOFU is responsible for implementing the half-operand
modification logic for each of up to two input operands fetched
from the SRF.  For each operand, this logic is implemented
using the decoded half-operand modifier signals $h$ and $l$,
the decoded field width signals $w_k$ and the 128-bit operand
value $r$ fetched from the SRF to produce a modified 128-bit operand
value $s$ following the requirements of equations (4), (5) and
(6) above.  Those equations must be applied for each possible
modifier and each field width to determine the possible values $s[i]$
for each bit position $i$.  For example, consider bit
position 41, whose binary 7-bit address is $0101001$.
Considering the address bits left to right, each 1 bit
corresponds to a field width for which this bit lies in the
lower $n/2$ bits (widths 2, 16, 64), while each 0 bit corresponds to a field
width for which this bit lies in the high $n/2$ bits.
In response to the half-operand modifier signal $h$,
this bit may receive a value from the corresponding field
of width 2, 16 or 64 whose address bit is 0, namely $r[40]$,
$r[33]$ or $r[9]$.  Otherwise, this bit receives the value $r[41]$,
either when no half-operand modifier is specified or when the
low half-operand modifier is specified in conjunction with a field
width signal $w_2$, $w_{16}$ or $w_{64}$.
The overall logic for determining this bit value is thus given as follows.
\begin{eqnarray*}
s[41] & = & h \wedge (w_2 \wedge r[40] \vee w_{16} \wedge r[33] \vee w_{64} \wedge r[9]) \\
& & \vee \neg h \wedge (\neg l \vee w_2 \vee w_{16} \vee w_{64}) \wedge r[41]
\end{eqnarray*}

Similar logic is determined for each of the 128 bit positions.
For each of the 7 field widths, 64 bits are in the low $n/2$ bits,
resulting in 448 2-input and gates for the $w_k \wedge r[i]$ terms.
For 120 of the bit positions, two or more such terms arise and
or gates are needed to combine them; these positions account for 441
of the and terms, so $441 - 120 = 321$ 2-input or gates are required.
Another 127 2-input and gates combine these values with the $h$ signal.
In the case of a low half-operand modifier, the or gates combining the $w_k$
signals can share circuitry.  For each bit position $i=2^k+j$ one
additional or gate is required beyond that for position $j$.
Thus 127 2-input or gates are required.  Another 256 2-input and gates
are required for combination with the $\neg h$ and $r[i]$ terms.  The terms for
the low and high half-operand modifiers are then combined with an
additional 127 2-input or gates.  Thus, the circuit complexity
of the combinational logic implementation of half-operand
modifiers within the SOFU is 1406 2-input gates per operand,
or 2812 gates in total.

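For reference, the selection behavior that this circuitry implements
can be modeled compactly in software.  The following C sketch (our own
illustration, shown at 64-bit register width for brevity) applies the
high or low half-operand modification for a field width $n = 2^k$:
\begin{verbatim}
#include <stdint.h>

/* Model of half-operand modification: within each
   n-bit field, the high modifier shifts the field
   right by n/2, while the low modifier keeps only
   the low n/2 bits. */
uint64_t half_operand(uint64_t r, int k, int high) {
  int n = 1 << k;  /* field width, 2 <= n <= 64 */
  uint64_t fmask =
      (n == 64) ? ~0ULL : (1ULL << n) - 1;
  uint64_t hmask = (1ULL << (n / 2)) - 1;
  uint64_t s = 0;
  for (int p = 0; p < 64; p += n) {
    uint64_t f = (r >> p) & fmask;
    uint64_t half =
        high ? (f >> (n / 2)) : (f & hmask);
    s |= half << p;
  }
  return s;
}
\end{verbatim}
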
The gate-level complexity of half-operand modifiers as described is nontrivial,
but modest.  However, one possible design tradeoff is to
differentiate the two operands, permitting a high half-operand
modifier to be used only with the first operand and a low
half-operand modifier to be used only with the second operand.  This would exclude
\verb#<h,h># and \verb#<l,l># modifier combinations and also
certain combinations for noncommutative core operations.
The principal
consequence for the applications considered here would be with
respect to the pack operations in forward transposition, but it
may be possible to address this through SIEU circuitry.
If this approach were to be taken, the gate complexity of
half-operand modification would be reduced by slightly more than half.

\subsection{2-Bit and 4-Bit Field Widths}

Beyond the half-operand modifiers, extension of core SWAR
operations to 2-bit and 4-bit field widths is critical to
inductive doubling support.  The principal operations
that need to be supported in this way are addition, pack,
merge, and rotate.

Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
as a modification to that for 8-bit fields by incorporating logic to
disable carry propagation at the 16 midfield boundaries.  For 2-bit
fields, disabling carry propagation at 32 additional boundaries suffices,
although it may be easier to implement 2-bit adders directly.

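The effect of disabling carry propagation at field boundaries can be
seen in the familiar software idiom for field-limited addition (an
illustrative C sketch for 4-bit fields within a 64-bit word, not a
hardware description):
\begin{verbatim}
#include <stdint.h>

/* Add 4-bit fields without carries crossing field
   boundaries: sum with the high bit of each field
   masked off, then restore the high-bit sums with
   exclusive-or. */
uint64_t add4(uint64_t a, uint64_t b) {
  const uint64_t H = 0x8888888888888888ULL;
  return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
\end{verbatim}
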
Pack and merge require bit selection logic for each field width.
A straightforward implementation model for each operation
uses 128 2-input and gates to select the desired bits from the
operand registers and another 128 2-input or gates to feed
these results into the destination register.

Rotation for 2-bit fields involves simple logic for selecting
between the 2 bits of each field of the operand being
rotated on the basis of the low bit of each field of the
rotation count.  Rotation for 4-bit fields is more complex,
but can also be based on 1-of-4 selection circuitry
involving the low 2 bits of the rotation count fields.

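A software model makes the 2-bit selection logic concrete (again an
illustrative C sketch rather than a hardware description): each field
is either kept as-is or bit-swapped according to the low bit of its
rotation count.
\begin{verbatim}
#include <stdint.h>

/* Rotate each 2-bit field of a by the low bit of
   the corresponding field of n. */
uint64_t rotl2(uint64_t a, uint64_t n) {
  const uint64_t L = 0x5555555555555555ULL;
  uint64_t swapped =
      ((a & L) << 1) | ((a >> 1) & L);
  uint64_t sel = (n & L) * 3;  /* 00 or 11 per field */
  return (a & ~sel) | (swapped & sel);
}
\end{verbatim}
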
\subsection{128-Bit Field Widths}

For completeness, the IDISA model requires core operations
to be implemented at the full register width, as well
as at all power-of-2 partitions.  This may be problematic for
operations such as addition due to the inherent delays
in 128-bit carry propagation.  However, the primary
role of 128-bit operations in inductive doubling
is to combine two 64-bit fields using \verb#<h,l>#
operand combinations.  In view of this, it may be
reasonable to define hardware support for such
combinations to be based on 64-bit logic, with support
for 128-bit logic implemented through firmware or software.

\subsection{Final Notes and Further Tradeoffs}

In order to present IDISA as a concept for design extension of
any SWAR architecture, our discussion of gate-level implementation
is necessarily abstract.  Additional circuitry is sure to be
required, for example, in the implementation of the SIDU.  However,
in the context of the 128-bit reference architectures studied,
our analysis suggests a realistic IDISA implementation well
within a 10,000 gate budget.

Moreover, the additional circuitry required may be offset
by the elimination of special-purpose instructions found in
existing processors that could instead be implemented through efficient
IDISA sequences.  These include examples such as population
count, count leading and/or trailing zeroes, and parity.
They also include specialized horizontal SIMD operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special-purpose instructions
in favor of more general IDISA features.

\section{Conclusions}

In considering the architectural support for
SIMD text processing using the method of parallel bit streams,
this paper has presented the principle of inductive doubling
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA offers support for SWAR operations at all power-of-2
field widths, including 2-bit and 4-bit widths in particular,
as well as half-operand modifiers and pack and merge operations
to support efficient transition between successive power-of-2
field widths.  The principal innovation is the notion of
half-operand modifiers that eliminate the cost associated
with the explicit mask and shift operations required for
such transitions.

Several algorithms key to parallel bit stream methods
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on reference architectures.  This
includes both a reference architecture modeled on
the SWAR capabilities of the SSE family and an
architecture incorporating the powerful permute
or shuffle capabilities found in Altivec or Cell BE processors.
In the case of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with the
lowest total number of operations that can be achieved by
any possible 3-register SWAR architecture.

Applications of IDISA in other areas have also been
examined.  The support for 2-bit and 4-bit field widths
in SWAR processing is beneficial for packed DNA representations
and packed decimal representations respectively.  Additional
inductive doubling examples have been presented and the phenomenon
of power-of-2 transitions discussed more broadly.
Most significantly, IDISA supports a fully general approach
to horizontal SWAR operations, offering a considerable
improvement over the {\em ad hoc} sets of special-purpose
horizontal operations found in existing SWAR instruction sets.

An IDISA implementation model has been presented, employing
a customized operand fetch unit to implement the half-operand
modifier logic.  Gate-level implementation of this unit
and of operations at the 2-bit and 4-bit field widths has
been analyzed and found to be quite reasonable within
a 10,000 gate bound.  Design tradeoffs to reduce the cost
have also been discussed, possibly even leading to a
net complexity reduction through the elimination of instructions
that implement special-case versions of inductive doubling.

Future research may consider the extension of inductive doubling
support in further ways.  For example, it may be possible
to develop a pipelined architecture supporting two or three
steps of inductive doubling in a single operation.

%\appendix
%\section{Appendix Title}
%
%This is the text of the appendix, if you need one.

\acks

This research was supported by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada.

%\bibliographystyle{plainnat}
\bibliographystyle{plain}
\bibliography{asplos094-cameron}
%\begin{thebibliography}{}

%\bibitem{smith02}
%Smith, P. Q. reference text

%\end{thebibliography}


\end{document}