%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
           {School of Computing Science, Simon Fraser University}
           {\{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching.   Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task.  A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique.  These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement.  The extensions are also shown to be useful
for SIMD programming in other application areas, including providing a
systematic treatment for horizontal operations.  An implementation model for
these extensions
involves relatively simple circuitry added to the operand fetch components
in a pipelined processor.
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2



\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing to be the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  The study
even speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle for traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes).  Loading bit stream data into 128-bit SIMD registers,
then, allows data from 128 consecutive code units to be represented and
processed at once.  Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
speed-up using parallel bit stream techniques on SIMD-capable uniprocessors
employing the SSE or Altivec instruction sets.  A full implementation
for each of these platforms is documented using literate programming techniques
and available as an open source code base \cite{u8u16}.
Although this transcoding task represents but a single text processing kernel,
it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
of processing time \cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
it illustrates a text processing task that can largely
be carried out using SIMD operations even though
UTF-8 character sequences are variable length. Indeed, one weakness of the
actual implementation is that it reverts to byte-at-a-time processing
at block boundaries, employing a block shortening strategy to
reduce block lengths to as low as 125 bytes from 128.
With more capable SIMD instruction set architectures providing
better facilities for multiregister shifts and larger register
files, a solution employing SIMD techniques for virtually all
of the main work would maintain better data alignment, avoid
problems of register pressure and be easier to parallelize across
multiple cores.  It should also naturally scale to 256-bit SIMD technology
such as the expected AVX technology of Intel.

The report addressing the broader problem of XML parsing is perhaps more
interesting, demonstrating the utility of parallel bit stream techniques
in delivering performance benefits through a significant portion of the
web technology stack.  In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with a similarly
coded application using such well known parsers as Expat and Xerces.
Although still a work in progress, the parser has functional
coverage of XML and related specifications comparable to, but somewhat
beyond, that of Expat.   The study also identifies promising further
work in extending the parallel bit stream methods to parallel
hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SIMD instruction set
features based on the principle of systematic support for
inductive doubling algorithms.  Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration.  In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths.  Such transitions are quite frequent in several critical
algorithms for parallel bit streams, and they occur in other
applications as well.  In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported quite useful for a number of SIMD kernels \cite{Nuzman}.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.
The second section introduces IDISA and the
SIMD notation used throughout this paper.  The third
section moves on to discuss an evaluation methodology
for IDISA in comparison to two reference architectures
motivated by the SSE and Altivec instruction sets.
The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count.  Sections 5 through 7
then address the application of IDISA to core
algorithms in text processing with parallel bit
streams.
The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Section 9 addresses IDISA implementation while
Section 10 concludes the paper with a summary of
results and directions for further work.


\section{Inductive Doubling Architecture}

This section presents an idealized model for a single-instruction
multiple-data (SIMD) instruction set architecture designed
specifically to support inductive doubling algorithms in the
domain of parallel bit stream programming.   The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SIMD applications in other domains.  The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design.  First we introduce a simple model and notation for
SIMD operations in general and then present the four key
features of an idealized architecture in support of parallel
bit stream programming.

The idealized architecture supports typical SIMD integer
operations common to existing commodity architectures such as SSE
and Altivec.   The architecture is defined to support SIMD
operations on registers of size $N=2^K$ bits, for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.   The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$.   Typical values of $k$ used on commodity processors
include $k = 3$ for SIMD operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
register $r$, using big-endian numbering.

Let \verb:simd<n>: represent the class of SIMD operations defined
on fields of size $n$ using C++ template syntax.  Given a
binary function $F_n$ on $n$-bit fields, we denote the SIMD
version of this operation as \verb#simd<n>::F#.  Given two
SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\[ r_n[i] = F_n(a_n[i], b_n[i]) \]

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
%\doublespace
The SSE and Altivec instruction sets support
each of the addition and subtraction operations in SIMD form
for 8, 16 and 32-bit fields, while the SSE set also includes
the 64-bit forms.  For example, \verb#simd<8>::add# in our
nomenclature is provided by the operation \verb:paddb:
on SSE and the operation \verb:vaddubm: on Altivec.
The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields shift by the same amount.
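
To make these semantics concrete, the following C sketch models
\verb#simd<8>::add# on a 64-bit register.  It is purely illustrative
(the function name and the indexing of fields from the low end are
conveniences of the sketch, not part of any instruction set): each
8-bit field is summed modulo $2^8$, independently of its neighbours.
\begin{verbatim}
#include <stdint.h>

/* Model of r = simd<8>::add(a, b) on a
   64-bit register: each 8-bit field is
   summed mod 2^8, with no carries across
   field boundaries. */
uint64_t simd8_add(uint64_t a, uint64_t b) {
  uint64_t r = 0;
  for (int i = 0; i < 8; i++) {
    uint64_t ai = (a >> (8 * i)) & 0xFF;
    uint64_t bi = (b >> (8 * i)) & 0xFF;
    r |= ((ai + bi) & 0xFF) << (8 * i);
  }
  return r;
}
\end{verbatim}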

Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values.  The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers.  Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms on bit streams, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms.  In essence, inductive doubling algorithms
work by establishing some base property at either 1-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width.  In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width.  In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core.  In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1.  These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available.  For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation on SIMD registers.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SIMD
operation.  These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:'').  When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified.  The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
together with the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.
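
As an illustrative model (with the same caveats as the earlier
sketch), the following C function computes
\verb#simd<4>::add<h,l># on a 32-bit register.
\begin{verbatim}
#include <stdint.h>

/* Model of r = simd<4>::add<h,l>(a, b):
   for each 4-bit field, h(a_4[i]) is added
   to l(b_4[i]), mod 2^4. */
uint32_t simd4_add_h_l(uint32_t a, uint32_t b) {
  uint32_t r = 0;
  for (int i = 0; i < 8; i++) {
    uint32_t h = ((a >> (4 * i)) & 0xF) >> 2;
    uint32_t l = (b >> (4 * i)) & 0x3;
    r |= ((h + l) & 0xF) << (4 * i);
  }
  return r;
}
\end{verbatim}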

The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations.  Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.
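
The following C sketch models these pack semantics for the small
case $N=32$, $n=8$ (illustrative only; fields are numbered
big-endian as in the text, and \verb:conv: is the signed
saturation function defined above).
\begin{verbatim}
#include <stdint.h>

/* Saturate a signed value to m bits. */
static int32_t conv(int32_t v, int m) {
  int32_t hi = (1 << (m - 1)) - 1;
  int32_t lo = -(1 << (m - 1));
  return v > hi ? hi : (v < lo ? lo : v);
}

/* Model of r = simd<8>::pack(a, b), N = 32:
   the four byte fields of a, then of b, are
   saturated to 4 bits and packed into the
   eight nybble fields of r. */
uint32_t simd8_pack(uint32_t a, uint32_t b) {
  uint32_t r = 0;
  for (int i = 0; i < 8; i++) {
    uint32_t src = (i < 4) ? a : b;
    int j = i % 4;  /* source field index */
    int8_t f = (int8_t)(src >> (24 - 8 * j));
    r |= ((uint32_t)(conv(f, 4) & 0xF))
         << (28 - 4 * i);
  }
  return r;
}
\end{verbatim}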


The final facility of the inductive doubling architecture is
a set of merging operations \verb#simd<n>::mergeh# and
\verb#simd<n>::mergel# that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers.   The respective
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths.  The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.
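
For example, \verb#simd<1>::mergeh# interleaves the high halves
of its operands bit by bit.  A C model for 32-bit registers
(again a sketch, with the big-endian field numbering of the text)
follows.
\begin{verbatim}
#include <stdint.h>

/* Model of r = simd<1>::mergeh(a, b), N = 32:
   the high 16 bits of a and b are interleaved,
   each bit of a preceding its partner from b. */
uint32_t simd1_mergeh(uint32_t a, uint32_t b) {
  uint32_t r = 0;
  for (int i = 0; i < 16; i++) {
    uint32_t abit = (a >> (31 - i)) & 1;
    uint32_t bbit = (b >> (31 - i)) & 1;
    r |= (abit << (31 - 2 * i))
       | (bbit << (30 - 2 * i));
  }
  return r;
}
\end{verbatim}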

This completes the description of IDISA.  As described, many of the
features are already available with the SIMD facilities of
existing commodity processors.  The extensions enumerated
here are relatively straightforward.  The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor.  The goal
in this paper is to evaluate these features independent
of artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures.  For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields.  The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields.  In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1.  We also assume that
a constant load operation \verb#simd::const<n>(c)#
loads the constant value $c$ into each $n$-bit field.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)#  rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
operator implements the logic $a \wedge b \vee \neg a \wedge c$, patterned
after the Altivec \verb:vec_sel: operation.  Finally,
a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

Two versions of IDISA are assessed against these reference
architectures as follows.  IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128.  IDISA-B is
similarly defined and extended based on RefB.  Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption.  On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be
pipelined processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access.  Such processors are certainly feasible with a
suitable investment in circuitry, although practical designs
may make sacrifices to achieve close to optimal throughput
with less circuitry.  However, the assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.  Furthermore,
the assumption also eliminates artifacts due to possibly different
latencies in reference and IDISA architectures.  Because
the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, they
are generally compute-bound rather than IO-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity required to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section.  That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}[h]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  The operation
is important enough, for applications such as calculating Hamming
distance, to be included as a built-in instruction
on some processors.  For example, the SPU of the Cell Broadband Engine
has a SIMD population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.
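
A minimal sketch of this line-counting use follows (the function
names are hypothetical; any population count implementation,
including that of Figure \ref{inductivepopcount}, may be
substituted).
\begin{verbatim}
#include <stdint.h>

/* Count the 1 bits in a 64-bit word by
   clearing the lowest set bit repeatedly. */
static int popcount64(uint64_t x) {
  int c = 0;
  for (; x != 0; x &= x - 1) c++;
  return c;
}

/* Advance the line number by the number of
   newlines in a processed block, given its
   newline bit stream as 64-bit words. */
void advance_line_number(const uint64_t *bits,
                         int nwords,
                         unsigned *line) {
  for (int i = 0; i < nwords; i++)
    *line += popcount64(bits[i]);
}
\end{verbatim}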

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields.  Each implementation employs
five steps of inductive doubling to produce population counts
within 32-bit fields.  The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SWAR operation.   Without the IDISA features, a
straightforward RefA implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, it is possible that these constants
could be kept preloaded in long vector processing.  In accord
with our evaluation model, the RefB cost is thus 15 operations.
As the IDISA implementation requires no constants at all,
both the IDISA-A and IDISA-B cost is 5 operations.
At our assumed one CPU cycle per instruction model, IDISA-A
offers a 4X improvement over RefA, while IDISA-B offers a 3X
improvement over its comparator.

The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step.  If there is
one such operation for each operand, and the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations are required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks.  IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$.   Thus a 3X to 4X improvement
is obtained in these cases.
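
For concreteness, one doubling step at the 16-bit level under each
model might appear as follows (a sketch in the notation of this
paper, with the RefA mask load shown explicitly; RefB keeps the
mask preloaded, for 3 operations per step).
\begin{verbatim}
mask = simd<16>::const(0x00FF); // RefA: 4 ops
t0 = simd_and(x, mask);
t1 = simd<16>::srli<8>(x);
x  = simd<16>::add(t0, t1);

x  = simd<16>::add<h,l>(x, x);  // IDISA: 1 op
\end{verbatim}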

\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of IDISA: transposition of byte stream data to parallel bit stream
form.   Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process.  Before considering how
IDISA supports this
transposition process, however, we first consider
algorithms on existing architectures.  Two algorithms
are presented; the best of these requires 72
SWAR operations under the RefB model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using IDISA-A or IDISA-B
with a mere 24 three-register SIMD operations.  We also show
that this is optimal for any three-register instruction set model.

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
\caption{Input/Output Model for Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}

\end{figure}
Figure \ref{s2p-spec} illustrates the input-output requirements of
the transposition problem.  We assume that inputs and
outputs are each SWAR registers of size $N=2^K$ bits.
The input consists of $N$ bytes of serial byte data,
stored consecutively in eight SIMD registers each holding
$N/8$ bytes.  The output consists of eight parallel
registers, one each for the eight individual bit positions
within a byte.  Upon completion of the transposition process,
each output register is to hold the $N$ bits corresponding
to the selected bit position in the sequence of $N$ input
bytes.

\subsection{Bit Gathering Algorithm}

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
\caption{Serial to Parallel Transposition Using Bit-Gathering}
\label{gather}
\end{center}
\end{figure}
One straightforward algorithm for implementing the transposition process
takes advantage of SWAR bit gathering operations that exist
on some architectures.  This operation gathers one bit per byte
from a particular position within each byte of a SIMD register.
For example, the {\tt pmovmskb} operation of the Intel MMX and
SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
consisting of the high bit of each byte.  Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte
from the SIMD register.  Figure \ref{gather} illustrates the
bit gathering process.

For each bit stream of output, the bit gather algorithm requires
one gather operation for each of the 8 input registers,
giving 64 bit-gather operations in all.  In addition, for seven
of the eight bit positions, it is necessary to shift the bits
of each input register into the conventional position for
gathering.  A total of 56 shift operations are required for this
task.  Finally, the result of each bit gather operation must
be properly inserted into the output stream.  If the first
gather operation for a stream can be used to also initialize
the output register, there will then need to be 7 insert
operations for the results of the remaining gather operations
for this stream, making 56 insert operations in all.
The insert step may be more complex than a single operation
in some cases, but we use one operation per insert as a lower bound.
Thus, the bit gather algorithm requires
at least 176 operations to perform the transposition task.
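
As a sketch, the work for the output stream of bit position $j$
involves the following per-register step, repeated for each input
register $s[k]$, $k = 0, \ldots, 7$.  Here
\verb:gather_high_bits: and \verb:insert_16: are hypothetical
stand-ins for the platform gather and insertion facilities, the
shift amount $j$ is written symbolically, and the shift is omitted
when bit $j$ is already in gathering position.
\begin{verbatim}
t = simd<8>::slli<j>(s[k]); // bit j to high bit
g = gather_high_bits(t);    // 16 bits gathered
b = insert_16(b, g, k);     // place group k in b
\end{verbatim}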

\subsection{BytePack Algorithm}

A much more efficient transposition algorithm on commodity
SWAR architectures involves three
stages of binary division transformation.  This is similar
to the three-stage bit matrix inversion described by
Warren  \cite{HackersDelight}, although modified to use SWAR operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even-numbered positions from those
at odd-numbered positions.  The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even-numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from each of the
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage is
divided into the desired individual bit streams in the third stage.

% \begin{figure}[tbh]
% \begin{center}\small
% \begin{verbatim}
% s0h = simd<16>::srli<8>(s0);
% s0l = simd_and(s0, simd<16>::const(0x00FF));
% s1h = simd<16>::srli<8>(s1);
% s1l = simd_and(s1, simd<16>::const(0x00FF));
% t0 = simd<16>::pack(s0h, s1h);
% t1 = simd<16>::pack(s0l, s1l);
% t0_l1 = simd<16>::slli<1>(t0);
% t0_r1 = simd<16>::srli<1>(t1);
% mask = simd<8>::const(0xAA);
% p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask));
% p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask));
% \end{verbatim}
% \end{center}
% \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
% \label{s2pstep}
% \end{figure}
%

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::const(0xAA);
t0 = simd<8>::permute(s0, s1, even);
t1 = simd<8>::permute(s0, s1, odd);
p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{RefB Transposition Step in BytePack Stage 1}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking.  In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.
Figure \ref{s2pstep} shows a stage 1 transposition step in a
RefB implementation.  Using the permute facility, the even
and odd bytes, respectively, from two serial input registers
\verb:s0: and \verb:s1: are packed into temporary registers
\verb:t0: and \verb:t1:.  The even and odd bits are then
separated into two parallel output registers \verb:p0: and \verb:p1:
by selecting alternating bits using a mask.  This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and mask constants.
Overall, 6 operations per step are required, yielding a total
of 72 operations to transpose 128 bytes to parallel bit stream
form in the RefB implementation.

In a RefA implementation, byte packing may also be achieved
by the \verb#simd<16>::pack# operation, with 4 additional operations to
prepare operands.  Essentially, the RefB implementation
uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
each \verb#simd_if#.
Assuming that mask loads are only needed once per 128 bytes,
a total of 148 operations are required in the RefA implementation.

\subsection{Inductive Halving Algorithm}

Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the bytepack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bitpairs in a second stage and finally
dividing the bitpair streams into bit streams in the third stage.

\begin{figure}[tbh]
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two IDISA-A operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte.  As in the bytepack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs.  The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble.  Four applications of this step complete stage 2.

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair.  The complete algorithm to transform eight serial
byte registers s0 through s7 into the eight parallel bit stream
registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
Under either the IDISA-A or IDISA-B model, a mere 24 operations per 128
input bytes are required.

\begin{figure}[tbh]
\small
\begin{verbatim}
hnybble0 = simd<8>::pack<h,h>(s0, s1);
lnybble0 = simd<8>::pack<l,l>(s0, s1);
hnybble1 = simd<8>::pack<h,h>(s2, s3);
lnybble1 = simd<8>::pack<l,l>(s2, s3);
hnybble2 = simd<8>::pack<h,h>(s4, s5);
lnybble2 = simd<8>::pack<l,l>(s4, s5);
hnybble3 = simd<8>::pack<h,h>(s6, s7);
lnybble3 = simd<8>::pack<l,l>(s6, s7);
hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
\end{verbatim}
\caption{Complete Inductive Halving Algorithm}
\label{halvingalgorithm}
\end{figure}

\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SIMD architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SIMD architecture, we refer to any architecture
that uses SIMD instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition.  Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address.  However, each
operation can move results into at most one register, and can
thus establish at most one correct address bit for each of the
$N$ input bits it processes.  As all $8N$ input bits
need to be moved to a register with a correct 3-bit address,
a minimum of 24 operations is required.

\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}.  Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone would improve end-to-end performance in a
complete XML processing application by more than 4\%.


\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions.  In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed.  In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit.  Those two byte streams
are subsequently merged to form the final result.

Algorithms for performing the inverse transform mirror those
of the forward transform, employing SIMD merge operations
in place of pack operations.  The best algorithm known
to us on the commodity SIMD architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets.  To perform the full inverse transform of 8 parallel
registers of bit stream data into 8 serial registers of byte stream data,
a RefA implementation requires 120 operations, while a RefB
implementation reduces this to 72.

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
bit01_r0 = simd<1>::mergeh(bit0, bit1);
bit01_r1 = simd<1>::mergel(bit0, bit1);
bit23_r0 = simd<1>::mergeh(bit2, bit3);
bit23_r1 = simd<1>::mergel(bit2, bit3);
bit45_r0 = simd<1>::mergeh(bit4, bit5);
bit45_r1 = simd<1>::mergel(bit4, bit5);
bit67_r0 = simd<1>::mergeh(bit6, bit7);
bit67_r1 = simd<1>::mergel(bit6, bit7);
bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
\end{verbatim}
\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
\end{figure}

An algorithm employing only 24 operations using IDISA-A/B is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bitpair streams, producing streams
of nybbles for the high and low nybble of each byte.  Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data.  The full inductive doubling
algorithm for parallel to serial transposition is shown in Figure
\ref{p2s-inductive}.

This algorithm is again optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model.  The proof directly follows that for the serial to parallel
transposition.

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.



\section{Parallel Bit Deletion}

\begin{figure*}[tbh]
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form.  It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.

Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process.  However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream.  Using the inductive doubling instruction set architecture,
it is possible to carry out bit deletion much more efficiently.
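
For reference, this compress routine is sketched below in C for
32-bit words, adapted from Warren \cite{HackersDelight}.  Note
that the mask $m$ here marks bits to {\em keep}, the complement
of a deletion mask as defined above.
\begin{verbatim}
#include <stdint.h>

/* Steele's parallel-prefix compress, after
   Warren: the bits of x selected by m are
   moved right and compacted together. */
uint32_t compress(uint32_t x, uint32_t m) {
  uint32_t mk, mp, mv, t;
  x = x & m;     /* clear deleted bits     */
  mk = ~m << 1;  /* count 0s to the right  */
  for (int i = 0; i < 5; i++) {
    mp = mk ^ (mk << 1);  /* parallel suffix */
    mp = mp ^ (mp << 2);
    mp = mp ^ (mp << 4);
    mp = mp ^ (mp << 8);
    mp = mp ^ (mp << 16);
    mv = mp & m;          /* bits to move   */
    m = (m ^ mv) | (mv >> (1 << i));
    t = x & mv;
    x = (x ^ t) | (t >> (1 << i));
    mk = mk & ~mp;
  }
  return x;
}
\end{verbatim}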

Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.


The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm.  Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results.  The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field.  Similarly,
left-justification requires a left-shift by the population count in the
high half of the field.

The left and right shifts can be performed simultaneously using a rotate
left instruction.  Right justification by shifting an $n$ bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions.  These rotation amounts are computed by the operation
\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
except that don't care fields (which won't be subsequently used)
are marked \verb:XX:.

The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
as shown in row 6, and are combined with the right shift amounts
by the selection operation  \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
as shown in row 7.  Using these computed values, the inductive step
is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.
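
Collecting the operations just described, the complete
$8 \rightarrow 16$ step is as follows, with the final rotate
repeated once per bit stream.
\begin{verbatim}
rj      = simd<8>::sub<x,l>(simd<8>::const(8),
                            cts_4);
lj      = simd<8>::srli<4>(cts_4);
rot_8   = simd_if(simd<16>::const(0xFF00),
                  rj, lj);
rslt_16 = simd<8>::rotl(rslt_8, rot_8);
\end{verbatim}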

At each inductive doubling level, it requires 4 operations to compute the
required deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture requires far fewer operations.
The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
induction versus $4k^2$ for the parallel-prefix method.  Using the computed
deletion information, only a single SIMD rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.



\section{Beyond Parallel Bit Streams}

It can be argued that text processing with parallel bit
streams by itself is not sufficiently important to justify
the circuit complexity of IDISA.  In this section, then,
we move on to consider further applications that may
benefit from IDISA features.  These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
Further applications for narrow field widths are discussed
as well, such as packed DNA representations using 2-bit
field widths and packed decimal representations using 4-bit
field widths. Most significantly, however, we show that IDISA offers
a fully general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Inductive Doubling with Logic Operations}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >>16);
y = y & 1;
\end{verbatim}
\end{center}
\caption{Parity Reference Algorithm}
\label{HD-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = x - 1;
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
x = x + 1;
\end{verbatim}
\end{center}
\caption{Power-of-2 Ceiling Reference Algorithm}
\label{HD-clp2}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x, h>(x, x);
x = simd<4>::or<x, h>(x, x);
x = simd<8>::or<x, h>(x, x);
x = simd<16>::or<x, h>(x, x);
x = simd<32>::or<x, h>(x, x);
x = simd<32>::add(x, simd<32>::const(1));
\end{verbatim}
\end{center}
\caption{IDISA Power-of-2 Ceiling Implementation}
\label{ID-clp2}
\end{figure}

Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
code for two further inductive doubling examples using
logical operators as the combining feature.  In the
first of these, the ``exclusive or'' operation is applied
to all bits in a 32-bit value to determine parity.
Parity has important applications for error-correcting
codes such as the various Hamming codes for detecting
and correcting numbers of bit errors dependent on the
number of parity bits added.  Warren's version uses
11 operations for parity of a single 32-bit value;
a straightforward SWAR adaptation also uses 11 operations
for the parity of a set of 32-bit fields.

Figure \ref{ID-parity} shows the IDISA parity implementation
with only 5 operations required.  This represents slightly
more than a 2X improvement.  The improvement is less than
the 3X seen in other cases because one of the operands need
not be modified before applying the exclusive-or operation.

Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
code for the least power-of-2 ceiling of a 32-bit value.  The
technique is to employ inductive doubling to fill in one bits at
all bit positions after the first one bit in the original value to
first produce a value of the form $2^n-1$.  This is then incremented
to determine the power of 2 ceiling.  This function and
its straightforward adaptation for SWAR application on a
set of 32-bit fields requires 12 operations.   The
corresponding IDISA implementation of Figure \ref{ID-clp2}
requires 7 operations, just under a 2X improvement.

\subsection{Packed DNA Representation}

DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented
as packed sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
allow these packed representations to be quickly computed
from byte-oriented string values by two steps of inductive
halving.   Similarly, conversion back to string form
can use two steps of inductive merging.  Without direct
support for these pack and merge operations, the SWAR
implementations of these conversions require the cost
of explicit masking and shifting in combination with
the 16-bit to 8-bit packing and 8-bit to 16-bit
merging operations supported by existing SWAR facilities
on commodity processors.
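
As a sketch (assuming each byte already holds its 2-bit
nucleotide code in its low bits), four registers of byte codes
reduce to one register of packed 2-bit codes in two inductive
halving steps.
\begin{verbatim}
n0 = simd<8>::pack<l,l>(s0, s1); // bytes to
n1 = simd<8>::pack<l,l>(s2, s3); //   nybbles
p  = simd<4>::pack<l,l>(n0, n1); // to 2-bit
\end{verbatim}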
1142
1143\subsection{Bit Reverse}
1144
1145Include or omit?

\subsection{String/Decimal/Integer Conversion}
\begin{figure}
\begin{center}\small
\begin{verbatim}
b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b=(b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b=(b & 0x0000FFFF) + 10000 * (b >> 16);
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
t1=simd<8>::const(10);
t2=simd<16>::const(100);
t3=simd<32>::const(10000);
b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d);
b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b);
b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b);
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer Implementation}
\label{ID-BCD2int}
\end{figure}

Just as DNA sequences represent an important use case for
SWAR operations on 2-bit fields, packed sequences of
decimal or hexadecimal digits represent a common use case
for 4-bit fields.  These representations can be used
both as an intermediate form in numeric string to integer
conversion and as a direct representation for
packed binary coded decimal.

Figure \ref{BCD2int} shows a three-step inductive
doubling implementation for conversion of 32-bit packed BCD
values to integer form.  The 32-bit value consists
of 8 4-bit decimal digits.  Pairs of digits are
first combined by multiplying the higher digit
of each pair by 10 and adding the lower digit.  Pairs of these
two-digit results are then further combined by
multiplying the higher of the two
results by 100 and adding the lower.  The final step
combines the two four-digit results by multiplying the
higher one by 10000 and adding the lower.  Overall, 20
operations are required for this implementation,
as for the corresponding SWAR implementation
over sets of 32-bit fields.  Preloading the 6 constants
into registers for repeated use can reduce the number of
operations to 14, at the cost of significant register
pressure.

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}.  This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each addition and multiplication operation
modified at each level.  Overall, this implementation
requires 9 operations, or 6 operations with 3
preloaded constants.  This represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.
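
The behavior of these half-operand multiply-add steps can again
be modeled in scalar C (ours; \verb#halfmask# as defined earlier).
Because each product fits within its $n$-bit field for valid BCD
digit values, a single scalar multiply models the fieldwise
multiply exactly:
\begin{verbatim}
/* Model of b = simd<n>::add<x,l>(simd<n>::mult<h,x>(b,c), b):
   each n-bit field becomes high_half * c + low_half. */
static uint32_t bcd_step(uint32_t b, uint32_t c, int n) {
  uint32_t m = halfmask(n);
  return ((b >> (n / 2)) & m) * c + (b & m);
}

/* 8 packed BCD digits -> binary integer. */
uint32_t bcd2int(uint32_t d) {
  d = bcd_step(d, 10, 8);
  d = bcd_step(d, 100, 16);
  d = bcd_step(d, 10000, 32);
  return d;
}
\end{verbatim}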
\subsection{Systematic Support for Horizontal SIMD Operations}

In SIMD parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers.  Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing a horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs a horizontal add of sets of
four unsigned 8-bit fields within one register
and then combines the results vertically with the
corresponding 32-bit fields of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is applied, whether it is applied
before or after the horizontal operation, and the
nature of the vertical combining operation.
Within this space, commodity SIMD architectures tend
to support only a very few combinations, without any
particular attempt at systematic support for horizontal
operations in general.

In contrast to this {\em ad hoc} support on commodity
processors, IDISA offers a completely systematic treatment
of horizontal operations without any special features beyond
the inductive doubling features already described.
In the simplest case, any vertical operation
\verb#simd<n>::F# on $n$-bit fields gives rise to
an immediate horizontal operation
\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
pairs of $n/2$-bit fields.
For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
while \verb#simd<32>::min<h,l># can produce the
minimum value of adjacent 16-bit fields.
Thus any binary horizontal operation can be implemented
in a single IDISA instruction making use of the \verb:<h,l>:
operand modifier combination.
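
For instance, a scalar model of \verb#simd<16>::add<h,l>(r, r)#
on a 32-bit register (ours, for illustration) is simply:
\begin{verbatim}
/* Each 16-bit result field is the sum of the two
   adjacent 8-bit fields of r. */
static uint32_t hadd8to16(uint32_t r) {
  uint32_t m = 0x00FF00FF;
  return ((r >> 8) & m) + (r & m);
}
\end{verbatim}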

Horizontal combinations of four adjacent fields can also be
realized in a general way through two steps of inductive
doubling.  For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation
on four 32-bit fields.  This four-field combination
can easily be implemented with the following two operations.
%\begin{singlespace}
\begin{verbatim}
t = simd<64>::or<h,l>(x, x);
t = simd<128>::or<h,l>(t, t);
\end{verbatim}
%\end{singlespace}
In general, systematic support for horizontal
combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
in a similar fashion.
Thus, IDISA essentially offers systematic support
for horizontal operations entirely through the
use of \verb:<h,l>: half-operand modifier
combinations.

Systematic support for general horizontal operations
under IDISA also creates opportunity for a design tradeoff:
offsetting the circuit complexity of half-operand
modifiers with the potential elimination of dedicated
logic for some {\em ad hoc} horizontal SIMD operations.
Even if legacy support for these operations is required,
it may be possible to provide that support through
software or firmware rather than a full hardware
implementation.  Evaluation of these possibilities
in the context of particular architectures is a potential
area for further work.

\section{Implementation}

IDISA may be implemented as a software
abstraction on top of existing SWAR architectures or
directly in hardware.  In this section, we briefly
discuss the implementation of IDISA libraries before
moving on to consider hardware design.  Although
a full realization of IDISA in hardware is beyond our
current capabilities, our goal is to develop a sufficiently
detailed design to assess the costs of IDISA implementation
in terms of the incremental complexity over the RefA
and RefB architectures.  The cost factors we consider, then,
are the implementation of the half-operand modifiers and
the extension of core operations to the 2-bit, 4-bit and
128-bit field widths.  In each case, we also discuss
design tradeoffs.

\subsection{IDISA Libraries}

Implementation of IDISA instructions using template
and macro libraries has been useful in developing
and assessing the correctness of many of the algorithms
presented here.  Although these implementations do not
deliver the performance benefits associated with
direct hardware implementation of IDISA, they
have been quite useful in providing a practical means
for portable implementation of parallel bit stream
algorithms on multiple SWAR architectures.  However,
one additional facility has also proven necessary for
portability of parallel bit stream algorithms across
big-endian and little-endian architectures: the
notion of shift-forward and shift-back operations.
In essence, shift forward means shift to the left
on little-endian systems and shift to the right on
big-endian systems, while shift back has the reverse
interpretation.  Although this concept is unrelated to
inductive doubling, its inclusion in the IDISA
libraries has provided a suitable basis for portable
SIMD implementations of parallel bit stream algorithms.
Beyond this, the IDISA libraries have the additional
benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without the need for programmer coding of
the underlying shift and mask operations.
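
As an example of the shift-forward facility, a minimal sketch
for 32-bit scalars (ours; the macro names are illustrative, and
the endianness test relies on the GCC/Clang predefined macros) is:
\begin{verbatim}
#if defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define SHIFT_FORWARD(x, c) ((x) >> (c))
#define SHIFT_BACK(x, c)    ((x) << (c))
#else
#define SHIFT_FORWARD(x, c) ((x) << (c))
#define SHIFT_BACK(x, c)    ((x) >> (c))
#endif
\end{verbatim}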

\subsection{IDISA Model}
\begin{figure}[tbh]
\begin{center}
\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
\caption{IDISA Block Diagram}
\label{pipeline-model}
\end{center}
\end{figure}

Figure \ref{pipeline-model} shows a block diagram
for a pipelined SIMD processor implementing IDISA.
The SIMD Register File (SRF) provides a file of $R = 2^A$
registers each of width $N = 2^K$ bits.
IDISA instructions identified by the Instruction Fetch
Unit (IFU) are forwarded for decoding to the SIMD
Instruction Decode Unit (SIDU).  This unit decodes
the instruction to produce
signals identifying the source and destination
operand registers, the half-operand modifiers, the
field width specification and the SIMD operation
to be applied.

The SIDU supplies the source register information and the half-operand
modifier information to the SIMD Operand Fetch Unit (SOFU).
For each source operand, the SIDU provides an $A$-bit register
address and two 1-bit signals $h$ and $l$ indicating the value
of the decoded half-operand modifiers for this operand.
Only one of these values may be 1; both are 0 if
no modifier is specified.
The SIDU also supplies decoded field width signals $w_k$
for each field width $2^k$, both to the SOFU and to the
SIMD Instruction Execute Unit (SIEU).  Only one of the
field width signals has the value 1.
The SIDU also supplies decoded SIMD opcode information to the SIEU and
a decoded $A$-bit register address for the destination register to
the SIMD Result Write Back Unit (SRWBU).

The SOFU is the key component of the IDISA model that
differs from that found in a traditional SWAR
processor.  For each of the two $A$-bit source
register addresses, the SOFU is first responsible for
fetching the raw operand values from the SRF.
Then, before supplying operand values to the
SIEU, the SOFU applies the half-operand modification
logic as specified by the $h$, $l$, and field-width
signals.  The possibly modified operand values are then
provided to the SIEU for carrying out the SIMD operations.
A detailed model of SOFU logic is described in the following
subsection.

The SIEU differs from similar execution units in
current commodity processors primarily by providing
SIMD operations at each field width
$n=2^k$ for $0 \leq k \leq K$.  This involves
additional circuitry for field widths not supported
in existing processors.  In our evaluation model,
IDISA-A adds support for 2-bit, 4-bit and 128-bit
field widths in comparison with the RefA architecture,
while IDISA-B similarly extends RefB.

When execution of the SWAR instruction is
completed, the result value is then provided
to the SRWBU to update the value stored in the
SRF at the address specified by the $A$-bit
destination operand.

\subsection{Operand Fetch Unit Logic}

The SOFU is responsible for implementing the half-operand
modification logic for each of up to two input operands fetched
from the SRF.  For each operand, this logic is implemented
using the decoded half-operand modifier signals $h$ and $l$,
the decoded field width signals $w_k$ and the 128-bit operand
value $r$ fetched from the SRF to produce a modified 128-bit operand
value $s$ following the requirements of equations (4), (5) and
(6) above.  Those equations must be applied for each possible
modifier and each field width to determine the possible values $s[i]$
for each bit position $i$.  For example, consider bit
position 41, whose binary 7-bit address is $0101001$.
Considering the address bits left to right, each 1 bit
corresponds to a field width for which this bit lies in the
low $n/2$ bits (widths 2, 16, 64), while each 0 bit corresponds to a field
width for which this bit lies in the high $n/2$ bits.
In response to the half-operand modifier signal $h$,
this bit may receive the value of the bit whose address has a 0
in the position corresponding to field width 2, 16 or 64,
namely $r[40]$, $r[33]$ or $r[9]$, respectively.  Otherwise,
this bit receives the value $r[41]$: either when no half-operand
modifier is specified, or when a low half-operand modifier is
specified in conjunction with a field width signal $w_2$, $w_{16}$
or $w_{64}$.
The overall logic for determining this bit value is thus given as follows.
\begin{eqnarray*}
s[41] & = & h \wedge (w_2 \wedge r[40] \vee w_{16} \wedge r[33] \vee w_{64} \wedge r[9]) \\
& & \vee \neg h \wedge (\neg l \vee w_2 \vee w_{16} \vee w_{64}) \wedge r[41]
\end{eqnarray*}

Similar logic is determined for each of the 128 bit positions.
For each of the 7 field widths, 64 bits are in the low $n/2$ bits,
resulting in 448 2-input and gates for the $w_k \wedge r[i]$ terms.
For 120 of the bit positions, or gates are needed to combine these
terms; the remaining $448 - 7 = 441$ terms across these 120 positions
require $441 - 120 = 321$ 2-input or gates.  Another
127 2-input and gates combine these values with the $h$ signal.
In the case of a low half-operand modifier, the or gates combining the $w_k$
signals can share circuitry.  For each bit position $i=2^k+j$, one
additional or gate is required beyond that for position $j$.
Thus 127 2-input or gates are required.  Another 256 2-input and gates
are required for combination with the $\neg h$ and $r[i]$ terms.  The terms for
the low and high half-operand modifiers are then combined with an
additional 127 2-input or gates.  Thus, the circuitry complexity
for the combinational logic implementation of half-operand
modifiers within the SOFU is 1279 2-input gates per operand,
or 2558 gates in total.
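
The net behavior being implemented is nonetheless simple;
the following C sketch (ours, over a 32-bit register model
rather than the 128-bit hardware register) summarizes it:
\begin{verbatim}
#include <stdint.h>

/* n: decoded field width; h, l: half-operand modifier
   signals (at most one is set). */
static uint32_t sofu_modify(uint32_t r, int n,
                            int h, int l) {
  uint32_t m = 0, half = (1u << (n / 2)) - 1;
  for (int i = 0; i < 32; i += n)
    m |= half << i;        /* low n/2 bits of each field */
  if (h) return (r >> (n / 2)) & m;  /* high halves, moved down */
  if (l) return r & m;               /* low halves kept */
  return r;                          /* no modifier */
}
\end{verbatim}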

The gate-level complexity of half-operand modifiers as described is nontrivial,
but modest.  However, one possible design tradeoff is to
differentiate the two operands, permitting a high half-operand
modifier to be used only with the first operand and a low modifier
to be used only with the second operand.  This would exclude
\verb#<h,h># and \verb#<l,l># modifier combinations and also
certain combinations for noncommutative core operations.
The principal
consequence for the applications considered here would be with
respect to the pack operations in forward transposition, but it
may be possible to address this through SIEU circuitry.
If this approach were to be taken, the gate complexity of
half-operand modification would be reduced by slightly more than half.

\subsection{2-Bit and 4-Bit Field Widths}

Beyond the half-operand modifiers, extension of core SWAR
operations to 2-bit and 4-bit field widths is critical to
inductive doubling support.  The principal operations
that need to be supported in this way are addition, pack,
merge, and rotate.

Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
as a modification to that for 8-bit fields by incorporating logic to
disable carry propagation at the 16 midfield boundaries.  For 2-bit
fields, disabling carry propagation at 32 additional boundaries suffices,
although it may be simpler to implement the logic
of 2-bit adders directly.
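
In software terms, the same boundary-carry suppression can be
modeled as follows (a C sketch of the technique, ours, not a
description of the hardware):
\begin{verbatim}
/* Add 4-bit fields: mask off the top bit of each field,
   add, then restore the top bits by exclusive-or so that
   no carry crosses a field boundary. */
static uint32_t simd4_add(uint32_t a, uint32_t b) {
  uint32_t hi = 0x88888888;
  return ((a & ~hi) + (b & ~hi)) ^ ((a ^ b) & hi);
}
\end{verbatim}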

Pack and merge require bit selection logic for each field width.
A straightforward implementation model for each operation
uses 128 2-input and gates to select the desired bits from the
operand registers and another 128 2-input or gates to feed
these results into the destination register.

Rotation for 2-bit fields involves simple logic for selecting
between the 2 bits of each field of the operand being
rotated on the basis of the low bit of each field of the
rotation count.  Rotation for 4-bit fields is more complex,
but can also be based on 1-of-4 selection circuitry
involving the low 2 bits of the rotation count fields.
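
The following C sketch (ours) models this selection logic for
2-bit fields on a 32-bit register model:
\begin{verbatim}
/* Model of simd<2>::rotl(x, r): each 2-bit field of x
   is rotated by the low bit of the matching field of r. */
static uint32_t simd2_rotl(uint32_t x, uint32_t r) {
  uint32_t lo  = 0x55555555;
  uint32_t sel = (r & lo) | ((r & lo) << 1); /* rotate flag */
  uint32_t sw  = ((x & lo) << 1) | ((x >> 1) & lo);
  return (sw & sel) | (x & ~sel);  /* 1-of-2 per field */
}
\end{verbatim}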

\subsection{128-Bit Field Widths}

For completeness, the IDISA model requires core operations
to be implemented at the full register width, as well
as at all power-of-2 partitions.  This may be problematic for
operations such as addition due to the inherent delays
in 128-bit carry propagation.  However, the primary
role of 128-bit operations in inductive doubling
is to combine two 64-bit fields using \verb#<h,l>#
operand combinations.  In view of this, it may be
reasonable to define hardware support for such
combinations to be based on 64-bit logic, with support
for 128-bit logic implemented through firmware or software.

\subsection{Final Notes and Further Tradeoffs}

In order to present IDISA as a concept for design extension of
any SWAR architecture, our discussion of gate-level implementation
is necessarily abstract.  Additional circuitry is sure to be
required, for example, in the implementation of the SIDU.  However,
in the context of the 128-bit reference architectures studied,
our analysis suggests that a realistic IDISA implementation fits well
within a 10,000 gate budget.

Moreover, the additional circuitry required may be offset
by the elimination of special-purpose instructions found in
existing processors that could instead be implemented through efficient
IDISA sequences.  These include examples such as population
count, count leading and/or trailing zeroes, and parity.
They also include specialized horizontal SIMD operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special-purpose instructions
in favor of more general IDISA features.

\section{Conclusions}

In considering the architectural support for
SIMD text processing using the method of parallel bit streams,
this paper has presented the principle of inductive doubling
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA offers support for SWAR operations at all power-of-2
field widths, including the 2-bit and 4-bit widths in particular,
as well as half-operand modifiers and pack and merge operations
to support efficient transition between successive power-of-2
field widths.  The principal innovation is the notion of
half-operand modifiers that eliminate the cost associated
with the explicit mask and shift operations required for
such transitions.

Several algorithms key to parallel bit stream methods
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on reference architectures.  This
includes both a reference architecture modeled on
the SWAR capabilities of the SSE family and an
architecture incorporating the powerful permute
or shuffle capabilities found in Altivec or Cell BE processors.
In the case of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with the
lowest total number of operations achievable by
any possible 3-register SWAR architecture.

Applications of IDISA in other areas have also been
examined.  The support for 2-bit and 4-bit field widths
in SWAR processing is beneficial for packed DNA representations
and packed decimal representations, respectively.  Additional
inductive doubling examples have been presented and the phenomenon
of power-of-2 transitions discussed more broadly.
Most significantly, IDISA supports a fully general approach
to horizontal SWAR operations, offering a considerable
improvement over the {\em ad hoc} sets of special-purpose
horizontal operations found in existing SWAR instruction sets.

An IDISA implementation model has been presented employing
a customized operand fetch unit to implement the half-operand
modifier logic.  Gate-level implementation of this unit
and of operations at the 2-bit and 4-bit field widths has
been analyzed and found to be quite reasonable within
a 10,000 gate bound.  Design tradeoffs to reduce the cost
have also been discussed, possibly even leading to a
net complexity reduction through elimination of instructions
that implement special-case versions of inductive doubling.

Future research may consider the extension of inductive doubling
support in further ways.  For example, it may be possible
to develop a pipelined architecture supporting two or three
steps of inductive doubling in a single operation.

%\appendix
%\section{Appendix Title}
%
%This is the text of the appendix, if you need one.

\acks

This research was supported by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada.

%\bibliographystyle{plainnat}
\bibliographystyle{plain}
\bibliography{asplos094-cameron}
%\begin{thebibliography}{}

%\bibitem{smith02}
%Smith, P. Q. reference text

%\end{thebibliography}


\end{document}