%-----------------------------------------------------------------------------
%
% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
%
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
%
%-----------------------------------------------------------------------------
\input epsf

%\documentclass[preprint,natbib,10pt]{sigplanconf}
%\documentclass[natbib,10pt]{sigplanconf}
\documentclass[10pt]{sigplanconf}

\usepackage{amsmath}
\usepackage{graphicx}

\begin{document}

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
\CopyrightYear{2009}
\copyrightdata{978-1-60558-215-3/09/03}


\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
           {School of Computing Science, Simon Fraser University}
           {\{cameron, lindanl\}@cs.sfu.ca}


\maketitle

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching.   Direct architectural support for these algorithms in future SIMD
instruction sets could further increase performance as well as simplify the
programming task. A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique.  These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement.  The extensions are also shown to be useful
for SIMD programming in other application areas, including providing a
systematic treatment for horizontal operations.  An implementation model for
these extensions
involves relatively simple circuitry added to the operand fetch components
in a pipelined processor.
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2


\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing as the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  In fact,
the study even speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle with traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes).  Loading bit stream data into 128-bit SIMD registers,
then, allows data from 128 consecutive code units to be represented and
processed at once.  Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
employing the SSE or Altivec instruction sets.  A full implementation
for each of these platforms is documented using literate programming techniques
and available as an open source code base \cite{u8u16}.
Although this transcoding task represents but a single text processing kernel,
it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
of processing time \cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
it illustrates a text processing task that can largely
be carried out using SIMD operations even though
UTF-8 character sequences are variable length. Indeed, one weakness of the
actual implementation is that it reverts to byte-at-a-time processing
at block boundaries, employing a block shortening strategy to
reduce block lengths from 128 bytes to as low as 125.
With more capable SIMD instruction set architectures providing
better facilities for multiregister shifts and larger register
files, a solution employing SIMD techniques for virtually all
of the main work would maintain better data alignment, avoid
problems of register pressure and be easier to parallelize across
multiple cores.  It should also naturally scale to 256-bit SIMD technology
such as the expected AVX technology of Intel.

The report addressing the broader problem of XML parsing is perhaps more
interesting, demonstrating the utility of parallel bit stream techniques
in delivering performance benefits through a significant portion of the
web technology stack.  In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with similarly
coded applications using such well-known parsers as Expat and Xerces.
Although still a work in progress, the parser has functional
coverage of XML and related specifications comparable to, but somewhat
beyond, that of Expat.   The study also identifies promising further
work in extending the parallel bit stream methods to parallel
hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SIMD instruction set
features based on the principle of systematic support for
inductive doubling algorithms.  Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration.  In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths.   These transitions are quite frequent in several critical
algorithms for parallel bit streams, and they occur
in other applications as well.  In related work,
efficient automatic interleaving based on power-of-2 strides has been
reported quite useful for a number of SIMD kernels \cite{Nuzman}.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.

The second section of this paper introduces IDISA and the
SIMD notation used throughout this paper.  A brief comparison of
IDISA features with existing SIMD instruction
sets of commodity processors such as the Intel SSE
instruction set and the Power PC Altivec instruction set
is also made.

The third section then discusses the evaluation methodology
for IDISA.  Two reference architectures are briefly described as
well as criteria for evaluating IDISA against these architectures.

The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count.   Although this operation
is not a strong determinant of performance for parallel bit
stream applications, it is nevertheless an operation needed
frequently enough in the general computing milieu to find
its way into some instruction set architectures, typically
at one particular field width.
%By way of comparison, the
%inductive doubling architecture sacrifices some
%performance at the chosen field width, while offering a more
%general solution with frequently better performance at
%other field widths.

The fifth section then moves on to consider the performance-critical
task of conversion between serial byte streams and parallel
bit streams.  A first algorithm that uses the existing SIMD
operations common to SSE and Altivec is shown, requiring 72
operations to transform 128 bytes of data using the three-register
instruction form.  We then move on to consider how the task may
be simplified using IDISA to
require a mere 24 operations.  As well as providing a 3X speed-up,
it is also argued that the version using the inductive doubling
architecture is considerably simpler and easier to program.

The sixth section then briefly considers the inverse transposition
process, converting parallel bit stream data back into byte
streams.  Again, an algorithm to carry out this task requires
72 straight-line SIMD operations in the Altivec three-register
form, but is reduced to a simpler 24 operations using IDISA.

The seventh section then goes on to consider the problem of
parallel bit deletion.  This operation is performance-critical
to any applications that require filtering or
editing operations on strings using the parallel bit stream
algorithms.   For example, it is fundamental to the
high-speed UTF-8 to UTF-16 transcoding algorithm that is
often a critical component in XML parsing.  In this
section, an inductive doubling algorithm based on the
concept of a central deletion result is described and
shown to have much better performance than a parallel-prefix
alternative.

The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Additional examples from several domains are presented.
Perhaps most importantly, however, the role of IDISA
in supporting a fully systematic treatment of {\em horizontal}
SWAR operations for any application area is discussed.

An implementation model for IDISA is then considered
in section 9 of the paper, focusing on a pipelined
SIMD architecture featuring a modified operand fetch stage.
A gate-count analysis of one feasible implementation is
provided as well as a discussion of the implementation
of required extensions to handle 2-bit and 4-bit fields not
commonly supported on existing SIMD architectures.   Design
tradeoffs are also considered, focusing on the potential removal of
various {\em ad hoc} instructions on existing processors in favor of
more general alternatives provided through IDISA.

The tenth section concludes the paper with a summary of results
and discussion of areas for future work.


\section{Inductive Doubling Architecture}

This section presents an idealized model for a single-instruction
multiple-data (SIMD) instruction set architecture designed
specifically to support inductive doubling algorithms in the
domain of parallel bit stream programming.   The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SIMD applications in other domains.  The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design.  First we introduce a simple model and notation for
SIMD operations in general and then present the four key
features of an idealized architecture in support of parallel
bit stream programming.

The idealized architecture supports typical SIMD integer
operations common to existing commodity architectures such as SSE
and Altivec.   The architecture is defined to support SIMD
operations on registers of size $N=2^K$ bits, for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.   The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$.   Typical values of $k$ used on commodity processors
include $k = 3$ for SIMD operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
register $r$, using big-endian numbering.

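The indexing convention can be made concrete with a small executable model.  The following Python sketch (our illustration of the notation above, not part of the proposed architecture; the helper name is an assumption) extracts field $r_n[i]$ from a 128-bit register value held as an unbounded integer.

```python
# Model an N-bit SWAR register as a Python int and extract field r_n[i]
# using the paper's big-endian field numbering: field 0 occupies the
# most significant n bits of the register.

def field(r, n, i, N=128):
    """Return r_n[i]: bits i*n through (i+1)*n - 1 of r, big-endian."""
    return (r >> (N - (i + 1) * n)) & ((1 << n) - 1)

r = 0x0123456789ABCDEF0123456789ABCDEF
assert field(r, 8, 0) == 0x01    # field 0 is the most significant byte
assert field(r, 8, 15) == 0xEF   # field N/n - 1 is the least significant
assert field(r, 4, 1) == 0x1     # second-highest 4-bit field
```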
Let \verb:simd<n>: represent the class of SIMD operations defined
on fields of size $n$ using C++ template syntax.  Given a
binary function $F_n$ on $n$-bit fields, we denote the SIMD
version of this operation as \verb#simd<n>::F#.  Given two
SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\[ r_i = F_n(a_i, b_i) \]

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
%\singlespace
\begin{eqnarray}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray}
%\doublespace
The SSE and Altivec instruction sets support
each of the addition and subtraction operations in SIMD form
for 8, 16 and 32-bit fields, while the SSE set also includes
the 64-bit forms.  For example, \verb#simd<8>::add# in our
nomenclature is provided by the operation \verb:paddb:
on SSE and the operation \verb:vaddubm: on Altivec.
The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
\verb#simd<32>::sll# are available on Altivec; the SSE facilities are
more constrained, requiring that all fields shift by the same amount.

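The \verb#simd<n>::F# semantics can be modelled executably.  The Python sketch below (our illustrative model, with 128-bit registers held as integers; the function names follow the definitions above, the register plumbing is ours) applies a binary function independently to each pair of $n$-bit fields.

```python
# Apply a binary function F_n independently to each pair of n-bit fields
# of two N-bit registers held as Python ints (big-endian field order).

def simd_op(F, n, a, b, N=128):
    r = 0
    mask = (1 << n) - 1
    for i in range(N // n):
        sh = N - (i + 1) * n                 # big-endian field offset
        r |= F((a >> sh) & mask, (b >> sh) & mask, n) << sh
    return r

# The three field operations defined in the text:
add = lambda a, b, n: (a + b) % (1 << n)
sub = lambda a, b, n: (a - b) % (1 << n)
sll = lambda a, b, n: (a << (b % n)) % (1 << n)

# simd<8>::add with per-field wraparound: 0xFF + 0x02 = 0x01 in each byte
x = int.from_bytes(bytes([0xFF] * 16), 'big')
y = int.from_bytes(bytes([0x02] * 16), 'big')
assert simd_op(add, 8, x, y) == int.from_bytes(bytes([0x01] * 16), 'big')
```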
Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values.  The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers.  Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms on bit streams, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is critical for elegant and efficient expression of the
algorithms.  In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next larger power-of-2
field width.  In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width.  In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core.  In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1.  These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available.  For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation on SIMD registers.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SIMD
operation.  These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:'').  When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified.  The semantics of these
modifiers are given by the following equations.
%\singlespace
\begin{eqnarray}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray}
%\doublespace
In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
together with the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.

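The modifier equations can be checked with a small executable model.  The Python sketch below (our illustration; the function and helper names are assumptions, not IDISA syntax) applies half-operand modifiers field by field and verifies the \verb#simd<4>::add<h,l># example just described.

```python
# Half-operand modifiers on a single n-bit field value, per the
# equations above, plus a field-wise modified add built from them.

def l(v, n): return v % (1 << (n // 2))   # low n/2 bits
def h(v, n): return v >> (n // 2)         # high n/2 bits
def x(v, n): return v                     # full n bits, unmodified

def simd_add_mod(n, ma, mb, a, b, N=128):
    """Model of simd<n>::add<ma,mb>(a,b) on N-bit registers held as ints."""
    r = 0
    mask = (1 << n) - 1
    for i in range(N // n):
        sh = N - (i + 1) * n
        fa = (a >> sh) & mask
        fb = (b >> sh) & mask
        r |= ((ma(fa, n) + mb(fb, n)) % (1 << n)) << sh
    return r

# simd<4>::add<h,l>(a,a) on one 4-bit field: 0xB = 0b1011,
# high half 0b10 = 2, low half 0b11 = 3, so the result is 5.
assert simd_add_mod(4, h, l, 0xB, 0xB, N=4) == 5

# The same operation at 8-bit width across a full 128-bit register:
a8 = int.from_bytes(bytes([0xB4] * 16), 'big')   # 0xB + 0x4 = 0xF per byte
assert simd_add_mod(8, h, l, a8, a8) == int.from_bytes(bytes([0x0F] * 16), 'big')
```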
The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray}
%\doublespace
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations.  Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.


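A runnable sketch of the pack semantics follows, assuming conversion by signed saturation as just described (the conv definition follows the text; the register plumbing and helper names are our own).

```python
# simd<n>::pack(a,b): convert each n-bit field of a, then of b, to n/2
# bits by signed saturation and concatenate the results into one
# N-bit register (big-endian field order throughout).

def conv_signed_sat(v, n, m):
    """Convert the unsigned n-bit field value v to m bits, saturating
    as a signed integer; return the resulting m-bit pattern."""
    s = v - (1 << n) if v >= (1 << (n - 1)) else v   # reinterpret as signed
    s = max(-(1 << (m - 1)), min((1 << (m - 1)) - 1, s))
    return s & ((1 << m) - 1)                        # back to raw bits

def simd_pack(n, a, b, N=128):
    r = 0
    for i in range(2 * N // n):
        src, j = (a, i) if i < N // n else (b, i - N // n)
        v = (src >> (N - (j + 1) * n)) & ((1 << n) - 1)
        r |= conv_signed_sat(v, n, n // 2) << (N - (i + 1) * (n // 2))
    return r

# Pack two registers of 16-bit fields into one register of 8-bit fields.
a = int.from_bytes(b'\x00\x7f' * 8, 'big')   # each field holds 127
b = int.from_bytes(b'\x01\x00' * 8, 'big')   # each field holds 256
# 127 fits in a signed byte; 256 saturates to 127.
assert simd_pack(16, a, b) == int.from_bytes(b'\x7f' * 16, 'big')
```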
The final facility of the inductive doubling architecture is
a set of merging operations \verb#simd<n>::mergeh# and
\verb#simd<n>::mergel# that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers.   The respective
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
%\singlespace
\begin{eqnarray}
r_{2n}[i] & = & a_n[i] \times 2^n + b_n[i] \\
s_{2n}[i] & = & a_n[i+N/(2n)] \times 2^n + b_n[i+N/(2n)]
\end{eqnarray}
%\doublespace
Both SSE and Altivec provide versions of pack and merge operations
for certain field widths.  The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.

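The merge equations translate directly into a small Python model (our illustration: under the big-endian field numbering, \verb:mergeh: interleaves the high-order fields of the two operands and \verb:mergel: the low-order fields).

```python
# simd<n>::mergeh / mergel: interleave corresponding n-bit fields of
# a and b into 2n-bit fields, per the equations in the text.

def fields(r, n, N=128):
    """All n-bit fields of r, in big-endian order."""
    return [(r >> (N - (i + 1) * n)) & ((1 << n) - 1) for i in range(N // n)]

def merge(n, a, b, high, N=128):
    fa, fb = fields(a, n, N), fields(b, n, N)
    base = 0 if high else N // (2 * n)      # mergeh: fields 0..; mergel: upper half
    r = 0
    for i in range(N // (2 * n)):
        v = fa[base + i] * (1 << n) + fb[base + i]   # concatenate a-field, b-field
        r |= v << (N - (i + 1) * 2 * n)
    return r

a = int.from_bytes(bytes(range(16)), 'big')        # bytes 0x00..0x0F
b = int.from_bytes(bytes(range(16, 32)), 'big')    # bytes 0x10..0x1F
assert merge(8, a, b, high=True) == int.from_bytes(
    bytes([0x00, 0x10, 0x01, 0x11, 0x02, 0x12, 0x03, 0x13,
           0x04, 0x14, 0x05, 0x15, 0x06, 0x16, 0x07, 0x17]), 'big')
```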
This completes the description of IDISA.  As described, many of the
features are already available with the SIMD facilities of
existing commodity processors.  The extensions enumerated
here are relatively straightforward.  The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor.  The goal
in this paper is to evaluate these features independent
of artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures.  For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous
section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields.  The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields.  In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1.  We also assume that
a constant load operation \verb#simd::const<n>(c)#
loads the constant value $c$ into each $n$-bit field.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)# rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-register \verb#simd<1>::if(a,b,c)# bitwise logical
operator implements the logic $a \wedge b \vee \neg a \wedge c$, patterned
after the Altivec \verb:vec_sel: operation.  Finally,
a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

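The bitwise semantics of \verb#simd<1>::if# follow directly from the logic just given; the sketch below models the operator's truth table (our illustration, not any particular instruction encoding).

```python
# simd<1>::if(a,b,c): where a has a 1 bit, take b's bit; where a has
# a 0 bit, take c's bit -- the a&b | ~a&c select logic from the text.

def simd_if(a, b, c):
    return (a & b) | (~a & c)

# Python ints are unbounded, so mask to a register width when comparing.
M = (1 << 128) - 1
assert simd_if(0xF0F0, 0xFFFF, 0x0000) & M == 0xF0F0
assert simd_if(0xFF00, 0x1234, 0xABCD) & 0xFFFF == 0x12CD
```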
Two versions of IDISA are assessed against these reference
architectures as follows.  IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128.  IDISA-B is
similarly defined and extended based on RefB.  Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption.  On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be
pipelined processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access.  Such processors are certainly feasible with a
suitable investment in circuitry, although practical designs
may make sacrifices to achieve close to optimal throughput
with less circuitry.  However, the assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.  Furthermore,
the assumption also eliminates artifacts due to possibly different
latencies in reference and IDISA architectures.  Because
the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the
assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, the
applications tend to be compute-bound rather than IO-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section.  That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}[h]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  It is important
enough for such operations as calculating Hamming distance to be included
as a built-in instruction
on some processors.  For example, the SPU of the Cell Broadband Engine
has a SIMD population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields.  Each implementation employs
five steps of inductive doubling to produce population counts
within 32-bit fields.  The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.

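The reference algorithm of Figure \ref{HD-pop} is directly executable; the following Python transcription (ours) checks the five inductive doubling steps against an independent bit count.

```python
# Figure "HD-pop" transcribed to Python: five mask-and-add steps, each
# doubling the field width of the partial counts (2, 4, 8, 16, 32 bits).

def popcount32(x):
    c = (x & 0x55555555) + ((x >> 1) & 0x55555555)
    c = (c & 0x33333333) + ((c >> 2) & 0x33333333)
    c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F)
    c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF)
    c = (c & 0x0000FFFF) + ((c >> 16) & 0x0000FFFF)
    return c

for v in (0, 1, 0xFFFFFFFF, 0xDEADBEEF, 0x12345678):
    assert popcount32(v) == bin(v).count('1')
```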
With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SWAR operation.   Without the IDISA features, a
straightforward RefA implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, it is possible that these constants
could be kept preloaded in long vector processing.  In accord
with our evaluation model, the RefB cost is thus 15 operations.
As the IDISA implementation requires no constants at all,
both the IDISA-A and IDISA-B cost is 5 operations.
At our assumed one CPU cycle per instruction model, IDISA-A
offers a 4X improvement over RefA, while IDISA-B offers a 3X
improvement over its comparator.

The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step.  If there is
one such operation for each operand, and the combination
can be implemented in a single SWAR operation, then a total
of $3n$ operations are required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks.  IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$.   Thus a 3X to 4X improvement
is obtained in these cases.

\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of IDISA: transposition of byte stream data to parallel bit stream
form.   Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process.  Before considering how
IDISA supports this
transposition process, however, we first consider
algorithms on existing architectures.  Two algorithms
are presented; the best of these requires 72
SWAR operations under the RefB model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using IDISA-A or IDISA-B
with a mere 24 three-register SIMD operations.  We also show
that this is optimal for any three-register instruction set model.

626\begin{figure}[tbh]
627\begin{center}
628\includegraphics[width=90mm, trim= 50 50 0 50]{S2P_IO.pdf}
629\caption{Input/Output Model for Serial to Parallel Transposition}
630\label{s2p-spec}
631\end{center}
632
633\end{figure}
634Figure \ref{s2p-spec} illustrates the input-output requirements of
635the transposition problem.  We assume that inputs and
636outputs are each SWAR registers of size $N=2^K$ bits.
637The input consists of $N$ bytes of serial byte data,
638stored consecutively in eight SIMD registers each holding
639$N/8$ bytes.  The output consists of eight parallel
640registers, one each for the eight individual bit positions
641within a byte.  Upon completion of the transposition process,
642each output register is to hold the $N$ bits corresponding
643to the selected bit position in the sequence of $N$ input
644bytes.
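The input/output contract can be stated as a plain-Python reference model (illustrative only; taking bit 0 of a byte to be the most significant bit is an assumed convention for this sketch):

```python
def transpose_to_bit_streams(data):
    """Reference model: stream k collects bit k of every input byte,
    where bit 0 is taken to be the most significant bit."""
    streams = []
    for k in range(8):
        bits = 0
        for byte in data:
            bits = (bits << 1) | ((byte >> (7 - k)) & 1)
        streams.append(bits)
    return streams

# 0x80 contributes a 1 only to stream 0; 0x01 only to stream 7.
assert transpose_to_bit_streams([0x80, 0x01]) == [2, 0, 0, 0, 0, 0, 0, 1]
```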

\subsection{Bit Gathering Algorithm}

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
\caption{Serial to Parallel Transposition Using Bit-Gathering}
\label{gather}
\end{center}
\end{figure}
One straightforward algorithm for implementing the transposition process
takes advantage of SWAR bit gathering operations that exist
on some architectures.  This operation gathers one bit per byte
from a particular position within each byte of a SIMD register.
For example, the {\tt pmovmskb} operation of the Intel MMX and
SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
consisting of the high bit of each byte.  Similarly, the
\verb:si_gbb: operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte
from the SIMD register.  Figure \ref{gather} illustrates the
bit gathering process.

For each bit stream of output, the bit gather algorithm requires
one gather operation for each of the 8 input registers,
giving 64 bit gather operations in all.  In addition, for seven
of the eight bit positions, it is necessary to shift the bits
of each input register into the conventional position for
gathering.  A total of 56 shift operations are required for this
task.  Finally, the result of each bit gather operation must
be properly inserted into the output stream.  If the first
gather operation for a stream can be used to also initialize
the output register, there will then need to be 7 insert
operations for the results of the remaining gather operations
for this stream, making 56 insert operations in all.
The insert step may be more complex than a single operation
in some cases, but we use one operation per insert as a lower bound.
Thus, the bit gather algorithm requires
at least 176 operations to perform the transposition task.
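The shift/gather/insert structure can be sketched in plain Python (an illustrative model, not real intrinsics; \verb:movemask: imitates the high-bit gather of {\tt pmovmskb}, and the bit ordering chosen here is for readability only):

```python
def movemask(reg, nbytes):
    """Gather the high bit of each byte of reg (highest byte first in
    this illustrative ordering)."""
    m = 0
    for i in reversed(range(nbytes)):
        m = (m << 1) | ((reg >> (8 * i + 7)) & 1)
    return m

def gather_stream(regs, k, nbytes):
    """Collect bit k (0 = high bit) of every byte across regs:
    one shift + one gather + one insert per register."""
    out = 0
    for reg in regs:
        shifted = (reg << k) & ((1 << (8 * nbytes)) - 1)   # shift (7 of 8 cases)
        out = (out << nbytes) | movemask(shifted, nbytes)  # gather + insert
    return out

assert gather_stream([0x80, 0x00], 0, 1) == 0b10
```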

\subsection{BytePack Algorithm}

A much more efficient transposition algorithm on commodity
SWAR architectures involves three
stages of binary division transformation.  This is similar
to the three stage bit matrix inversion described by
Warren  \cite{HackersDelight}, although modified to use SWAR operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even numbered positions from those
at odd numbered positions.  The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from each of the
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage are
divided into the desired individual bit streams in the third stage.
\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::const(0xAA);
t0 = simd<8>::permute(s0, s1, even);
t1 = simd<8>::permute(s0, s1, odd);
p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{RefB Transposition Step in BytePack Stage 1}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking.  In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.
Figure \ref{s2pstep} shows a stage 1 transposition step in a
RefB implementation.  Using the permute facility, the even
and odd bytes, respectively, from two serial input registers
\verb:s0: and \verb:s1: are packed into temporary registers
\verb:t0: and \verb:t1:.  The even and odd bits are then
separated into two parallel output registers \verb:p0: and \verb:p1:
by selecting alternating bits using a mask.  This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and mask constants.
Overall, 6 operations per step are required, yielding a total
of 72 operations to transpose 128 bytes to parallel bit stream
form in the RefB implementation.

In a RefA implementation, byte packing may also be achieved
by the \verb#simd<16>::pack# operation, with 4 additional operations to
prepare operands.  Essentially, the RefB implementation
uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
each \verb#simd_if#.
Assuming that mask loads are only needed once per 128 bytes,
a total of 148 operations are required in the RefA implementation.

\subsection{Inductive Halving Algorithm}

Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the BytePack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bitpairs in a second stage and finally
dividing the bitpair streams into bit streams in the third stage.

\begin{figure}[tbh]
\small
\begin{verbatim}
p0 = simd<8>::pack<h,h>(s0, s1);
p1 = simd<8>::pack<l,l>(s0, s1);
\end{verbatim}
\caption{Stage 1 Transposition Step in the Inductive Halving Algorithm}
\label{halvingstep}
\end{figure}

Figure \ref{halvingstep} shows one step in stage 1 of the inductive
halving algorithm, comprising just two IDISA-A operations.
The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
the low nybble of each byte.  As in the BytePack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.

Stage 2 of the inductive halving algorithm reduces nybble streams
to streams of bit pairs.  The basic step in this algorithm consists
of one \verb#simd<4>::pack<h,h># operation to extract the high pair
of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
low pair of each nybble.  Four applications of this step complete stage 2.

Stage 3 similarly uses four applications of a step that uses a
\verb#simd<2>::pack<h,h># operation to extract the high bit of
each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair.  The complete algorithm to transform eight serial
byte registers s0 through s7 into the eight parallel bit stream
registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
Under either the IDISA-A or IDISA-B model, a mere 24 operations per 128
input bytes are required.

\begin{figure}[tbh]
\small
\begin{verbatim}
hnybble0 = simd<8>::pack<h,h>(s0, s1);
lnybble0 = simd<8>::pack<l,l>(s0, s1);
hnybble1 = simd<8>::pack<h,h>(s2, s3);
lnybble1 = simd<8>::pack<l,l>(s2, s3);
hnybble2 = simd<8>::pack<h,h>(s4, s5);
lnybble2 = simd<8>::pack<l,l>(s4, s5);
hnybble3 = simd<8>::pack<h,h>(s6, s7);
lnybble3 = simd<8>::pack<l,l>(s6, s7);
hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
\end{verbatim}
\caption{Complete Inductive Halving Algorithm}
\label{halvingalgorithm}
\end{figure}
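The semantics of the \verb#pack# operations used above can be modeled in plain Python (an illustrative simulation with small 16-bit registers, not the paper's implementation):

```python
def split_fields(x, w, total):
    """View a total-bit integer as w-bit fields, high to low."""
    return [(x >> (total - w * (i + 1))) & ((1 << w) - 1)
            for i in range(total // w)]

def join_fields(fs, w):
    """Concatenate w-bit fields back into one integer."""
    out = 0
    for f in fs:
        out = (out << w) | f
    return out

def pack(n, mod, a, b, total=16):
    """Model of simd<n>::pack<mod,mod>(a, b): keep the high ('h') or
    low ('l') n/2 bits of every n-bit field of a, then of b."""
    half = n // 2
    keep = (lambda f: f >> half) if mod == 'h' else (lambda f: f & ((1 << half) - 1))
    return join_fields([keep(f) for f in split_fields(a, n, total)] +
                       [keep(f) for f in split_fields(b, n, total)], half)

# Stage 1 on two 16-bit "registers": high nybbles, then low nybbles.
assert pack(8, 'h', 0xABCD, 0x1234) == 0xAC13
assert pack(8, 'l', 0xABCD, 0x1234) == 0xBD24
```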

\subsection{Optimality of the Inductive Halving Algorithm}

Here we show that the inductive halving algorithm presented in
the previous subsection is optimal in the following sense:
no other algorithm on any 3-register SIMD architecture can use
fewer than 24 operations to transform eight serial registers
of byte stream data into eight parallel registers of bit stream data.
By 3-register SIMD architecture, we refer to any architecture
that uses SIMD instructions consistent with our overall model of
binary operations using two input register operands to produce
one output register value.

Observe that the $N$ data bits from each input register must be
distributed $N/8$ each to the 8 output registers by virtue of
the problem definition.  Each output register can effectively
be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address.  However, each
operation can move results into at most one register, and can
therefore assign at most one correct address bit for each of the
$N$ bits it moves.  As all $8N$ input bits need to be moved to a
register with a correct 3-bit address, $24N$ correct address bits
must be assigned in total, so a minimum of 24 operations is required.

\subsection{End-to-End Significance}

In a study of several XML technologies applied to
the problem of GML to SVG transformation, the parabix
implementation (parallel bit streams for XML) was
found to be the fastest with a cost of approximately
15 CPU cycles per input byte \cite{Herdy}.  Within parabix,
transposition to parallel bit stream form requires
approximately 1.1 cycles per byte \cite{CASCON08}.
All other things being equal, a 3X speed-up of transposition
alone would improve end-to-end performance in a
complete XML processing application by more than 4\%.
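The arithmetic behind this estimate, as a quick check:

```python
total_cycles = 15.0      # cycles per input byte, whole pipeline
transpose_cycles = 1.1   # cycles per input byte spent in transposition

# A 3X transposition speed-up retains one third of that cost.
saved = transpose_cycles * (1 - 1 / 3)
assert saved / total_cycles > 0.04   # more than a 4% end-to-end gain
```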


\section{Parallel to Serial Conversion}

Parallel bit stream applications may apply string editing
operations in bit space to substitute, delete or insert
parallel sets of bits at particular positions.  In such cases,
the inverse transform that converts a set of parallel bit
streams back into byte space is needed.  In the example of
UTF-8 to UTF-16 transcoding, the inverse transform is
actually used twice for each application of the forward
transform, to separately compute the high and low byte
streams of each UTF-16 code unit.  Those two byte streams
are subsequently merged to form the final result.

Algorithms for performing the inverse transform mirror those
of the forward transform, employing SIMD merge operations
in place of pack operations.  The best algorithm known
to us on the commodity SIMD architectures takes advantage
of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
operations that are available with each of the SSE and Altivec instruction
sets.  These algorithms take 72 operations to perform the
inverse transposition of 8 parallel registers of bit stream
data into 8 serial registers of byte stream data.

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
bit01_r0 = simd<1>::mergeh(bit0, bit1);
bit01_r1 = simd<1>::mergel(bit0, bit1);
bit23_r0 = simd<1>::mergeh(bit2, bit3);
bit23_r1 = simd<1>::mergel(bit2, bit3);
bit45_r0 = simd<1>::mergeh(bit4, bit5);
bit45_r1 = simd<1>::mergel(bit4, bit5);
bit67_r0 = simd<1>::mergeh(bit6, bit7);
bit67_r1 = simd<1>::mergel(bit6, bit7);
bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
\end{verbatim}
\end{center}
\caption{Parallel to Serial Transposition by Inductive Doubling}
\label{p2s-inductive}
\end{figure}

An algorithm employing only 24 operations using the
inductive doubling instruction set architecture is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bitpair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bitpair streams to produce streams
of nybbles for the high and low nybble of each byte.  Finally,
the nybble streams are merged in stage 3 to produce the
desired byte stream data.  The full inductive doubling
algorithm for parallel to serial transposition is shown in Figure
\ref{p2s-inductive}.
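The merge semantics can be modeled in plain Python (an illustrative simulation on small 8-bit registers; only \verb#mergeh# is shown, and \verb#mergel# is the same operation applied to the low halves):

```python
def mergeh(n, a, b, total=8):
    """Model of simd<n>::mergeh(a, b): interleave the n-bit fields of
    the high halves of a and b, with a's field first in each pair."""
    nfields = total // n
    fa = [(a >> (total - n * (i + 1))) & ((1 << n) - 1) for i in range(nfields)]
    fb = [(b >> (total - n * (i + 1))) & ((1 << n) - 1) for i in range(nfields)]
    out = 0
    for x, y in zip(fa[:nfields // 2], fb[:nfields // 2]):
        out = (out << n) | x
        out = (out << n) | y
    return out

# simd<1>::mergeh on 8-bit registers: high halves 1111 and 0000
# interleave to 10101010.
assert mergeh(1, 0xF0, 0x00) == 0xAA
```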

This algorithm is again optimal, requiring the fewest operations
of any possible algorithm using any 3-register instruction set
model.  The proof directly follows that for the serial to parallel
transposition.

The existence of high-performance algorithms for transformation of
character data between byte stream and parallel bit stream form
in both directions makes it possible to consider applying these
transformations multiple times during text processing applications.
Just as the time domain and frequency domain each have their
use in signal processing applications, the byte stream form and
parallel bit stream form can then each be used at will in
character stream applications.

\section{Parallel Bit Deletion}

\begin{figure*}[tbh]
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
\end{tabular}
\end{center}
\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
\label{centraldelstep}
\end{figure*}

Parallel bit deletion is an important operation that allows string
editing operations to be carried out while in parallel bit stream
form.  It is also fundamental to UTF-8 to UTF-16 transcoding
using parallel bit streams, allowing the excess code unit
positions for UTF-8 two-, three- and four-byte sequences to
be deleted once the sixteen parallel bit streams of UTF-16 have
been computed \cite{PPoPP08}.

Parallel bit deletion is specified using a deletion mask.
A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
to be deleted and 0s at positions identifying bits to be retained.
For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
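This specification can be captured directly as a small Python model, with symbolic characters standing in for bits as in the example above:

```python
def delete_by_mask(stream, delmask):
    """Keep the elements of stream whose mask position holds '0'."""
    return ''.join(s for s, m in zip(stream, delmask) if m == '0')

assert delete_by_mask('abcdefgh', '10100010') == 'bdefh'
assert delete_by_mask('ABCDEFGH', '10100010') == 'BDEFH'
```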

Bit deletion may be performed using
the parallel-prefix compress algorithm documented by
Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
only logic and shifts with a constant parameter to carry
out the deletion process.  However, it requires $k^2$
preprocessing steps for a final field width parameter
of size $2^k$, as well as 4 operations per deletion step
per stream.  Using the inductive doubling instruction set architecture
it is possible to carry out bit deletion much more efficiently.

Deletion within fixed size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
left-justified result produces an important intermediate form known as a
{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be the respective
right-justified and left-justified results from the application of the
4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.

The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
for an inductive doubling algorithm.  Figure \ref{centraldelstep}
illustrates the inductive process for the transition from 8-bit central
deletion results to 16-bit central deletion results.  The top row shows
the original deletion mask, while the second row shows the original
bit stream to which deletions are to be applied, with deleted bits zeroed out.
The third row shows the central result for each 8-bit field as the
result of the previous inductive step.

To perform the $8 \rightarrow 16$ central deletion step, we first form
the population counts of 4-bit fields of the original deletion mask as
shown in row 4 of Figure \ref{centraldelstep}.  Note that in right-justifying
an 8-bit central result, we perform a right shift by the population count
of the low half of the field.  Similarly,
left-justification requires a left-shift by the population count in the
high half of the field.

The left and right shifts can be performed simultaneously using a rotate
left instruction.  Right justification by shifting an $n$ bit field
$i$ positions to the right is equivalent to a left rotate of $n-i$
positions.  These rotation amounts are computed by the operation
\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
except that don't care fields (which won't be subsequently used)
are marked \verb:XX:.

The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
as shown in row 6, and are combined with the right shift amounts
by the selection operation  \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
as shown in row 7.  Using these computed values, the inductive step
is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
as shown in row 8.
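The rotate-based step can be checked symbolically in Python against the row values of Figure \ref{centraldelstep}, with `0' marking zeroed positions:

```python
def rotl(field, n):
    """Left rotate of a symbolic 8-character field."""
    return field[n:] + field[:n]

rslt_8 = ['00bcgh00', '0ikl0000', '000quvx0', '00yzCDF0']
rot_8 = [6, 1, 7, 2]   # rj = 8 - count(low half); lj = count(high half)

# Rotation right-justifies the even fields and left-justifies the odd
# fields; concatenating each pair yields the 16-bit central results.
rotated = [rotl(f, n) for f, n in zip(rslt_8, rot_8)]
rslt_16 = [rotated[0] + rotated[1], rotated[2] + rotated[3]]
assert rslt_16 == ['0000bcghikl00000', '0000quvxyzCDF000']
```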

At each inductive doubling level, 4 operations are required to compute the
deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
thus requiring 12 operations per inductive level.

In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture has far fewer operations.
The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
induction versus $4k^2$ for the parallel-prefix method.  Using the computed
deletion information, only a single SIMD rotate operation per bit stream
per level is needed, in comparison with 4 operations per level for parallel-prefix
compress.


\section{Beyond Parallel Bit Streams}

It can be argued that text processing with parallel bit
streams by itself is not sufficiently important to justify
the circuit complexity of IDISA.  In this section, then,
we move on to consider further applications that may
benefit from IDISA features.  These include additional
basic inductive doubling
algorithms such as parity and least power-of-2 ceiling.
Further applications for narrow field widths are discussed
as well, such as packed DNA representations using 2-bit
field widths and packed decimal representations using 4-bit
field widths.  Most significantly, however, we show that IDISA offers
a fully general solution to the problem of supporting
{\em horizontal} SWAR operations.

\subsection{Inductive Doubling with Logic Operations}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >>16);
y = y & 1;
\end{verbatim}
\end{center}
\caption{Parity Reference Algorithm}
\label{HD-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = x - 1;
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
x = x + 1;
\end{verbatim}
\end{center}
\caption{Power-of-2 Ceiling Reference Algorithm}
\label{HD-clp2}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
y = simd<2>::xor<h,l>(x, x);
y = simd<4>::xor<h,l>(y, y);
y = simd<8>::xor<h,l>(y, y);
y = simd<16>::xor<h,l>(y, y);
y = simd<32>::xor<h,l>(y, y);
\end{verbatim}
\end{center}
\caption{IDISA Parity Implementation}
\label{ID-parity}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
x = simd<32>::sub(x, simd<32>::const(1));
x = simd<2>::or<x, h>(x, x);
x = simd<4>::or<x, h>(x, x);
x = simd<8>::or<x, h>(x, x);
x = simd<16>::or<x, h>(x, x);
x = simd<32>::or<x, h>(x, x);
x = simd<32>::add(x, simd<32>::const(1));
\end{verbatim}
\end{center}
\caption{IDISA Power-of-2 Ceiling Implementation}
\label{ID-clp2}
\end{figure}

Figures \ref{HD-parity} and \ref{HD-clp2} show Warren's
code for two further inductive doubling examples using
logical operators as the combining feature.  In the
first of these, the ``exclusive or'' operation is applied
to all bits in a 32-bit value to determine parity.
Parity has important applications for error-correcting
codes such as the various Hamming codes for detecting
and correcting numbers of bit errors dependent on the
number of parity bits added.  Warren's version uses
11 operations for parity of a single 32-bit value;
a straightforward SWAR adaptation also uses 11 operations
for the parity of a set of 32-bit fields.

Figure \ref{ID-parity} shows the IDISA parity implementation
with only 5 operations required.  This represents slightly
more than a 2X improvement.  The improvement is less than
the 3X seen in other cases because one of the operands need
not be modified before applying the exclusive-or operation.
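The behavior of the \verb#xor<h,l># chain of Figure \ref{ID-parity} can be modeled in plain Python (an illustrative simulation; in this model each step leaves the per-field result in the low half of its field):

```python
def xor_hl(y, n, total=32):
    """Model of simd<n>::xor<h,l>(y, y): within every n-bit field,
    xor the high and low halves, leaving the result in the low bits."""
    half = n // 2
    for_mask = (1 << n) - 1
    out = 0
    for i in range(0, total, n):
        f = (y >> i) & for_mask
        out |= ((f >> half) ^ (f & ((1 << half) - 1))) << i
    return out

y = 0xDEADBEEF
for n in (2, 4, 8, 16, 32):
    y = xor_hl(y, n)
assert y == bin(0xDEADBEEF).count('1') % 2   # parity of the word
```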

Using the ``or'' logical operation, Figure \ref{HD-clp2} shows Warren's
code for the least power-of-2 ceiling of a 32-bit value.  The
technique is to employ inductive doubling to fill in 1 bits at
all bit positions after the first 1 bit in the original value, so as to
first produce a value of the form $2^n-1$.  This is then incremented
to determine the power-of-2 ceiling.  This function and
its straightforward adaptation for SWAR application on a
set of 32-bit fields require 12 operations.   The
corresponding IDISA implementation of Figure \ref{ID-clp2}
requires 7 operations, just under a 2X improvement.
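Warren's reference algorithm of Figure \ref{HD-clp2}, transcribed to Python for a single 32-bit value:

```python
def clp2(x):
    """Least power-of-2 ceiling of a 32-bit value (Warren's method)."""
    x -= 1
    for s in (1, 2, 4, 8, 16):
        x |= x >> s          # propagate the leading 1 bit downward
    return (x + 1) & 0xFFFFFFFF

assert clp2(1) == 1
assert clp2(5) == 8
assert clp2(64) == 64
assert clp2(65) == 128
```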

\subsection{Packed DNA Representation}

DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented as packed
sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
allow these packed representations to be quickly computed
from byte-oriented string values by two steps of inductive
halving.   Similarly, conversion back to string form
can use two steps of inductive merging.  Without direct
support for these pack and merge operations, the SWAR
implementations of these conversions require the cost
of explicit masking and shifting in combination with
the 16-bit to 8-bit packing and 8-bit to 16-bit
merging operations supported by existing SWAR facilities
on commodity processors.
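A sketch of the packed representation itself (the particular 2-bit code assignment below is an arbitrary illustrative choice, not taken from the paper):

```python
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}   # assumed encoding
BASES = 'ACGT'

def pack_dna(seq):
    """Pack a nucleotide string into 2-bit fields, first base highest."""
    packed = 0
    for ch in seq:
        packed = (packed << 2) | CODE[ch]
    return packed

def unpack_dna(packed, length):
    """Recover the string form from the packed 2-bit representation."""
    return ''.join(BASES[(packed >> (2 * (length - 1 - i))) & 3]
                   for i in range(length))

assert unpack_dna(pack_dna('GATTACA'), 7) == 'GATTACA'
```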
\subsection{String/Decimal/Integer Conversion}
\begin{figure}
\begin{center}\small
\begin{verbatim}
b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F);
b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF);
b = (b & 0x0000FFFF) + 10000 * (b >> 16);
\end{verbatim}
\end{center}
\caption{BCD to Integer Reference Algorithm}
\label{BCD2int}
\end{figure}

\begin{figure}
\begin{center}\small
\begin{verbatim}
c10 = simd<16>::const(10);
c100 = simd<16>::const(100);
c10000 = simd<32>::const(10000);
b = simd<8>::add<x,l>(simd<8>::mult<h,x>(d, c10), d);
b = simd<16>::add<x,l>(simd<16>::mult<h,x>(b, c100), b);
b = simd<32>::add<x,l>(simd<32>::mult<h,x>(b, c10000), b);
\end{verbatim}
\end{center}
\caption{IDISA BCD to Integer Implementation}
\label{ID-BCD2int}
\end{figure}

Just as DNA sequences represent an important use case for
SWAR operations on 2-bit fields, packed sequences of
decimal or hexadecimal digits represent a common use case
for 4-bit fields.  These representations can be used
both as an intermediate form in numeric string to integer
conversion and as a direct representation for
packed binary coded decimal.

Figure \ref{BCD2int} shows a three-step inductive
doubling implementation for conversion of 32-bit packed BCD
values to integer form.  The 32-bit value consists
of 8 4-bit decimal digits.  Pairs of digits are
first combined by multiplying the higher digit
of the pair by 10 and adding.  Pairs of these
two-digit results are then further combined by
multiplying the value of the higher of the two-digit
results by 100 and adding.  The final step is
to combine four-digit results by multiplying the
higher one by 10000 and adding.  Overall, 20
operations are required for this implementation,
as well as for the corresponding SWAR implementation
for sets of 32-bit fields.  Preloading of 6 constants
into registers for repeated use can reduce the number of
operations to 14 at the cost of significant register
pressure.
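The reference reduction of Figure \ref{BCD2int} behaves as follows in Python (note that each step must operate on the result \verb:b: of the previous step):

```python
def bcd_to_int(d):
    """Convert a 32-bit packed BCD value (8 decimal digits) to binary."""
    b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)    # digit pairs
    b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF)   # 4-digit groups
    return (b & 0x0000FFFF) + 10000 * (b >> 16)            # full value

assert bcd_to_int(0x12345678) == 12345678
assert bcd_to_int(0x00000042) == 42
```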

The IDISA implementation of this algorithm is shown
in Figure \ref{ID-BCD2int}.  This implementation
shows an interesting variation in the use of
half-operand modifiers, with only one operand
of each of the addition and multiplication operations
modified at each level.  Overall, this implementation
requires 9 operations, or 6 operations with 3
preloaded constants.  This represents more than a 2X
reduction in instruction count as well as a 2X reduction
in register pressure.


\subsection{Systematic Support for Horizontal SIMD Operations}

In SIMD parlance, {\em horizontal} operations are
operations which combine values from two or more fields
of the same register, in contrast to the normal
{\em vertical} operations which combine corresponding
fields of different registers.  Horizontal operations
can be found that combine two (e.g., \verb:haddpd: on SSE3),
four (e.g., \verb:si_orx: on SPU), eight (e.g., \verb:psadbw: on SSE)
or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
horizontal operations have a vertical component as well.
For example, \verb:psadbw: first forms the absolute value of
the difference of eight corresponding byte fields before
performing horizontal add of the eight values, while
\verb:vsum4ubs: on Altivec performs horizontal add of sets of
four unsigned 8-bit fields within one register
and then combines the result horizontally with
corresponding 32-bit fields of a second register.

The space of potential horizontal operations thus has
many dimensions, including not only the particular
combining operation and the operand field width, but
also the number of fields being combined, whether a
vertical combination is applied and whether it is applied
before or after the horizontal operation, and what the
nature of the vertical combining operation is.
Within this space, commodity SIMD architectures tend
to support only a very few combinations, without any
particular attempt at systematic support for horizontal
operations in general.
1306
1307In contrast to this {\em ad hoc} support on commodity
1308processors, IDISA offers a completely systematic treatment
1309of horizontal operations without any special features beyond
1310the inductive doubling features already described.
1311In the simplest case, any vertical operation
1312\verb#simd<n>::F# on $n$-bit fields gives rise to
1313an immediate horizontal operation
1314\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
1315pairs of $n/2$ bit fields.
1316For example, \verb#simd<16>::add<h,l># adds values
1317in adjacent 8 bit fields to produce 16 bit results,
1318while \verb#simd<32>::min<h,l># can produce the
1319minimum value of adjacent 16-bit fields.
1320Thus any binary horizontal operation can be implemented
1321in a single IDISA instruction making use of the \verb:<h,l>:
1322operand modifier combination.
1323
1324Horizontal combinations of four adjacent fields can also be
1325realized in a general way through two steps of inductive
1326doubling.  For example, consider the or-across operation \verb:si_orx:
1327of the SPU, that performs a logical or operation
1328on four 32-bit fields.  This four field combination
1329can easily be implemented with the following two operations.
1330%\begin{singlespace}
1331\begin{verbatim}
1332t = simd<64>::or<h,l>(x, x)
1333t = simd<128>::or<h,l>(t, t)
1334\end{verbatim}
1335%\end{singlespace}
1336
1337In general, systematic support for horizontal
1338combinations of sets of $2^h$ adjacent fields may
1339be realized through $h$ inductive double steps
1340in a similar fashion.
1341Thus, IDISA esssentially offers systematic support
1342for horizontal operations entirely through the
1343use of \verb:<h,l>: half-operand modifier
1344combinations.
1345
1346Systematic support for general horizontal operations
1347under IDISA also creates opportunity for a design tradeoff:
1348offsetting the circuit complexity of half-operand
1349modifiers with potential elimination of dedicated
1350logic for some {/ad hoc} horizontal SIMD operations.
1351Even if legacy support for these operations is required,
1352it may be possible to provide that support through
1353software or firmware rather than a full hardware
1354implementation.  Evaluation of these possibilities
1355in the context of particular architectures is a potential
1356area for further work.
1357
1358
1359\section{Implementation}
1360
1361We have carried implementation work for IDISA in three
1362ways.  First, we have constructed libraries that
1363implement the IDISA instructions by template and/or macro
1364expansion for each of MMX, SSE, Altivec, and SPU platforms.
1365Second, we have developed a model implementation
1366involving a modified operand fetch component
1367of a pipelined SIMD processor.  Third, we have written
1368and evaluated Verilog HDL description of this model
1369implementation.
1370
1371\subsection{IDISA Libraries}
1372
1373Implementation of IDISA instructions using template
1374and macro libraries has been useful in developing
1375and assessing the correctness of many of the algorithms
1376presented here.  Although these implementations do not
1377deliver the performance benefits associated with
1378direct hardware implementation of IDISA, they
1379have been quite useful in providing a practical means
1380for portable implementation of parallel bit stream
1381algorithms on multiple SWAR architectures.  However,
1382one additional facility has also proven necessary for
1383portability of parallel bit stream algorithms across
1384big-endian and little-endian architectures: the
1385notion of shift-forward and shift-back operations.
1386In essence, shift forward means shift to the left
1387on little-endian systems and shift to the right on
1388big-endian systems, while shift back has the reverse
1389interpretation.  Although this concept is unrelated to
1390inductive doubling, its inclusion with the IDISA
1391libraries has provided a suitable basis for portable
1392SIMD implementations of parallel bit stream algorithms.
1393Beyond this, the IDISA libraries have the additional
1394benefit of allowing the implementation
1395of inductive doubling algorithms at a higher level
1396abstraction, without need for programmer coding of
1397the underlying shift and mask operations.
1398
1399\subsection{IDISA Model}
1400\begin{figure}[tbh]
1401\begin{center}
1402\includegraphics[width=90mm, trim= 50 350 0 50]{IDISA.pdf}
1403\caption{IDISA Block Diagram}
1404\label{pipeline-model}
1405\end{center}
1406\end{figure}
1407
1408Figure \ref{pipeline-model} shows a block diagram
1409for a pipelined SIMD processor implementing IDISA.
1410The SIMD Register File (SRF) provides a file of $R = 2^A$ 
1411registers each of width $N = 2^K$ bits. 
1412IDISA instructions identified by the Instruction Fetch
1413Unit (IFU) are forwarded for decoding to the SIMD
1414Instruction Decode Unit (SIDU).  This unit decodes
1415the instruction to produce
1416signals identifying the source and destination
1417operand registers, the half-operand modifiers, the
1418field width specification and the SIMD operation
1419to be applied.
1420
1421The SIDU supplies the source register information and the half-operand
1422modifier information to the SIMD Operand Fetch Unit (SOFU).
1423For each source operand, the SIDU provides an $A$-bit register
1424address and two 1-bit signals $h$ and $l$ indicating the value
1425of the decoded half-operand modifiers for this operand.
1426Only one of these values may be 1; both are 0 if
1427no modifier is specified.
1428In addition, the SIDU supplies decoded field width information
1429to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).
1430The SIDU also supplies decoded SIMD opcode information to SIEU and
1431a decoded $A$-bit register address for the destination register to
1432the SIMD Result Write Back Unit (SRWBU).
1433
1434The SOFU is the key component of the IDISA model that
1435differs from that found in a traditional SWAR
1436processor.  For each of the two $A$-bit source
1437register addresses, SOFU is first responsible for
1438fetching the raw operand values from the SRF.
1439Then, before supplying operand values to the
1440SIEU, the SOFU applies the half-operand modification
1441logic as specified by the $h$, $l$, and field-width
1442signals.  The possibly modified operand values are then
1443provided to the SIEU for carrying out the SIMD operations.
1444A detailed model of SOFU logic is described in the following
1445subsection.
1446
1447The SIEU differs from similar execution units in
1448current commodity processors primarily by providing
1449SIMD operations at each field width
1450$n=2^k$ for $0 \leq k \leq K$.  This involves
1451additional circuitry for field widths not supported
1452in existing processors.  For inductive doubling
1453algorithms in support of parallel bit streams,
1454the principal need is for additional circuitry to
1455support 2-bit and 4-bit field widths.  This circuity
1456is generally less complicated than that for larger
1457fields.  Support for circuitry at these width
1458has other applications as well.   For example,
1459DNA sequences are frequently represented using
1460packed sequences of 2-bit codes for the four possible
1461nucleotides\cite{}, while the need for accurate financial
1462calculation has seen a resurgence of the 4-bit
1463packed BCD format for decimal floating point \cite{}.
1464
1465When execution of the SWAR instruction is
1466completed, the result value is then provided
1467to the SRWBU to update the value stored in the
1468SRF at the address specified by the $A$-bit
1469destination operand.
1470
1471\subsection{Operand Fetch Unit Logic}
1472
1473Discussion of gate-level implementation.
1474
1475
1476\section{Conclusions}
1477
1478In considering the architectural support for
1479SIMD text processing using the method of parallel bit streams,
1480this paper has presented the principle of inductive doubling
1481and a realization of that principle in the form of
1482IDISA, a modified SWAR instruction set architecture.
1483IDISA features support for SWAR operations at all power-of-2
1484field widths, including 2-bit and 4-bit widths, in particular,
1485as well as half-operand modifiers and pack and merge operations
1486to support efficient transition between successive power-of-two
1487field widths.  The principle innovation is the notion of
1488half-operand modifiers that eliminate the cost associated
1489with the explicit mask and shift operations required for
1490such transitions.
1491
1492Several algorithms key to parallel bit stream methods
1493have been examined and shown to benefit from dramatic
1494reductions in instruction count compared to the best
1495known algorithms on existing architectures.  In the case
1496of transposition algorithms to and from parallel bit stream
1497form, the architecture has been shown to make possible
1498straightforward inductive doubling algorithms with the
1499lowest total number of operations that can be achieved by
1500any possible 3-register SWAR architecture.
1501
1502Applications of IDISA in other areas have also been
1503examined.  The support for 2-bit and 4-bit field widths
1504in SWAR processing is beneficial for packed DNA representations
1505and packed decimal representations respectively.  Additional
1506inductive doubling examples are presented and the phenomenon
1507of power-of-2 transitions discussed more broadly.
1508Most significantly, IDISA supports a fully general approach
1509to horizontal SWAR operations, offering a considerable
1510improvement over the {\em ad hoc} sets of special-purpose
1511horizontal operations found in existing SWAR instruction sets.
1512
1513An IDISA implementation model has been presented employing
1514a customized operand fetch unit to implement the half-operand
1515modifier logic.  Gate-level implementation of this unit
1516has been analyzed and showed to be quite reasonable.
1517
1518Many special-purpose operations that are found in existing
1519processors could instead be implemented through efficient
1520IDISA sequences.  These include examples such as population
1521count, count leading and/or trailing zeroes and parity.
1522They also include specialized horizontal SIMD operations.
1523Thus, design tradeoffs can be made with the potential of
1524reducing the chip area devoted to special purpose instructions
1525in favor of more general IDISA features.
1526
1527Other tradeoffs may be possible in IDISA implementation itself.
1528Full support of IDISA features to the largest field widths
1529are not necessary in many cases.   For example, a given 128-bit
1530SIMD facility may support IDISA features only up to 32-bit
1531or 64-bit fields, similar to the current Altivec and SSE
1532architectures, respectively.   It may also be possible to
1533reduce the complexity of operand fetch circuitry by a factor
1534of two by dedicating one operand to a possible high half-operand
1535modifier and the other to a possible low half-operand modifier.
1536
1537Future research may consider the extension of inductive doubling
1538support in further ways.   For example, it may be possible
1539to develop a pipelined architecture supporting two or three
1540steps of inductive doubling in a single operation.
1541
1542%\appendix
1543%\section{Appendix Title}
1544%
1545%This is the text of the appendix, if you need one.
1546
1547\acks
1548
1549This research was supported by a Discovery Grant from the
1550Natural Sciences and Engineering Research Council of Canada.
1551
1552%\bibliographystyle{plainnat}
1553\bibliographystyle{plain}
1554\bibliography{asplos094-cameron}
1555%\begin{thebibliography}{}
1556
1557%\bibitem{smith02}
1558%Smith, P. Q. reference text
1559
1560%\end{thebibliography}
1561
1562
1563\end{document}
Note: See TracBrowser for help on using the repository browser.