% asplos094-cameron.tex
% Robert D. Cameron and Dan Lin
% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos

\input epsf

\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}

\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\title{Architectural Support for SWAR Text Processing with Parallel Bit
Streams: The Inductive Doubling Principle}
%\subtitle{Subtitle Text, if any}

\authorinfo{Robert D. Cameron \and Dan Lin}
           {School of Computing Science, Simon Fraser University}
           {\tt \{cameron, lindanl\}}

\begin{abstract}
Parallel bit stream algorithms exploit the SWAR (SIMD within a
register) capabilities of commodity
processors in high-performance text processing applications such as
UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
matching.   Direct architectural support for these algorithms in future SWAR
instruction sets could further increase performance as well as simplify the
programming task. A set of simple SWAR instruction set extensions is proposed
for this purpose based on the principle of systematic support for inductive
doubling as an algorithmic technique.  These extensions are shown to
significantly reduce instruction count in core parallel bit stream algorithms,
often providing a 3X or better improvement.  The extensions are also shown to be useful
for SWAR programming in other application areas, including providing a
systematic treatment for horizontal operations.  An implementation model for
these extensions involves relatively simple circuitry added to the operand fetch
components in a pipelined processor.
\end{abstract}

\category{C.1.2}{PROCESSOR ARCHITECTURES}{Multiple Data Stream Architectures (Multiprocessors)}[Single-instruction-stream, multiple-data-stream processors (SIMD)]

\terms Design, Performance

\keywords inductive doubling, parallel bit streams, SWAR

\section{Introduction}

In the landscape of parallel computing research, finding ways to
exploit intrachip (multicore) and intraregister (SWAR) parallelism
for text processing and other non-numeric applications is particularly
challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
study \cite{Landscape} identifies the finite-state machine algorithms associated
with text processing as the hardest of the thirteen ``dwarves''
to parallelize, concluding that nothing seems to help.  The
study even speculates that applications in this area may simply be
``embarrassingly sequential,'' easy to tackle for traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.

One approach that shows some promise, however, is the
method of parallel bit streams, recently applied to
UTF-8 to UTF-16 transcoding \cite{u8u16, PPoPP08}, XML parsing \cite{CASCON08, Herdy}
and amino acid sequencing \cite{Green}.
In this method, byte-oriented character data is first transposed to eight
parallel bit streams, one for each bit position within the character code
units (bytes).  Loading bit stream data into 128-bit registers,
then, allows data from 128 consecutive code units to be represented and
processed at once.  Bitwise logic and shift operations, bit scans,
population counts and other bit-based operations are then used to carry
out the work.

In application to UTF-8 to UTF-16 transcoding, a 3X to 25X speed-up
is achieved using parallel bit stream techniques on SWAR-capable uniprocessors
employing the SSE or Altivec instruction sets \cite{PPoPP08}.
In the broader context of XML parsing, further applications of these
techniques demonstrate their utility
in delivering performance benefits through a significant portion of the
web technology stack.  In an XML statistics gathering application,
including the implementation of XML well-formedness checking, an
overall 3X to 10X performance improvement is achieved using
the parallel bit stream methods in comparison with a similarly
coded application using such well-known parsers as Expat and Xerces \cite{CASCON08}.
In an application involving transformation between different XML formats
(GML and SVG), an implementation using parallel bit stream technology
required a mere 15 cycles per byte, while a range of other technologies
required from 25 to 200 cycles per byte \cite{Herdy}.
Ongoing work is further applying the parallel bit stream methods to
parallel hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
accord with XML Schema \cite{CASCON08}.

Given these promising initial results in the application of
parallel bit stream methods, what role might architectural support play in
further enhancing this route to parallelization of text processing?
This paper addresses this question through presentation and
analysis of a constructive proposal: a set of SWAR instruction set
features based on the principle of systematic support for
inductive doubling algorithms.  Inductive doubling refers
to a general property of certain kinds of algorithms that
systematically double the values of field widths or other
data attributes with each iteration.  In essence, the goal
of the proposed features is to support such algorithms
with specific facilities to transition between successive power-of-2 field
widths.   These transitions are quite frequent in
parallel bit stream programming as well as other applications.
The specific features presented herein will be referred to
as IDISA: inductive doubling instruction set architecture.

The remainder of this paper is organized as follows.
The second section of this paper introduces IDISA and the
SWAR notation used throughout this paper.  The third
section moves on to discuss an evaluation methodology
for IDISA in comparison to two reference architectures
motivated by the SSE and Altivec instruction sets.
The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count.  Sections 5 through 7
then address the application of IDISA to core
algorithms in text processing with parallel bit streams.
The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Section 9 addresses IDISA implementation while
Section 10 concludes the paper with a summary of
results and directions for further work.

\section{Inductive Doubling Architecture}

This section presents IDISA as an idealized model for a SWAR instruction set
architecture designed specifically to support inductive doubling
algorithms.  The architecture is idealized
in the sense that we concentrate on only the necessary features
for our purpose, without enumerating the additional
operations that would be required for
SWAR applications in other domains.  The goal is to focus
on the principles of inductive doubling support in a way
that can accommodate a variety of realizations as other
design constraints are brought to bear on the overall instruction set
design.  First we introduce a simple model and notation for
SWAR operations in general and then present the four key
features of IDISA.

IDISA supports typical SWAR integer operations using a {\em
three-register model} involving two input registers
and one output register.  Each register is of size $N=2^K$ bits,
for some integer $K$.
Typical values of $K$ for commodity processors include $K=6$ for
the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
the 128-bit registers of SSE and Altivec technology and $K=8$ for
the upcoming Intel AVX technology.
The registers may be
partitioned into $N/n$ fields of width $n=2^k$ bits for some values
of $k \leq K$.   Typical values of $k$ used on commodity processors
include $k = 3$ for SWAR operations on 8-bit fields (bytes),
$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
on 32-bit fields.  Whenever a register $r$ is partitioned into $n$-bit
fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$.
Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
register $r$, using big-endian numbering.

Let \verb:simd<n>: represent the class of SWAR operations defined
on fields of size $n$ using C++ template syntax.  Given a
binary function $F_n$ on $n$-bit fields, we denote the SWAR
version of this operation as \verb#simd<n>::F#.  Given two input
registers \verb:a: and \verb:b: holding values $a$ and $b$,
respectively, the operation \verb#r=simd<n>::F(a,b)# stores
the value $r$ in the output register \verb:r: as determined by
the simultaneous calculation of individual field values in
accord with the following equation.
\begin{eqnarray*}
r_i & = & F_n(a_i, b_i)
\end{eqnarray*}

For example, addition (\verb:add:), subtraction (\verb:sub:) and shift left
logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
integers as follows.
\begin{eqnarray*}
\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
\end{eqnarray*}

The Altivec instruction set includes each of these operations
for 8, 16 and 32-bit fields directly following the three-register
model.   The SSE set uses a two-register model with the result
being copied back to one of the input operands.   However, the
C language intrinsics commonly used to access these instructions
reflect the three-register model.  The SSE set extends these
operations to include operations on 64-bit fields, but
constrains the shift instructions, requiring that all fields shift by the same amount.
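
Where a given field width is not directly supported, the core operations can also be emulated in software.  As an illustration (our own sketch, not part of any instruction set), the following C function models \verb:add: on 8-bit fields of a 32-bit word using the standard carry-masking technique:

```c
#include <stdint.h>

/* Illustrative software emulation of simd<8>::add on a 32-bit word
   viewed as four 8-bit fields.  The high bit of each field is masked
   off so that carries cannot cross field boundaries during the wide
   addition, then restored by exclusive-or. */
uint32_t simd8_add(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F);
    return low ^ ((a ^ b) & 0x80808080);
}
```

Each field wraps independently modulo $2^8$, exactly as the definition of $\mbox{add}_n$ above requires.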

Given these definitions and notation, we now present
the four key elements of an inductive doubling architecture.
The first is a definition of a core set of binary functions
on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
The second is a set of {\em half-operand modifiers} that allow
the inductive processing of fields of size $2n$ in terms of
combinations of $n$-bit values selected from the fields.
The third is the definition of packing operations that compress
two consecutive registers of $n$-bit values into a single
register of $n/2$-bit values.  The fourth is the definition
of merging operations that produce a set of $2n$-bit fields
by concatenating corresponding $n$-bit fields from two
parallel registers.  Each of these features is described below.

For the purpose of direct and efficient support for inductive
doubling algorithms, the provision of
a core set of operations at field widths of 2 and 4 as
well as the more traditional field widths of 8, 16 and 32
is key.  In essence, inductive doubling algorithms
work by establishing some base property at either single-bit
or 2-bit fields.  Each iteration of the algorithm then
goes on to establish the property for the next power-of-2
field width.  In order for this inductive step to be
most conveniently and efficiently expressed, the
core operations needed for the step should be available
at each field width.  In the case of work with parallel
bit streams, the operations \verb:add:, \verb:sub:,
\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
(rotate left) comprise the core.  In other domains,
additional operations may be usefully included in the
core depending on the work that needs to be performed
at each inductive doubling level.

Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
also includes fields of width 1.  These are included for
logical consistency, but are easily implemented by mapping
directly to appropriate bitwise logic operations, which
we assume are also available.  For example,
\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
bitwise exclusive-or operation.

The second key facility of the inductive doubling architecture
is the potential application of half-operand modifiers to
the fields of either or both of the operands of a SWAR
operation.  These modifiers select either the
low $n/2$
bits of each $n$-bit field (modifier ``\verb:l:'') or the
high $n/2$ bits (modifier ``\verb:h:'').  When required,
the modifier ``\verb:x:'' means that the full $n$ bits
should be used, unmodified.  The semantics of these
modifiers are given by the following equations.
\begin{eqnarray*}
l(r_n) & = & r_n \bmod 2^{n/2} \\
h(r_n) & = & r_n / 2^{n/2} \\
x(r_n) & = & r_n
\end{eqnarray*}

In our notation, the half-operand modifiers are
specified as optional template (compile-time) parameters
for each of the binary functions.  Thus,
\verb#simd<4>::add<h,l>(a,b)# is an operation which adds
the 2-bit quantity found in the high 2 bits of each 4-bit field
of its first operand (\verb:a:)
together with the corresponding 2-bit quantity found in the
low 2 bits of its second operand (\verb:b:).
In general, the purpose of the half-operand modifiers
in support of inductive doubling is to allow the processing
of $n$-bit fields to be easily expressed in terms of
combinations of the results determined by processing
$n/2$-bit fields.
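
As a concrete software model (our own illustration, not a hardware definition), \verb#simd<4>::add<h,l># can be expressed in C over a 32-bit word of eight 4-bit fields:

```c
#include <stdint.h>

/* Software model of simd<4>::add<h,l>(a, b) on a 32-bit word of
   eight 4-bit fields: the high 2 bits of each field of a are added
   to the low 2 bits of the corresponding field of b.  Each sum is
   at most 3 + 3 = 6, so no carry crosses a 4-bit field boundary. */
uint32_t simd4_add_h_l(uint32_t a, uint32_t b) {
    uint32_t h = (a >> 2) & 0x33333333;  /* h modifier: high half of each field */
    uint32_t l = b & 0x33333333;         /* l modifier: low half of each field  */
    return h + l;                        /* simd<4>::add on the modified operands */
}
```

This is precisely the second step of the population count algorithm discussed later, with \verb:a: and \verb:b: both the running count register.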

The third facility of the inductive doubling architecture
is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
The field values of \verb#r=simd<n>::pack(a,b)# are
defined by the following equations.
\begin{eqnarray*}
r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
\end{eqnarray*}
Here conv is a function which performs conversion of an $n$-bit
value to an $n/2$-bit value by signed saturation (although
conversion by unsigned saturation would also suit our purpose).

Half-operand modifiers may also be used with the pack
operations.  Thus packing with conversion by masking off all
but the low $n/2$ bits of each field may
be performed using the operation \verb#simd<n>::pack<l,l>#.
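
For illustration, a software model (ours, not a hardware specification) of \verb#simd<8>::pack<l,l># on 32-bit registers makes the masking semantics explicit:

```c
#include <stdint.h>

/* Software model of simd<8>::pack<l,l>(a, b) on 32-bit registers:
   the low nybble of each 8-bit field of a, then of b, is packed into
   one register of eight 4-bit fields.  Fields are numbered
   big-endian, so a's fields land in the high half of the result. */
uint32_t simd8_pack_l_l(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 4; i++)               /* a_8[0..3] -> r_4[0..3] */
        r = (r << 4) | ((a >> (24 - 8 * i)) & 0x0F);
    for (int i = 0; i < 4; i++)               /* b_8[0..3] -> r_4[4..7] */
        r = (r << 4) | ((b >> (24 - 8 * i)) & 0x0F);
    return r;
}
```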

The final facility of the inductive doubling architecture is
a set of merging operations that produce $2n$-bit fields
by concatenating corresponding $n$-bit fields from the
operand registers.   The
operations \verb#r=simd<n>::mergeh(a,b)# and
\verb#s=simd<n>::mergel(a,b)#
are defined by the following equations.
\begin{eqnarray*}
r_{2n}[i] & = & a[i] \times 2^n + b[i] \\
s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)]
\end{eqnarray*}

Both SSE and Altivec provide versions of pack and merge operations
for certain field widths.  The pack operations are provided
with operands having 16-bit or 32-bit fields on each platform, although
with some variation in how conversion is carried out.
The merge operations are provided at 8-bit, 16-bit and 32-bit
field widths on both architectures and also at the 64-bit level
on SSE.
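
For illustration, a C sketch (our own, following the merge equations with big-endian field numbering) of \verb#simd<8>::mergeh# reduced to 32-bit registers:

```c
#include <stdint.h>

/* Software model of simd<8>::mergeh(a, b) on 32-bit registers:
   corresponding bytes from the high halves of a and b are
   concatenated into 16-bit fields, i.e. r_16[i] = a_8[i]*256 + b_8[i]
   for i = 0, 1 under big-endian field numbering. */
uint32_t simd8_mergeh(uint32_t a, uint32_t b) {
    uint32_t a0 = (a >> 24) & 0xFF, a1 = (a >> 16) & 0xFF;
    uint32_t b0 = (b >> 24) & 0xFF, b1 = (b >> 16) & 0xFF;
    return (a0 << 24) | (b0 << 16) | (a1 << 8) | b1;
}
```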

This completes the description of IDISA.  As described, many of the
features are already available with the SWAR facilities of
existing commodity processors.  The extensions enumerated
here are relatively straightforward.  The innovation
is to specifically tackle the design of facilities to
offer systematic support for transitions between power-of-2 field widths.
As we shall show in the remainder of this paper, these facilities
can dramatically reduce instruction count in core parallel bit
stream algorithms, with a factor of 3 reduction being typical.

\section{Evaluation Methodology}

IDISA represents a set of instruction set features that
could potentially be added to any SWAR processor.  The goal
in this paper is to evaluate these features independent
of artifacts that may be due to any particular realization,
while still considering realistic models based on existing
commodity instruction set architectures.  For the purpose
of IDISA evaluation, then, we define two reference architectures.
For concreteness, IDISA and the two reference architectures will
each be considered as 128-bit processors employing the
three-register SWAR model defined in the previous section.

Reference architecture A (RefA) consists of a limited register
processor providing a set of core binary operations defined
for 8, 16, 32 and 64 bit fields.  The core binary operations
will be assumed to be those defined by the SSE instruction
set for 16-bit fields.  In addition, we assume that
shift immediate operations for each field width exist,
e.g., \verb#simd<8>::srli<1>(x)# for a right logical
shift of each 8-bit field by 1.  We also assume that
a constant load operation \verb#simd::constant<n>(c)#
loads the constant value $c$ into each $n$-bit field.
The pack and merge facilities of SSE will also be assumed.

Reference architecture B (RefB) consists of a register-rich
processor incorporating all the operations of reference
architecture A as well as the following additional facilities
inspired by the Altivec instruction set.
For each of the 8, 16, 32 and 64 bit widths, a binary rotate left
logical instruction \verb#simd<n>::rotl(a,b)#  rotates each field
of $a$ by the rotation count in the corresponding field of $b$.
A three-input \verb#simd<1>::if(a,b,c)# bitwise logical
operator implements the logic $r=a \wedge b \vee \neg a \wedge c$, patterned
after the Altivec \verb:vec_sel: operation.  Finally,
a \verb#simd<8>::permute(a,b,c)# selects an arbitrary
permutation of bytes from the concatenation of $a$ and $b$ based
on the set of indices in $c$.

Two versions of IDISA are assessed against these reference
architectures as follows.  IDISA-A has all the facilities
of RefA extended with half-operand modifiers and all core
operations at field widths of 2, 4 and 128.  IDISA-B is
similarly defined and extended based on RefB.  Algorithms
for both RefA and IDISA-A are assessed assuming that
any required constants must be loaded as needed; this
reflects the limited register assumption.  On the other hand,
assessment for both RefB and IDISA-B will make the
assumption that sufficiently many registers exist that
constants can be kept preloaded.

In each case, the processors are assumed to be
pipelined processors with a throughput of one SWAR instruction
each processor cycle for straight-line code free of memory
access.  This assumption makes for
straightforward performance evaluation based on instruction
count for straight-line computational kernels.
Furthermore, the assumption also eliminates artifacts due to
possibly different latencies in reference and IDISA architectures.
Because the same assumption is made for reference and IDISA
architectures, determination of the additional circuit
complexity due to IDISA features is unaffected by the assumption.

In the remainder of this paper, then, IDISA-A and IDISA-B
models are evaluated against their respective reference
architectures on straight-line computational kernels
used in parallel bit stream processing and other applications.
As XML and other sequential text processing applications
tend to use memory in an efficient streaming model, the
applications tend to be compute-bound rather than IO-bound.
Thus, the focus on computational kernels addresses the
primary concern for performance improvement of these applications.

The additional circuit complexity to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section.  That discussion
will focus primarily on the complexity of implementing
half-operand modifier logic, but will also address
the extension of the core operations to operate on
2-bit, 4-bit and 128-bit fields.

\section{Population Count}

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
\end{verbatim}
\end{center}
\caption{Population Count Reference Algorithm}
\label{HD-pop}
\end{figure}

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
c = simd<2>::add<h,l>(x, x);
c = simd<4>::add<h,l>(c, c);
c = simd<8>::add<h,l>(c, c);
c = simd<16>::add<h,l>(c, c);
c = simd<32>::add<h,l>(c, c);
\end{verbatim}
\end{center}
\caption{IDISA Population Count}
\label{inductivepopcount}
\end{figure}

As an initial example to illustrate the principle of inductive doubling
in practice, consider the problem of {\em population count}: determining
the number of one bits within a particular bit field.  It is important
enough for such operations as calculating Hamming distance to be included
as a built-in instruction
on some processors.  For example, the SPU of the Cell Broadband Engine
has a SWAR population count instruction \verb:si_cntb: for simultaneously
determining the
number of 1 bits within each byte of a 16-byte register.
In text processing with parallel bit streams, population count has direct
application to keeping track of line numbers for error reporting, for example.
Given a bit block identifying the positions of newline characters within
a block of characters being processed, the population count of the
bit block can be used to efficiently and conveniently update
the line number upon completion of block processing.

Figure \ref{HD-pop} presents a traditional divide-and-conquer
implementation for a 32-bit integer {\tt x} adapted from
Warren \cite{HackersDelight}, while
Figure \ref{inductivepopcount} shows the corresponding IDISA
implementation for a vector of 32-bit fields.  Each implementation employs
five steps of inductive doubling to produce population counts
within 32-bit fields.  The traditional implementation employs
explicit masking and shifting operations, while these
operations are implicit within the semantics of the inductive
doubling instructions shown in Figure \ref{inductivepopcount}.
In each implementation, the first step determines the
population counts within 2-bit fields
by adding the high bit of each such field to the low bit
to produce a set of 2-bit counts in {\tt c}.
In the second step, the counts within 4-bit fields of {\tt c} are determined
by adding the counts of the corresponding high and low 2-bit subfields.
Continuing in this fashion,
the final population counts within 32-bit fields are determined in five steps.
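
The equivalence of the two implementations can be checked with a plain C model (our own sketch; the masks make explicit the half-operand selection that the IDISA instructions perform implicitly):

```c
#include <stdint.h>

/* Software model of the IDISA population count of a 32-bit register:
   each call simulates simd<n>::add<h,l>(c, c) by explicitly selecting
   the high and low halves of each n-bit field before adding. */
static uint32_t add_h_l(uint32_t x, int half, uint32_t lowmask) {
    return ((x >> half) & lowmask) + (x & lowmask);
}

uint32_t idisa_popcount32(uint32_t x) {
    uint32_t c;
    c = add_h_l(x, 1, 0x55555555);   /* simd<2>::add<h,l>(x, x)  */
    c = add_h_l(c, 2, 0x33333333);   /* simd<4>::add<h,l>(c, c)  */
    c = add_h_l(c, 4, 0x0F0F0F0F);   /* simd<8>::add<h,l>(c, c)  */
    c = add_h_l(c, 8, 0x00FF00FF);   /* simd<16>::add<h,l>(c, c) */
    c = add_h_l(c, 16, 0x0000FFFF);  /* simd<32>::add<h,l>(c, c) */
    return c;
}
```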

With the substitution of longer mask constants replicated for four
32-bit fields, the implementation of Figure \ref{HD-pop} can be
directly adapted to SWAR processing using 128-bit registers.
Each binary operator is replaced by a corresponding binary
SWAR operation.   Without the IDISA features, a
straightforward RefA implementation of population count for
32-bit fields thus employs 10 operations to load or generate
mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
total of 30 operations to complete the task.   Employing
an optimization identified by Warren, this can be reduced to
20 operations, 5 of which are required to generate mask constants.
At the cost of register pressure, it is possible that these constants
could be kept preloaded in long vector processing.  In accord
with our evaluation model, the RefB cost is thus 15 operations.
As the IDISA implementation requires no constants at all,
both the IDISA-A and IDISA-B cost is 5 operations.
At our assumed one CPU cycle per instruction model, IDISA-A
offers a 4X improvement over RefA, while IDISA-B offers a 3X
improvement over its comparator.

The pattern illustrated by population count is typical.
An inductive doubling algorithm of $n$ steps typically applies
mask or shift operations at each step for each of the
two operands being combined in the step.  In general,
the mask constants shown in Figure \ref{HD-pop} recur; these
are termed ``magic masks'' by Knuth \cite{v4pf1a}.
If the algorithm employs a single operation at each step, then a total
of $3n$ operations are required in a RefB implementation,
and possibly $4n$ for a RefA implementation including the
cost of loading masks.  IDISA-A and IDISA-B implementations
typically eliminate the explicit mask and shift operations
through appropriate half-operand modifiers, reducing the
total instruction count to $n$.   Thus a 3X to 4X improvement
is obtained in these cases.

\section{Transposition to Parallel Bit Streams}

In this section, we consider the first major
application of IDISA: transposition of byte stream data to parallel bit stream
form.   Of course, this operation is critical to the
method of parallel bit streams and all applications
of the method can benefit from a highly efficient
transposition process.  Before considering how
IDISA supports this
transposition process, however, we first consider
algorithms on existing architectures.  Two algorithms
are presented; the best of these requires 72
SWAR operations under the RefB model to perform
transposition of eight serial registers of byte stream data into
eight parallel registers of bit stream data.

We then go on to show how the transposition problem
can be solved using IDISA-A or IDISA-B
with a mere 24 three-register SWAR operations.  We also show
that this is optimal for any three-register instruction set model.

\begin{figure}[tbh]
\begin{center}
\includegraphics[width=87mm, trim= 40 50 0 50]{S2P_IO.pdf}
\caption{Serial to Parallel Transposition}
\label{s2p-spec}
\end{center}
\end{figure}

Figure \ref{s2p-spec} illustrates the input-output requirements of
the transposition problem.  We assume that inputs and
outputs are each SWAR registers of size $N=2^K$ bits.
The input consists of $N$ bytes of serial byte data,
stored consecutively in eight SWAR registers each holding
$N/8$ bytes.  The output consists of eight parallel
registers, one each for the eight individual bit positions
within a byte.  Upon completion of the transposition process,
each output register is to hold the $N$ bits corresponding
to the selected bit position in the sequence of $N$ input
bytes.
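
This requirement can be captured by a miniature reference implementation in C (ours, serving purely as a specification, with $N = 8$ and bit position 0 denoting the high bit of each byte):

```c
#include <stdint.h>

/* Non-SIMD reference specification of transposition for a miniature
   N = 8 case: output register j collects bit position j of each of
   the 8 input bytes, in input order (bit position 0 = high bit). */
void transpose8(const uint8_t in[8], uint8_t out[8]) {
    for (int j = 0; j < 8; j++) {
        uint8_t stream = 0;
        for (int i = 0; i < 8; i++)
            stream = (uint8_t)((stream << 1) | ((in[i] >> (7 - j)) & 1));
        out[j] = stream;
    }
}
```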

\subsection{Bit Gathering Algorithm}

% \begin{figure}[tbh]
% \begin{center}
% \includegraphics[width=100mm, trim= 50 100 0 0]{S2P.pdf}
% \caption{Serial to Parallel Transposition Using Bit-Gathering}
% \label{gather}
% \end{center}
% \end{figure}

One straightforward algorithm for implementing the transposition process
takes advantage of SWAR bit gathering operations that exist
on some architectures.  This operation gathers one bit per byte
from a particular position within each byte of a register.
For example, the {\tt pmovmskb} operation of the Intel
SSE instruction set forms a 16-bit mask
consisting of the high bit of each byte.  Similarly, the
{\tt si\_gbb} operation of the synergistic processing units of the
Cell Broadband Engine gathers together the low bit of each byte.
% Figure \ref{gather} illustrates the
% bit gathering process.

Using bit gathering, each bit stream of output is assembled 16 positions
at a time.  Bits from the input register must be shifted into
position, the gather operation performed and the result inserted
into position in the output register.  For the 8 streams, this
requires at least 22 operations for 16 positions, or 176 operations
to complete the transposition task.
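
For illustration, the following C sketch (our own model) shows a {\tt pmovmskb}-style gather reduced to a 32-bit word of four bytes, with mask bit $i$ taken from the high bit of byte $i$ (byte 0 least significant, as in the SSE convention):

```c
#include <stdint.h>

/* Model of bit gathering in the style of SSE pmovmskb, reduced to a
   32-bit word: mask bit i receives the high bit of byte i, where
   byte 0 is the least significant byte. */
uint32_t movmskb4(uint32_t x) {
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++)
        mask |= ((x >> (8 * i + 7)) & 1u) << i;
    return mask;
}
```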

\subsection{BytePack Algorithm}

A more efficient transposition algorithm on commodity
SWAR architectures involves three
stages of binary division transformation.  This is similar
to the three stage bit matrix inversion described by
Warren  \cite{HackersDelight}, although modified to use SWAR operations.
In each stage, input streams are divided into two half-length output streams.
The first stage separates the bits at even numbered positions from those
at odd numbered positions.  The two output streams from the first
stage are then further divided in the second stage.
The stream comprising even numbered bits from the original byte stream
divides into one stream consisting of bits from positions 0 and 4 of each
byte in the original stream and a second stream consisting of bits
from positions 2 and 6 of each original byte.  The stream of bits from
odd positions is similarly divided into streams for bits from each of the
positions 1 and 5 and bits from positions 3 and 7.
Finally, each of the four streams resulting from the second stage is
divided into the desired individual bit streams in the third stage.

% \begin{figure}[tbh]
% \begin{center}\small
% \begin{verbatim}
% s0h = simd<16>::srli<8>(s0);
% s0l = simd_and(s0, simd<16>::constant(0x00FF));
% s1h = simd<16>::srli<8>(s1);
% s1l = simd_and(s1, simd<16>::constant(0x00FF));
% t0 = simd<16>::pack(s0h, s1h);
% t1 = simd<16>::pack(s0l, s1l);
% t0_l1 = simd<16>::slli<1>(t0);
% t1_r1 = simd<16>::srli<1>(t1);
% mask = simd<8>::constant(0xAA);
% p0 = simd_or(simd_and(t0, mask), simd_andc(t1_r1, mask));
% p1 = simd_or(simd_and(t0_l1, mask), simd_andc(t1, mask));
% \end{verbatim}
% \end{center}
% \caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
% \label{s2pstep}
% \end{figure}

\begin{figure}[tbh]
\begin{center}\small
\begin{verbatim}
even = {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30};
odd = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
mask = simd<8>::constant(0xAA);
t0 = simd<8>::permute(s0, s1, even);
t1 = simd<8>::permute(s0, s1, odd);
p0 = simd_if(mask, t0, simd<16>::srli<1>(t1));
p1 = simd_if(mask, simd<16>::slli<1>(t0), t1);
\end{verbatim}
\end{center}
\caption{RefB Transposition Step in BytePack Stage 1}
\label{s2pstep}
\end{figure}

The binary division transformations are accomplished in each stage
using byte packing, shifting and masking.  In each stage, a
transposition step combines each pair of serial input registers to
produce a pair of parallel output registers.
Figure \ref{s2pstep} shows a stage 1 transposition step in a
RefB implementation.  Using the permute facility, the even
and odd bytes, respectively, from two serial input registers
\verb:s0: and \verb:s1: are packed into temporary registers
\verb:t0: and \verb:t1:.  The even and odd bits are then
separated into two parallel output registers \verb:p0: and \verb:p1:
by selecting alternating bits using a mask.  This step is applied
four times in stage 1; stages 2 and 3 also consist of four applications
of a similar step with different shift and mask constants.
Overall, 6 operations per step are required, yielding a total
of 72 operations to transpose 128 bytes to parallel bit stream
form in the RefB implementation.

In a RefA implementation, byte packing may also be achieved
using \verb#simd<16>::pack# with 4 additional operations to
prepare operands.  Essentially, the RefB implementation
uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
each \verb#simd_if#.
Assuming that mask loads are only needed once per 128 bytes,
a total of 148 operations are required in the RefA implementation.

\subsection{Inductive Halving Algorithm}

Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
many fewer operations than the BytePack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
streams into streams of bit-pairs in a second stage and finally
dividing the bit-pair streams into bit streams in the third stage.
683Figure \ref{halvingstep} shows one step in stage 1 of the inductive
684halving algorithm, comprising just two IDISA-A operations.
685The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
686from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
687the low nybble of each byte.  As in the BytePack algorithm, this step is
688applied 4 times in stage 1, for a total of 8 operations.
690Stage 2 of the inductive halving algorithm reduces nybble streams
691to streams of bit pairs.  The basic step in this algorithm consists
692of one \verb#simd<4>::pack<h,h># operation to extract the high pair
693of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
694low pair of each nybble.  Four applications of this step complete stage 2.
699p0 = simd<8>::pack<h,h>(s0, s1);
700p1 = simd<8>::pack<l,l>(s0, s1);
702\caption{Step in Inductive Halving Algorithm Stage 1}
707Stage 3 similarly uses four applications of a step that uses a
708\verb#simd<2>::pack<h,h># operation to extract the high bit of
709each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
each pair.  Under either the IDISA-A or IDISA-B model,
711the complete algorithm to transform eight serial
712byte registers s0 through s7 into the eight parallel bit stream
713registers bit0 through bit7 requires a mere 24 instructions per 128
714input bytes.
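As a cross-check, the three-stage structure can be exercised in a small scalar model.  The following Python sketch (hypothetical; not IDISA itself) represents each register as a 16-bit integer with bit 0 most significant, two bytes per register, implements the \verb#pack<h,h># and \verb#pack<l,l># semantics directly, and transposes sixteen bytes into eight bit streams using exactly 24 pack operations:

```python
# Hypothetical scalar model of the inductive halving transposition.
# A register is a W-bit Python int; bit 0 is the most significant bit,
# matching the big-endian field numbering used in the text.
W = 16  # model register width: two bytes per register, 16 bytes in all

def fields(x, n):
    """The n-bit fields of a W-bit value, most significant first."""
    return [(x >> (W - n * (i + 1))) & ((1 << n) - 1) for i in range(W // n)]

def pack(n, half, a, b):
    """Model of simd<n>::pack<half,half>(a, b): keep the chosen n/2-bit
    half of every n-bit field of a, then of b."""
    out = 0
    for f in fields(a, n) + fields(b, n):
        piece = f >> (n // 2) if half == 'h' else f & ((1 << (n // 2)) - 1)
        out = (out << (n // 2)) | piece
    return out

def transpose(s):
    """Eight serial byte registers -> eight parallel bit stream registers."""
    # Stage 1: byte streams -> high and low nybble streams
    hn = [pack(8, 'h', s[2 * i], s[2 * i + 1]) for i in range(4)]
    ln = [pack(8, 'l', s[2 * i], s[2 * i + 1]) for i in range(4)]
    # Stage 2: nybble streams -> bit-pair streams
    hh = [pack(4, 'h', hn[0], hn[1]), pack(4, 'h', hn[2], hn[3])]
    hl = [pack(4, 'l', hn[0], hn[1]), pack(4, 'l', hn[2], hn[3])]
    lh = [pack(4, 'h', ln[0], ln[1]), pack(4, 'h', ln[2], ln[3])]
    ll = [pack(4, 'l', ln[0], ln[1]), pack(4, 'l', ln[2], ln[3])]
    # Stage 3: bit-pair streams -> bit streams bit0 .. bit7
    return [pack(2, 'h', hh[0], hh[1]), pack(2, 'l', hh[0], hh[1]),
            pack(2, 'h', hl[0], hl[1]), pack(2, 'l', hl[0], hl[1]),
            pack(2, 'h', lh[0], lh[1]), pack(2, 'l', lh[0], lh[1]),
            pack(2, 'h', ll[0], ll[1]), pack(2, 'l', ll[0], ll[1])]
```

The model makes 8 pack calls in each of the three stages, mirroring the operation count claimed for the full 128-byte algorithm.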
716% \begin{figure}[tbh]
717% \small
718% \begin{verbatim}
719% hnybble0 = simd<8>::pack<h,h>(s0, s1);
720% lnybble0 = simd<8>::pack<l,l>(s0, s1);
721% hnybble1 = simd<8>::pack<h,h>(s2, s3);
722% lnybble1 = simd<8>::pack<l,l>(s2, s3);
723% hnybble2 = simd<8>::pack<h,h>(s4, s5);
724% lnybble2 = simd<8>::pack<l,l>(s4, s5);
725% hnybble3 = simd<8>::pack<h,h>(s6, s7);
726% lnybble3 = simd<8>::pack<l,l>(s6, s7);
727% hh_pair0 = simd<4>::pack<h,h>(hnybble0, hnybble1);
728% hl_pair0 = simd<4>::pack<l,l>(hnybble0, hnybble1);
729% lh_pair0 = simd<4>::pack<h,h>(lnybble0, lnybble1);
730% ll_pair0 = simd<4>::pack<l,l>(lnybble0, lnybble1);
731% hh_pair1 = simd<4>::pack<h,h>(hnybble2, hnybble3);
732% hl_pair1 = simd<4>::pack<l,l>(hnybble2, hnybble3);
733% lh_pair1 = simd<4>::pack<h,h>(lnybble2, lnybble3);
734% ll_pair1 = simd<4>::pack<l,l>(lnybble2, lnybble3);
735% bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
736% bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
737% bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
738% bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
739% bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
740% bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
741% bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
742% bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
743% \end{verbatim}
744% \caption{Complete Inductive Halving Algorithm}
745% \label{halvingalgorithm}
746% \end{figure}
748\subsection{Optimality of the Inductive Halving Algorithm}
750Here we show that the inductive halving algorithm presented in
751the previous subsection is optimal in the following sense:
752no other algorithm on any 3-register SWAR architecture can use
753fewer than 24 operations to transform eight serial registers
754of byte stream data into eight parallel registers of bit stream data.
755By 3-register SWAR architecture, we refer to any architecture
756that uses SWAR instructions consistent with our overall model of
757binary operations using two input register operands to produce
758one output register value.
760Observe that the $N$ data bits from each input register must be
761distributed $N/8$ each to the 8 output registers by virtue of
762the problem definition.  Each output register can effectively
763be given a 3-bit address; the partitioning problem can be viewed
as moving data to the correct address.  However, each
operation can move results into at most one register, and
can therefore establish at most one correct address
bit for each of the $N$ input bits.  As all $8N$ input bits
768need to be moved to a register with a correct 3-bit address,
769a minimum of 24 operations is required.
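In summary, the counting argument may be expressed as follows.

```latex
% Each of the $8N$ input bits needs 3 correct address bits, while one
% operation assigns at most one correct address bit to each of the
% $N$ bits it moves.
\[ \mathrm{operations} \;\geq\; \frac{8N \cdot 3}{N} \;=\; 24 \]
```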
771\subsection{End-to-End Significance}
773In a study of several XML technologies applied to
774the problem of GML to SVG transformation, the parabix
775implementation (parallel bit streams for XML) was
found to be the fastest, with a cost of approximately
77715 CPU cycles per input byte \cite{Herdy}.  Within parabix,
778transposition to parallel bit stream form requires
779approximately 1.1 cycles per byte \cite{CASCON08}.
780All other things being equal, a 3X speed-up of transposition
781alone would improve end-to-end performance in a
782complete XML processing application by more than 4\%.
785\section{Parallel to Serial Conversion}
787Parallel bit stream applications may apply string editing
788operations in bit space to substitute, delete or insert
789parallel sets of bits at particular positions.  In such cases,
790the inverse transform that converts a set of parallel bit
791streams back into byte space is needed.  In the example of
792UTF-8 to UTF-16 transcoding, the inverse transform is
793actually used twice for each application of the forward
794transform, to separately compute the high and low byte
795streams of each UTF-16 code unit.  Those two byte streams
796are subsequently merged to form the final result.
798Algorithms for performing the inverse transform mirror those
799of the forward transform, employing SWAR merge operations
800in place of pack operations.  The best algorithm known
801to us on the commodity SWAR architectures takes advantage
802of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
803operations that are available with each of the SSE and Altivec instruction
804sets.  To perform the full inverse transform of 8 parallel
805registers of bit stream data into 8 serial registers of byte stream data,
806a RefA implementation requires 120 operations, while a RefB
807implementation reduces this to 72.
809% \begin{figure}[tbh]
810% \begin{center}\small
811% \begin{verbatim}
812% bit01_r0 = simd<1>::mergeh(bit0, bit1);
813% bit01_r1 = simd<1>::mergel(bit0, bit1);
814% bit23_r0 = simd<1>::mergeh(bit2, bit3);
815% bit23_r1 = simd<1>::mergel(bit2, bit3);
816% bit45_r0 = simd<1>::mergeh(bit4, bit5);
817% bit45_r1 = simd<1>::mergel(bit4, bit5);
818% bit67_r0 = simd<1>::mergeh(bit6, bit7);
819% bit67_r1 = simd<1>::mergel(bit6, bit7);
820% bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
821% bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
822% bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
823% bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
824% bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
825% bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
826% bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
827% bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
828% s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
829% s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
830% s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
831% s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
832% s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
833% s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
834% s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
835% s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
836% \end{verbatim}
837% \end{center}
838% \label{p2s-inductive}
839% \caption{Parallel to Serial Transposition by Inductive Doubling}
840% \end{figure}
An algorithm employing only 24 operations using IDISA-A/B is relatively
straightforward.  In stage 1, parallel registers for individual bit streams
are first merged with bit-level interleaving
using \verb#simd<1>::mergeh# and \verb#simd<1>::mergel#
operations.  For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
two consecutive registers of bit-pair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
are then applied to merge the bit-pair streams to produce streams
852of nybbles for the high and low nybble of each byte.  Finally,
853the nybble streams are merged in stage 3 to produce the
854desired byte stream data.  The full inductive doubling
855algorithm for parallel to serial transposition thus
856requires three stages of 8 instructions each.  The algorithm is again
857optimal, requiring the fewest operations
858of any possible algorithm using any 3-register instruction set
865\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
866\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
867\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
868\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
869\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
870\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
871\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
872\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
876\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
879The existence of high-performance algorithms for transformation of
880character data between byte stream and parallel bit stream form
881in both directions makes it possible to consider applying these
882transformations multiple times during text processing applications.
883Just as the time domain and frequency domain each have their
884use in signal processing applications, the byte stream form and
885parallel bit stream form can then each be used at will in
886character stream applications.
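The merge-based inverse transform can likewise be checked against a scalar model.  The following Python sketch (hypothetical; 16-bit model registers, bit 0 most significant) implements the \verb#mergeh#/\verb#mergel# semantics and reconstructs the serial bytes from eight bit streams with 24 merge operations in the same three-stage pattern:

```python
# Hypothetical scalar model of parallel-to-serial transposition by
# inductive merging.  A register is a 16-bit int; bit 0 is the most
# significant bit (two bytes per model register).
W = 16

def fields(x, n):
    return [(x >> (W - n * (i + 1))) & ((1 << n) - 1) for i in range(W // n)]

def merge(n, which, a, b):
    """Model of simd<n>::mergeh / mergel: interleave the n-bit fields
    drawn from the high ('h') or low ('l') halves of a and b."""
    k = W // (2 * n)                       # fields per half-register
    fa = fields(a, n)[:k] if which == 'h' else fields(a, n)[k:]
    fb = fields(b, n)[:k] if which == 'h' else fields(b, n)[k:]
    out = 0
    for x, y in zip(fa, fb):
        out = (out << 2 * n) | (x << n) | y
    return out

def inverse_transpose(bit):
    """Eight parallel bit streams -> eight serial byte registers."""
    # Stage 1: bit streams -> bit-pair streams
    b01 = [merge(1, w, bit[0], bit[1]) for w in 'hl']
    b23 = [merge(1, w, bit[2], bit[3]) for w in 'hl']
    b45 = [merge(1, w, bit[4], bit[5]) for w in 'hl']
    b67 = [merge(1, w, bit[6], bit[7]) for w in 'hl']
    # Stage 2: bit-pair streams -> nybble streams
    q0123 = [merge(2, w, b01[j], b23[j]) for j in range(2) for w in 'hl']
    q4567 = [merge(2, w, b45[j], b67[j]) for j in range(2) for w in 'hl']
    # Stage 3: nybble streams -> byte streams s0 .. s7
    return [merge(4, w, q0123[j], q4567[j]) for j in range(4) for w in 'hl']
```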
890\section{Parallel Bit Deletion}
893Parallel bit deletion is an important operation that allows string
894editing operations to be carried out while in parallel bit stream
895form.  It is also fundamental to UTF-8 to UTF-16 transcoding
896using parallel bit streams, allowing the excess code unit
897positions for UTF-8 two-, three- and four-byte sequences to
898be deleted once the sixteen parallel bit streams of UTF-16 have
899been computed \cite{PPoPP08}.
901Parallel bit deletion is specified using a deletion mask.
902A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
903to be deleted and 0s at positions identifying bits to be retained.
904For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
905bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
accordance with the mask yields two five-element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
908Bit deletion may be performed using
909the parallel-prefix compress algorithm documented by
910Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
911only logic and shifts with a constant parameter to carry
912out the deletion process.  However, it requires $k^2$
913preprocessing steps for a final field width parameter
914of size $2^k$, as well as 4 operations per deletion step
915per stream.  Using the inductive doubling instruction set architecture
916it is possible to carry out bit deletion much more efficiently.
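For reference, the parallel-prefix compress algorithm can be sketched as a scalar Python model (shown for a 32-bit word; note that in Warren's formulation the mask selects the bits to retain, i.e., the complement of a deletion mask):

```python
# Scalar Python model of the parallel-prefix compress algorithm
# (after Warren).  Bits of x selected by 1s in m are gathered at the
# low end of a 32-bit word; m is thus the complement of a deletion mask.
M32 = 0xFFFFFFFF

def compress(x, m):
    x &= m
    mk = (~m << 1) & M32                 # count 0s to the right
    for i in range(5):                   # 5 steps for a 2^5-bit word
        mp = mk ^ ((mk << 1) & M32)      # parallel prefix (xor scan)
        for j in (2, 4, 8, 16):
            mp ^= (mp << j) & M32
        mv = mp & m                      # bits to move in this step
        m = (m ^ mv) | (mv >> (1 << i))  # compress the mask itself
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))    # compress x
        mk &= ~mp
    return x

def compress_ref(x, m):
    """Bit-at-a-time reference used to check the model."""
    out, k = 0, 0
    for i in range(32):
        if (m >> i) & 1:
            out |= ((x >> i) & 1) << k
            k += 1
    return out
```

The $k^2$-growth in preprocessing is visible in the nested structure: each of the 5 outer steps recomputes a full xor-scan.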
Deletion within fixed-size fields or registers may produce results that are either
left-justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
920eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
921care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
922left-justified result produces an important intermediate form known as a
923{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
924right-justified and left-justified results from the application of the
9254-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
926stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
the central result \verb:xxbdefhx:, which may easily be converted to either a
left- or a right-justified 8-element result by an appropriate shift operation.
The observation about how two $n$-bit central deletion results can
combine to yield a $2n$-bit central deletion result provides the basis
933for an inductive doubling algorithm.  Figure \ref{centraldelstep}
934illustrates the inductive process for the transition from 8-bit central
935deletion results to 16-bit central deletion results.  The top row shows
936the original deletion mask, while the second row shows the original
937bit stream to which deletions are to be applied, with deleted bits zeroed out.
938The third row shows the central result for each 8-bit field as the
939result of the previous inductive step.
941To perform the $8 \rightarrow 16$ central deletion step, we first form
942the population counts of 4-bit fields of the original deletion mask as
943shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
944an 8-bit central result, we perform a right shift by the population count
945of the low half of the field.  Similarly,
946left-justification requires a left-shift by the population count in the
947high half of the field.
949The left and right shifts can be performed simultaneously using a rotate
950left instruction.  Right justification by shifting an $n$ bit field
951$i$ positions to the right is equivalent to a left rotate of $n-i$
952positions.  Given a register value \verb:c8: preloaded with
the value 8 in each 8-bit field, the rotation
amounts for right justification are computed by the operation
\verb#rj=simd<8>::sub<x,l>(c8, cts_4)#, producing the values shown in row 5,
956except that don't care fields (which won't be subsequently used)
957are marked \verb:XX:. 
The left-rotation amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
960producing the values shown in row 6, and are then combined with the right shift amounts
961by the selection operation  \verb#rot_8=simd_if(mask0xFF00, rj, lj)#
962as shown in row 7.  Using these computed values, the inductive step
963is completed by application of the operation  \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
964as shown in row 8.
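The step can be traced concretely.  The following Python sketch (a hypothetical model; registers are 32-character bit strings, most significant bit first) reproduces the $8 \rightarrow 16$ step on the deletion mask of Figure \ref{centraldelstep} and checks the outcome against a direct computation of the 16-bit central results:

```python
# Hypothetical model of the 8 -> 16 central deletion step.  Registers
# are 32-character bit strings, most significant bit first.
mask = "10011100010011110111001000110010"   # deletion mask of the figure
# survivors carry a deterministic pattern; deleted positions are zeroed
data = "".join('1' if mask[i] == '0' and i % 3 != 0 else '0'
               for i in range(32))

popc = lambda s: s.count('1')
rotl = lambda s, r: s[r:] + s[:r]            # rotate one 8-bit field left

def central8(m, d):
    """8-bit central result for one byte: leading zeros for high-nybble
    deletions, the survivors, then trailing zeros for low-nybble ones."""
    surv = "".join(d[i] for i in range(8) if m[i] == '0')
    return '0' * popc(m[:4]) + surv + '0' * popc(m[4:])

rslt_8 = "".join(central8(mask[i:i+8], data[i:i+8]) for i in range(0, 32, 8))

cts_4 = [popc(mask[i:i+4]) for i in range(0, 32, 4)]   # 4-bit popcounts
rot_8 = []
for b in range(4):                 # one rotation amount per byte field
    rj = 8 - cts_4[2 * b + 1]      # simd<8>::sub<x,l>(c8, cts_4)
    lj = cts_4[2 * b]              # simd<8>::srli<4>(cts_4)
    rot_8.append((rj if b % 2 == 0 else lj) % 8)   # simd_if selection

rslt_16 = "".join(rotl(rslt_8[8*b:8*b+8], rot_8[b]) for b in range(4))

# direct computation of the 16-bit central results, for checking
expected = ""
for f in range(0, 32, 16):
    m, d = mask[f:f+16], data[f:f+16]
    surv = "".join(d[i] for i in range(16) if m[i] == '0')
    expected += '0' * popc(m[:8]) + surv + '0' * popc(m[8:])
```

The computed rotation amounts reproduce row 7 of the figure (6, 1, 7, 2).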
At each inductive doubling level, 4 operations are required to compute the
deletion information, plus one operation per bit stream to perform the deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is reused for each stream without recomputation,
thus requiring 12 operations per inductive level.
972In comparison to the parallel-prefix compress method, the method of central
deletion results using the inductive doubling architecture requires far fewer operations.
974The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
975induction versus $4k^2$ for the parallel-prefix method.  Using the computed
976deletion operation, only a single SWAR rotate operation per bit stream
977per level is needed, in comparison with 4 operations per level for parallel-prefix
982\section{Beyond Parallel Bit Streams}
984IDISA has a variety of applications in domains beyond
985text processing with parallel bit streams.  We present
986a number of examples in this section, including,
most significantly, a fully general solution to the problem of supporting
988{\em horizontal} SWAR operations.
992% \begin{figure}[h]
993% \begin{center}\small
994% \begin{verbatim}
995% y = x ^ (x >> 1);
996% y = y ^ (y >> 2);
997% y = y ^ (y >> 4);
998% y = y ^ (y >> 8);
999% y = y ^ (y >>16);
1000% y = y & 1;
1001% \end{verbatim}
1002% \end{center}
1003% \caption{Parity Reference Algorithm}
1004% \label{HD-parity}
1005% \end{figure}
1010y = simd<2>::xor<h,l>(x, x);
1011y = simd<4>::xor<h,l>(y, y);
1012y = simd<8>::xor<h,l>(y, y);
1013y = simd<16>::xor<h,l>(y, y);
1014y = simd<32>::xor<h,l>(y, y);
1017\caption{IDISA Parity Implementation}
Parity has important applications in error-correcting
codes, such as the various Hamming codes, which detect
and correct a number of bit errors that depends on the
number of parity bits added.
1025Figure \ref{ID-parity} shows an IDISA-A parity implementation
1026with only 5 operations required for 32-bit fields,
1027slightly more than a 2X improvement over the 11 operations
1028required in a RefB implementation following Warren
\cite{HackersDelight}. The improvement is less than the
3X seen in other cases because one of the operands need
1031not be modified before applying the exclusive-or operation.
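The behavior of the cascade can be confirmed with a scalar model.  The following Python sketch (hypothetical) models \verb#simd<n>::xor<h,l>(x, x)# on a single 32-bit field and folds it five times, exactly as in the IDISA implementation:

```python
# Hypothetical scalar model of the IDISA parity cascade on a single
# 32-bit field.  xor_hl models simd<n>::xor<h,l>(x, x): each n-bit
# field becomes the xor of its high and low n/2-bit halves.
W = 32

def xor_hl(n, x):
    out = 0
    for i in range(W // n):
        f = (x >> (W - n * (i + 1))) & ((1 << n) - 1)
        out = (out << n) | ((f >> (n // 2)) ^ (f & ((1 << (n // 2)) - 1)))
    return out

def parity32(x):
    for n in (2, 4, 8, 16, 32):   # five inductive doubling levels
        x = xor_hl(n, x)
    return x & 1
```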
1033\subsection{Bit Reverse}
1035Bit reverse is an important operation needed in a number
1036of low level codecs.  Following Warren's inductive
1037doubling implementation using masks and shifts \cite{HackersDelight},
1038a RefA implementation on 32-bit fields requires 28
1039operations, while a straightforward IDISA-A implementation
1040using \verb#simd<n>::rotli# at each inductive doubling
1041level requires only 5 operations.
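Warren's method can be sketched in scalar form as follows (a Python model of the 32-bit case; each line is one inductive doubling level, which IDISA would replace with a single \verb#simd<n>::rotli# operation):

```python
# Scalar model of mask-and-shift bit reversal (after Warren) for one
# 32-bit value; each line is one inductive doubling level.
def reverse32(x):
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555)  # swap bits
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333)  # swap pairs
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F)  # swap nybbles
    x = ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)  # swap bytes
    return ((x << 16) | (x >> 16)) & 0xFFFFFFFF            # swap halves
```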
1042\subsection{Packed DNA Representation}
1044DNA sequences are often represented as strings consisting
of the four nucleotide codes A, C, G and T.  Internally,
these sequences are frequently represented
as packed sequences of 2-bit values.  The IDISA
\verb#simd<8>::pack# and \verb#simd<4>::pack# operations
1049allow these packed representations to be quickly computed
1050from byte-oriented string values by two steps of inductive
1051halving.   Similarly, conversion back to string form
1052can use two steps of inductive merging.  Without direct
1053support for these pack and merge operations, the SWAR
1054implementations of these conversions require the cost
1055of explicit masking and shifting in combination with
1056the 16-bit to 8-bit packing and 8-bit to 16-bit
1057merging operations supported by existing SWAR facilities
1058on commodity processors.
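The two halving steps can be modeled as follows (a hypothetical Python sketch; the 2-bit code (c >> 1) & 3, mapping A, C, G, T to 0, 1, 3, 2, is one possible encoding, not one prescribed by the text):

```python
# Hypothetical model of packing nucleotide bytes into 2-bit codes by
# two inductive halving steps.  A register is a 32-bit int of four
# byte fields; bit 0 is most significant.  The encoding (c >> 1) & 3
# (A->0, C->1, G->3, T->2) is one possible choice.
W = 32

def pack_ll(n, a, b):
    """Model of simd<n>::pack<l,l>(a, b): keep the low n/2 bits of
    every n-bit field of a, then of b."""
    out = 0
    for x in (a, b):
        for i in range(W // n):
            f = (x >> (W - n * (i + 1))) & ((1 << n) - 1)
            out = (out << (n // 2)) | (f & ((1 << (n // 2)) - 1))
    return out

dna = "ACGTTGCAGGATCCAT"
codes = [(ord(c) >> 1) & 3 for c in dna]
regs = [0, 0, 0, 0]          # four byte-oriented registers, 4 codes each
for i, code in enumerate(codes):
    regs[i // 4] = (regs[i // 4] << 8) | code
n0 = pack_ll(8, regs[0], regs[1])    # step 1: bytes -> nybbles
n1 = pack_ll(8, regs[2], regs[3])
packed = pack_ll(4, n0, n1)          # step 2: nybbles -> 2-bit fields
```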
1060\subsection{String/Decimal/Integer Conversion}
1062Just as DNA sequences represent an important use case for
1063SWAR operations on 2-bit fields, packed sequences of
1064decimal or hexadecimal digits represent a common use case
1065for 4-bit fields.  These representations can be used
1066both as an intermediate form in numeric string to integer
1067conversion and as a direct representation for
1068packed binary coded decimal.
b=(d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
b=(b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF)
b=(b & 0x0000FFFF) + 10000 * (b >> 16)
1078\caption{BCD to Integer Reference Algorithm}
1088b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d)
1089b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b)
1090b=simd<32>::add<x,l>(simd<32>::mult<h,x>(b,t3), b)
1093\caption{IDISA BCD to Integer}
1097Figure \ref{BCD2int} shows a three-step inductive
1098doubling implementation for conversion of 32-bit packed BCD
1099values to integer form.  The 32-bit value consists
1100of 8 4-bit decimal digits.  Pairs of digits are
1101first combined by multiplying the higher digit
1102of the pair by 10 and adding.  Pairs of these
1103two-digit results are then further combined by
1104multiplying the value of the higher of the two-digit
1105results by 100 and adding.  The final step is
1106to combine four-digit results by multiplying the
1107higher one by 10000 and adding.  Overall, 20
1108operations are required for this implementation
1109as well as the corresponding RefA implementation
1110for sets of 32-bit fields.  Under the RefB model, preloading of
11116 constants into registers for repeated use can reduce the
1112number of operations to 14 at the cost of register
1115The IDISA implementation of this algorithm is shown
1116in Figure \ref{ID-BCD2int}.  This implementation
1117shows an interesting variation in the use of
1118half-operand modifiers, with only one operand
1119of each of the addition and multiplication operations
1120modified at each level.  Overall, the IDISA-A implementation
1121requires 9 operations, while the IDISA-B model requires
11226 operations with 3 preloaded registers.
1123In either case, this represents more than a 2X
1124reduction in instruction count as well as a 2X reduction
1125in register pressure.
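The three-step arithmetic of Figure \ref{BCD2int} can be checked with a scalar Python model (operating on one 32-bit value; note that the second and third steps operate on the result \verb:b: of the preceding step):

```python
# Scalar model of the three inductive doubling steps that convert a
# 32-bit packed BCD value (8 decimal digits) to its binary integer
# value; each step operates on the result b of the preceding step.
def bcd2int(d):
    b = (d & 0x0F0F0F0F) + 10 * ((d >> 4) & 0x0F0F0F0F)
    b = (b & 0x00FF00FF) + 100 * ((b >> 8) & 0x00FF00FF)
    return (b & 0x0000FFFF) + 10000 * (b >> 16)
```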
1128\subsection{Further Applications}
1131Further applications of IDISA can often be found
1132by searching for algorithms employing the magic masks
1133 \verb:0x55555555:,  \verb:0x33333333:, and so on.
1134Examples include the bit-slice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
1135and integer contraction and dilation for quadtrees and
octrees \cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
Pixel packing from 32-bit fields into a 5:5:5 representation
1138is a further application of parallel bit deletion.
1140\subsection{Systematic Support for Horizontal SWAR Operations}
1142In SWAR parlance, {\em horizontal} operations are
1143operations which combine values from two or more fields
1144of the same register, in contrast to the normal
1145{\em vertical} operations which combine corresponding
1146fields of different registers.  Horizontal operations
1147can be found that combine two (e.g., \verb:haddpd: on SSE3),
1148four (e.g, \verb:si_orx: on SPU), eight (e.g, \verb:psadbw: on SSE)
1149or sixteen values (e.g., \verb:vcmpequb: on Altivec).  Some
1150horizontal operations have a vertical component as well.
1151For example, \verb:psadbw: first forms the absolute value of
1152the difference of eight corresponding byte fields before
1153performing horizontal add of the eight values, while
1154\verb:vsum4ubs: on Altivec performs horizontal add of sets of
1155four unsigned 8-bit fields within one register
1156and then combines the result horizontally with
corresponding 32-bit fields of a second register.
1159The space of potential horizontal operations thus has
1160many dimensions, including not only the particular
1161combining operation and the operand field width, but
1162also the number of fields being combined, whether a
1163vertical combination is applied and whether it is applied
1164before or after the horizontal operation and what the
1165nature of the vertical combining operation is.
1166Within this space, commodity SWAR architectures tend
1167to support only a very few combinations, without any
1168particular attempt at systematic support for horizontal
1169operations in general.
1171In contrast to this {\em ad hoc} support on commodity
1172processors, IDISA offers a completely systematic treatment
1173of horizontal operations without any special features beyond
1174the inductive doubling features already described.
1175In the simplest case, any vertical operation
1176\verb#simd<n>::F# on $n$-bit fields gives rise to
1177an immediate horizontal operation
1178\verb#simd<n>::F<h,l>(r, r)# for combining adjacent
1179pairs of $n/2$ bit fields.
1180For example, \verb#simd<16>::add<h,l># adds values
in adjacent 8-bit fields to produce 16-bit results,
1182while \verb#simd<32>::min<h,l># can produce the
1183minimum value of adjacent 16-bit fields.
1184Thus any binary horizontal operation can be implemented
1185in a single IDISA instruction making use of the \verb:<h,l>:
1186operand modifier combination.
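For instance, a scalar model (hypothetical Python sketch) of \verb#simd<16>::add<h,l>(r, r)# shows the pairwise horizontal addition of adjacent bytes:

```python
# Hypothetical scalar model of simd<16>::add<h,l>(r, r) on a 32-bit
# register: each 16-bit field receives the sum of its two 8-bit
# subfields, i.e., a horizontal pairwise add of adjacent bytes.
W = 32

def add_hl(n, r):
    out = 0
    for i in range(W // n):
        f = (r >> (W - n * (i + 1))) & ((1 << n) - 1)
        hi, lo = f >> (n // 2), f & ((1 << (n // 2)) - 1)
        out = (out << n) | ((hi + lo) & ((1 << n) - 1))
    return out
```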
1188Horizontal combinations of four adjacent fields can also be
1189realized in a general way through two steps of inductive
1190doubling.  For example, consider the or-across operation \verb:si_orx:
of the SPU, which performs a logical or operation
1192on four 32-bit fields.  This four field combination
1193can easily be implemented with the following two operations.
1196t = simd<64>::or<h,l>(x, x)
1197t = simd<128>::or<h,l>(t, t)
1201In general, systematic support for horizontal
1202combinations of sets of $2^h$ adjacent fields may
be realized through $h$ inductive doubling steps
1204in a similar fashion.
1205Thus, IDISA essentially offers systematic support
1206for horizontal operations entirely through the
1207use of \verb:<h,l>: half-operand modifier
1210Systematic support for general horizontal operations
1211under IDISA also creates opportunity for a design tradeoff:
1212offsetting the circuit complexity of half-operand
1213modifiers with potential elimination of dedicated
1214logic for some {\em ad hoc} horizontal SWAR operations.
1215Even if legacy support for these operations is required,
1216it may be possible to provide that support through
1217software or firmware rather than a full hardware
1218implementation.  Evaluation of these possibilities
1219in the context of particular architectures is a potential
1220area for further work.
1225IDISA may be implemented as a software
1226abstraction on top of existing SWAR architectures or
1227directly in hardware.  In this section, we briefly
1228discuss implementation of IDISA libraries before
1229moving on to consider hardware design.  Although
1230a full realization of IDISA in hardware is beyond our
1231current capabilities, our goal is to develop a sufficiently
1232detailed design to assess the costs of IDISA implementation
1233in terms of the incremental complexity over the RefA
1234and RefB architectures.  The cost factors we consider, then,
1235are the implementation of the half-operand modifiers, and
1236the extension of core operations to the 2-bit, 4-bit and
1237128-bit field widths.  In each case, we also discuss
1238design tradeoffs.
1240\subsection{IDISA Libraries}
1242Implementation of IDISA instructions using template
1243and macro libraries has been useful in developing
1244and assessing the correctness of many of the algorithms
1245presented here.  Although these implementations do not
1246deliver the performance benefits associated with
1247direct hardware implementation of IDISA, they
1248have been quite useful in providing a practical means
1249for portable implementation of parallel bit stream
1250algorithms on multiple SWAR architectures.  However,
1251one additional facility has also proven necessary for
1252portability of parallel bit stream algorithms across
1253big-endian and little-endian architectures: the
1254notion of shift-forward and shift-back operations.
1255In essence, shift forward means shift to the left
1256on little-endian systems and shift to the right on
1257big-endian systems, while shift back has the reverse
1258interpretation.  Although this concept is unrelated to
1259inductive doubling, its inclusion with the IDISA
1260libraries has provided a suitable basis for portable
1261SWAR implementations of parallel bit stream algorithms.
1262Beyond this, the IDISA libraries have the additional
1263benefit of allowing the implementation
of inductive doubling algorithms at a higher level of
abstraction, without the need for programmer coding of
1266the underlying shift and mask operations.
1268\subsection{IDISA Model}
1271\includegraphics[width=85mm, trim= 40 350 0 50]{IDISA.pdf}
1272\caption{IDISA Block Diagram}
1277Figure \ref{pipeline-model} shows a block diagram
1278for a pipelined SWAR processor implementing IDISA.
1279The SWAR Register File (SRF) provides a file of $R = 2^A$ 
1280registers each of width $N = 2^K$ bits. 
1281IDISA instructions identified by the Instruction Fetch
1282Unit (IFU) are forwarded for decoding to the SWAR
1283Instruction Decode Unit (SIDU).  This unit decodes
1284the instruction to produce
1285signals identifying the source and destination
1286operand registers, the half-operand modifiers, the
1287field width specification and the SWAR operation
1288to be applied.
1290The SIDU supplies the source register information and the half-operand
1291modifier information to the SWAR Operand Fetch Unit (SOFU).
1292For each source operand, the SIDU provides an $A$-bit register
1293address and two 1-bit signals $h$ and $l$ indicating the value
1294of the decoded half-operand modifiers for this operand.
1295Only one of these values may be 1; both are 0 if
1296no modifier is specified.
1297The SIDU also supplies decoded field width signals $w_k$
1298for each field width $2^k$ to both the SOFU and to the
1299SWAR Instruction Execute Unit (SIEU).  Only one of the
1300field width signals has the value 1.
1301The SIDU also supplies decoded SWAR opcode information to SIEU and
1302a decoded $A$-bit register address for the destination register to
1303the SWAR Result Write Back Unit (SRWBU).
1305The SOFU is the key component of the IDISA model that
1306differs from that found in a traditional SWAR
1307processor.  For each of the two $A$-bit source
1308register addresses, SOFU is first responsible for
1309fetching the raw operand values from the SRF.
1310Then, before supplying operand values to the
1311SIEU, the SOFU applies the half-operand modification
1312logic as specified by the $h$, $l$, and field-width
1313signals.  The possibly modified operand values are then
1314provided to the SIEU for carrying out the SWAR operations.
1315A detailed model of SOFU logic is described in the following
1318The SIEU differs from similar execution units in
1319current commodity processors primarily by providing
1320SWAR operations at each field width
1321$n=2^k$ for $0 \leq k \leq K$.  This involves
1322additional circuitry for field widths not supported
1323in existing processors.  In our evaluation model,
1324IDISA-A adds support for 2-bit, 4-bit and 128-bit
1325field widths in comparison with the RefA architecture,
1326while IDISA-B similarly extends RefB.
1328When execution of the SWAR instruction is
1329completed, the result value is then provided
1330to the SRWBU to update the value stored in the
1331SRF at the address specified by the $A$-bit
1332destination operand.
1334\subsection{Operand Fetch Unit Logic}
1336The SOFU is responsible for implementing the half-operand
1337modification logic for each of up to two input operands fetched
1338from SRF.  For each operand, this logic is implemented
using the decoded half-operand modifier signals $h$ and $l$,
1340the decoded field width signals $w_k$ and the 128-bit operand
1341value $r$ fetched from SRF to produce a modified 128-bit operand
1342value $s$ following the requirements of equations (4), (5) and
1343(6) above.  Those equations must be applied for each possible
1344modifier and each field width to determine the possible values $s[i]$
1345for each bit position $i$.  For example, consider bit
1346position 41, whose binary 7-bit address is $0101001$.
1347Considering the address bits left to right, each 1 bit
1348corresponds to a field width for which this bit lies in the
1349lower $n/2$ bits (widths 2, 16, 64), while each 0 bit corresponds to a field
1350width for which this bit lies in the high $n/2$ bits.
1351In response to the half-operand modifier signal $h$,
1352bit $s[41]$ may receive a value from the corresponding high bit in the field
1353of width 2, 16 or 64, namely $r[40]$,
1354$r[33]$ or $r[9]$.   Otherwise, this bit receives the value $r[41]$,
1355in the case of no half-operand modifier, or a low half-operand modifier
1356in conjunction with a field width signal $w_2$, $w_{16}$ or $w_{64}$.
1357The overall logic for determining this bit value is thus given as follows.
1359s[41] & = & h \wedge (w_2 \wedge r[40] \vee w_{16} \wedge r[33] \vee w_{64} \wedge r[9]) \\
1360& & \vee \neg h \wedge (\neg l \vee w_2 \vee w_{16} \vee w_{64}) \wedge r[41]
1363Similar logic is determined for each of the 128 bit positions.
1364For each of the 7 field widths, 64 bits are in the low $n/2$ bits,
1365resulting in 448 2-input and gates for the $w_k \wedge r[i]$ terms.
For 120 of the bit positions, or gates are needed to combine these
terms; the 7 positions with a single term need none, so the remaining
$448 - 7 = 441$ terms require $441 - 120 = 321$ 2-input or gates.  Another
1368127 2-input and gates combine these values with the $h$ signal.
1369In the case of a low-half-operand modifier, the or-gates combining $w_k$
1370signals can share circuitry.  For each bit position $i=2^k+j$ one
1371additional or gate is required beyond that for position $j$.
1372Thus 127 2-input or gates are required.  Another 256 2-input and gates
1373are required for combination with the $\neg h$  and $r[i]$ terms.  The terms for
1374the low and high half-operand modifiers are then combined with an
1375additional 127 2-input or gates.   Thus, the circuitry complexity
1376for the combinational logic implementation of half-operand
1377modifiers within the SOFU is 1279 2-input gates per operand,
1378or 2558 gates in total.
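These gate counts follow directly from the popcounts of the 7-bit addresses of
the 128 bit positions; the short enumeration below (our own check, not part of
the design) reproduces the figures for the high half-operand path.

```python
# Enumerate the 7-bit addresses of the 128 bit positions; the popcount
# of an address is the number of field widths placing that bit in the
# low half of its field, per the addressing scheme described above.
popcounts = [bin(i).count("1") for i in range(128)]

and_gates = sum(popcounts)                        # w_k AND r[.] terms
or_positions = sum(1 for p in popcounts if p >= 2)
or_gates = sum(p - 1 for p in popcounts if p >= 1)
h_gates = sum(1 for p in popcounts if p >= 1)     # AND with the h signal

assert and_gates == 448       # 7 field widths x 64 low-half positions
assert or_positions == 120    # positions needing or gates
assert or_gates == 321        # (448 - 7) - 120 two-input or gates
assert h_gates == 127         # every position except address 0000000
```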
1380The gate-level complexity of half-operand modifiers as described is nontrivial,
1381but modest.   However, one possible design tradeoff is to
1382differentiate the two operands, permitting a high half-operand
1383modifier to be used only with the first operand and a low-modifier
1384to be used only with the second operand.  This would exclude
1385\verb#<h,h># and \verb#<l,l># modifier combinations and also
1386certain combinations for noncommutative core operations. 
1387The principal
1388consequence for the applications considered here would be with
1389respect to the pack operations in forward transposition, but it
1390may be possible to address this through SIEU circuitry.
1391If this approach were to be taken, the gate complexity of
1392half-operand modification would be reduced by slightly more than half.
1394\subsection{2-Bit and 4-Bit Field Widths}
1396Beyond the half-operand modifiers, extension of core SWAR
1397operations to 2-bit and 4-bit field widths is critical to
1398inductive doubling support.  The principal operations
that need to be supported in this way are addition, pack, merge,
and rotate.
1402Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
1403as a modification to that for 8-bit fields by incorporating logic to
1404disable carry propagation at the 16 mid-field boundaries.  For 2-bit
1405fields, disabling carry propagation at 32 additional boundaries suffices,
1406although it may be simpler to directly implement the simple logic
1407of 2-bit adders.
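In software, the same carry-suppression idea appears as the well-known
partitioned-add identity.  The sketch below is our illustration of the effect
for 4-bit fields in a 128-bit word, not the proposed hardware, which simply
gates the carries at field boundaries.

```python
# Partitioned add for 4-bit fields in a 128-bit word: clear the top
# bit of each field, add, then restore the top bits without allowing
# any carry to cross a field boundary.
W = 128
MASK = (1 << W) - 1
H4 = int("8" * (W // 4), 16)          # top bit of every 4-bit field

def add4(a, b):
    """Add a and b independently within each 4-bit field."""
    low = (a & ~H4 & MASK) + (b & ~H4 & MASK)
    return (low ^ ((a ^ b) & H4)) & MASK

assert add4(0xF, 0x1) == 0x0          # wraps within its own field
assert add4(0x12, 0x34) == 0x46       # adjacent fields add independently
```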
1409Pack and merge require bit selection logic for each field width.
1410A straightforward implementation model for each operation
1411uses 128 2-input and gates to select the desired bits from the
1412operand registers and another 128 2-input or gates to feed
1413these results into the destination register.
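To make the intended selection concrete, the following behavioral sketch (our
own, assuming Altivec-style merge semantics) shows merge at field width $n$
interleaving the $n$-bit fields drawn from the high halves of the two operands;
the hardware realizes each such selection with the and/or networks just
described.

```python
# Sketch (ours) of a mergeh at field width n: interleave the n-bit
# fields from the high halves of operands a and b.  Fields are numbered
# most significant first, matching the big-endian bit numbering above.

def fields(x, n, W=128):
    """Split x into W//n fields of n bits, most significant first."""
    return [(x >> (W - n * (i + 1))) & ((1 << n) - 1) for i in range(W // n)]

def mergeh(a, b, n, W=128):
    out = 0
    half = W // (2 * n)                        # fields per half register
    for fa, fb in zip(fields(a, n, W)[:half], fields(b, n, W)[:half]):
        out = (out << (2 * n)) | (fa << n) | fb
    return out

# Interleaving single bits from the high halves of two 8-bit operands:
assert mergeh(0xF0, 0x00, 1, 8) == 0xAA
```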
1415Rotation for 2-bit fields involves simple logic for selecting
1416between the 2 bits of each field of the operand being
1417rotated on the basis of the low bit of each field of the
1418rotation count.  Rotation for 4-bit fields is more complex,
1419but can also be based on 1-of-4 selection circuitry
1420involving the low 2 bits of the rotation count fields.
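The 2-bit case can be modeled as follows (our own sketch): only the low bit of
each rotation-count field matters, and a count of 1 simply swaps the two bits
of the corresponding operand field.

```python
# Per-field rotate for 2-bit fields: the low bit of each field of the
# count operand b selects between a field of a and its bit-swap.

def rotl2(a, b, W=128):
    out = 0
    for i in range(0, W, 2):                 # one 2-bit field at a time
        f = (a >> i) & 0b11
        if (b >> i) & 0b1:                   # rotate by 1: swap the bits
            f = ((f << 1) | (f >> 1)) & 0b11
        out |= f << i
    return out

assert rotl2(0b01, 0b01) == 0b10            # count 1 swaps the two bits
assert rotl2(0b01, 0b00) == 0b01            # count 0 is the identity
assert rotl2(0b0111, 0b0100) == 0b1011      # fields rotate independently
```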
1422\subsection{128-Bit Field Widths}
1424For completeness, the IDISA model requires core operations
1425to be implemented at the full register width, as well
1426as power-of-2 partitions.  This may be problematic for
1427operations such as addition due to the inherent delays
1428in 128-bit carry propagation.  However, the primary
role of 128-bit operations in inductive doubling
1430is to combine two 64-bit fields using \verb#<h,l>#
1431operand combinations.  In view of this, it may be
1432reasonable to define hardware support for such
1433combinations to be based on 64-bit logic, with support
1434for 128-bit logic implemented through firmware or software.
1436\subsection{Final Notes and Further Tradeoffs}
1438In order to present IDISA as a concept for design extension of
1439any SWAR architecture, our discussion of gate-level implementation
1440is necessarily abstract.  Additional circuitry is sure to be
required, for example, in the implementation of the SIDU.  However,
in the context of the 128-bit reference architectures studied,
our analysis suggests that a realistic IDISA implementation fits well
within a 10,000 gate budget.
Moreover, the additional circuitry required may be offset
1447by elimination of special-purpose instructions found in
1448existing processors that could instead be implemented through efficient
IDISA sequences.  These include examples such as population
count, counting of leading and/or trailing zeroes, and parity.
1451They also include specialized horizontal SWAR operations.
1452Thus, design tradeoffs can be made with the potential of
1453reducing the chip area devoted to special purpose instructions
1454in favor of more general IDISA features.
\section{Conclusions}

In considering the architectural support for
1459SWAR text processing using the method of parallel bit streams,
1460this paper has presented the principle of inductive doubling
1461and a realization of that principle in the form of
1462IDISA, a modified SWAR instruction set architecture.
1463IDISA offers support for SWAR operations at all power-of-2
field widths, including, in particular, the 2-bit and 4-bit widths,
as well as half-operand modifiers and pack and merge operations
to support efficient transitions between successive power-of-2
field widths.  The principal innovation is the notion of
1468half-operand modifiers that eliminate the cost associated
1469with the explicit mask and shift operations required for
1470such transitions.
1472Several algorithms key to parallel bit stream methods
1473have been examined and shown to benefit from dramatic
1474reductions in instruction count compared to the best
1475known algorithms on reference architectures.  This
1476includes both a reference architecture modeled on
1477the SWAR capabilities of the SSE family as well as
1478an architecture incorporating the powerful permute
1479or shuffle capabilities found in Altivec or Cell BE processors.
1480In the case of transposition algorithms to and from parallel bit stream
1481form, the architecture has been shown to make possible
1482straightforward inductive doubling algorithms with a 3X
1483speedup over the best known versions on permute-capable
1484reference architectures, achieving the lowest total number
1485of operations of any possible 3-register SWAR architecture.
1487Applications of IDISA in other areas have also been
1488examined.  The support for 2-bit and 4-bit field widths
1489in SWAR processing is beneficial for packed DNA representations
1490and packed decimal representations respectively.  Additional
1491inductive doubling examples are presented and the phenomenon
1492of power-of-2 transitions discussed more broadly.
1493Most significantly, IDISA supports a fully general approach
1494to horizontal SWAR operations, offering a considerable
1495improvement over the {\em ad hoc} sets of special-purpose
1496horizontal operations found in existing SWAR instruction sets.
1498An IDISA implementation model has been presented employing
1499a customized operand fetch unit to implement the half-operand
1500modifier logic.  Gate-level implementation of this unit
1501and operations at the 2-bit and 4-bit field widths have
1502been analyzed and found to be quite reasonable within
1503a 10,000 gate budget.  Design tradeoffs to reduce the cost
1504have also been discussed, possibly even leading to a
1505net complexity reduction through elimination of instructions
1506that implement special-case versions of inductive doubling.
1508Future research may consider the extension of inductive doubling
1509support in further ways.   For example, it may be possible
1510to develop a pipelined architecture supporting two or three
1511steps of inductive doubling in a single operation. 
1514%\section{Appendix Title}
1516%This is the text of the appendix, if you need one.
\acks

This research was supported by a Discovery Grant from the
1521Natural Sciences and Engineering Research Council of Canada.
1529Krste Asanovic, Ras Bodik, Bryan~Christopher Catanzaro, Joseph~James Gebis,
1530  Parry Husbands, Kurt Keutzer, David~A. Patterson, William~Lester Plishker,
1531  John Shalf, Samuel~Webb Williams, and Katherine~A. Yelick.
1532\newblock The landscape of parallel computing research: A view from berkeley.
1533\newblock Technical Report UCB/EECS-2006-183, EECS Department, University of
1534  California, Berkeley, Dec 2006.
1537Robert~D. Cameron.
1538\newblock {\tt u8u16} -- a high-speed {UTF-8 to UTF-16} transcoder using
1539  parallel bit streams.
1540\newblock Technical Report TR 2007-18, Simon Fraser University, Burnaby, BC,
1541  Canada, 2007.
1544Robert~D. Cameron.
1545\newblock A case study in {SIMD} text processing with parallel bit streams.
1546\newblock In {\em {ACM Symposium on Principles and Practice of Parallel
1547  Programming (PPoPP)}}, {Salt Lake City, Utah}, 2008.
1550Robert~D. Cameron, Kenneth~S. Herdy, and Dan Lin.
1551\newblock High performance {XML} parsing using parallel bit stream technology.
\newblock In {\em {CASCON '08: Proceedings of the 2008 conference of the Centre
  for Advanced Studies on Collaborative Research}}, {Toronto, Ontario,
  Canada}, 2008.
1557James~R. Green, Hanan Mahmoud, and Michel Dumontier.
1558\newblock Towards real time protein identification using the {Cell BE}.
\newblock In {\em Workshop on Cell BE and Heterogeneous Multicore Systems:
  Architectures and Applications at CASCON '08}, {Toronto, Ontario, Canada},
  2008.
1564Kenneth~S. Herdy, David~S. Burggraf, and Robert~D. Cameron.
1565\newblock High performance {GML to SVG} transformation for the visual
1566  presentation of geographic data in web-based mapping systems.
\newblock In {\em {Proceedings of SVG Open 2008}}, {Nuremberg, Germany}, 2008.
1570Donald~E. Knuth.
1571\newblock {The Art of Computer Programming Volume 4 Pre-Fascicle 1A: Bitwise
1572  Tricks and Techniques}.
1573\newblock Draft of 22 December 2008, Stanford University.
1576R.~Raman and D.S. Wise.
1577\newblock Converting to and from dilated integers.
1578\newblock {\em {IEEE Transactions on Computers}}, 57(4):567--573, 2008.
1581Chester Rebeiro, A.~David Selvakumar, and A.~S.~L. Devi.
\newblock Bitslice implementation of {AES}.
1583\newblock In David Pointcheval, Yi~Mu, and Kefei Chen, editors, {\em CANS},
1584  volume 4301 of {\em Lecture Notes in Computer Science}, pages 203--212.
1585  Springer, 2006.
1588Leo Stocco and G\"{u}nther Schrack.
1589\newblock Integer dilation and contraction for quadtrees and octrees.
1590\newblock In {\em {Proceedings of the IEEE Pacific Rim Conference on
1591  Communications, Computers, and Signal Processing}}, pages 426--428,
1592  {Victoria, B.C.}, 1995.
1595Henry~S. Warren.
1596\newblock {\em Hacker's Delight}.
1597\newblock Addison-Wesley, 2002.