source: docs/ASPLOS09/asplos094-cameron.tex @ 225

Last change on this file since 225 was 225, checked in by cameron, 11 years ago

Green citation; adjust figs 1, 2; plain bibstyle

File size: 50.4 KB
3% asplos094-cameron.tex
4% Robert D. Cameron and Dan Lin
6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
9\input epsf
20\conferenceinfo{ASPLOS'09,} {March 7--11, 2009, Washington, DC, USA.}
25\titlebanner{banner above paper title}        % These are ignored unless
26\preprintfooter{short description of paper}   % 'preprint' option specified.
28\title{Architectural Support for SWAR Text Processing with Parallel Bit
29Streams: The Inductive Doubling Principle}
30%\subtitle{Subtitle Text, if any}
32\authorinfo{Robert D. Cameron \and Dan Lin}
33           {School of Computing Science, Simon Fraser University}
34           {\{cameron, lindanl\}}
40Parallel bit stream algorithms exploit the SWAR (SIMD within a
41register) capabilities of commodity
42processors in high-performance text processing applications such as
43UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
44matching.   Direct architectural support for these algorithms in future SIMD
45instruction sets could further increase performance as well as simplifying the
46programming task. A set of simple SWAR instruction set extensions are proposed
47for this purpose based on the principle of systematic support for  inductive
48doubling as an algorithmic technique.  These extensions are shown to
49significantly reduce instruction count in core parallel bit stream algorithms,
50often providing a 3X or better improvement.  The extensions are also shown to be useful
51for SIMD programming in other application areas, including providing a
52systematic treatment for horizontal operations.  An implementation model for
53these extensions
54involves relatively simple circuitry added to the operand fetch components
55in a pipelined processor.
61term1, term2
64keyword1, keyword2
70In the landscape of parallel computing research, finding ways to
71exploit intrachip (multicore) and intraregister (SWAR) parallelism
72for text processing and other non-numeric applications is particularly
73challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
74study \cite{Landscape} identifies the finite-state machine algorithms associated
75with text processing to be the hardest of the thirteen ``dwarves''
76to parallelize, concluding that nothing seems to help.  Indeed,
77the study even speculates that applications in this area may simply be
78``embarrasingly sequential,'' easy to tackle for traditional
79sequential processing approaches suitable for uniprocessors,
80but perhaps fundamentally unsuited to parallel methods.
82One approach that shows some promise, however, is the
83method of parallel bit streams, recently applied to
84UTF-8 to UTF-16 transcoding \cite{PPoPP08}, XML parsing \cite{CASCON08}
85and amino acid sequencing\cite{Green}.
86In this method, byte-oriented character data is first transposed to eight
87parallel bit streams, one for each bit position within the character code
88units (bytes).  Loading bit stream data into 128-bit SIMD registers,
89then, allows data from 128 consecutive code units to be represented and
90processed at once.  Bitwise logic and shift operations, bit scans,
91population counts and other bit-based operations are then used to carry
92out the work.
94The cited study of UTF-8 to UTF-16 transcoding reports a 3X to 25X
95speed-up in using parallel bit stream techniques on SIMD-capable uniprocessors
96employing the SSE or Altivec instruction sets.  A full implementation
97for each of these platforms is documented using literate programming techniques
98and available as an open source code base \cite{u8u16}.
99Although this transcoding task represents but a single text processing kernel,
100it is widely cited as a critical bottleneck in XML parsing, accounting for 30\% or more
101of processing time\cite{NicolaJohn03, Perkins05, Psaila06}.  The example is also interesting in that
102it illustrates a text processing task that can largely
103be carried out using SIMD operations even though
104UTF-8 character sequences are variable length. Indeed, one weakness of the
105actual implementation is that it reverts to byte-at-a-time processing
106at block boundaries, employing a block shortening strategy to
107reduce block lengths to as low as 125 bytes from 128.
108With more capable SIMD instruction set architectures providing
109better facilities for multiregister shifts and larger register
110files, a solution employing SIMD techniques for virtually all
111of the main work would maintain better data alignment, avoid
112problems of register pressure and be easier to parallelize across
113multiple cores.  It should also naturally scale to 256-bit SIMD technology
114such as the expected AVX technology of Intel.
116The report addressing the broader problem of XML parsing is perhaps more
117compelling, demonstrating the utility of parallel bit stream techniques
118in delivering performance benefits through a significant portion of the
119web technology stack.  In an XML statistics gathering application,
120including the implementation of XML well-formedness checking, an
121overall 2.5X to 6X performance improvement is reported in using
122the parallel bit stream methods in comparison with a similarly
123coded application using such well known parsers as Expat and Xerces.
124Although still a work in progress, the parser is claimed to have functional
125coverage of XML and related specifications comparable to, but somewhat
126beyond Expat.   The study also identifies promising further
127work in extending the parallel bit stream methods to parallel
128hash value computation and parallel regular expression matching
129for the purpose of validating XML datatype declarations in
130accord with XML Schema.
132Given these promising initial results in the application of
133parallel bit stream methods, what role might architectural support play in
134further enhancing this route to parallelization of text processing?
135This paper addresses this question through presentation and
136analysis of a constructive proposal: a set of SIMD instruction set
137features based on the principle of systematic support for
138inductive doubling algorithms.  Inductive doubling refers
139to a general property of certain kinds of algorithm that
140systematically double the values of field widths or other
141data attributes with each iteration.  In essence, the goal
142of the proposed features is to support such algorithms
143with specific facilities to  transition between successive power-of-2 field
144widths.   These transitions are quite frequent in several critical
145algorithms for parallel bit streams.   These transitions also
146occur in other applications as well.   For example, efficient
147automatic interleaving based on power-of-2 field widths has been
148reported as generally useful for a number of SIMD kernels \cite{Nuzman}.
150The remainder of this paper is organized as follows.
152The second section of this paper introduces the proposed
153inductive doubling instruction set architecture and the
154SIMD notation used throughout this paper.  A brief comparison of
155this architecture with existing SIMD instruction
156sets of commodity processors such as the Intel SSE
157instruction set and the Power PC Altivec instruction set
158is also made.
160The third section provides a short first example of
161the inductive doubling principle in action through
162the case of population count.   Although this operation
163is not a strong determinant of performance for parallel bit
164stream applications, it is nevertheless an operation needed
165frequently enough in the general computing milieux to find
166its way into some instruction set architectures, typically
167at one particular field width.  By way of comparison, the
168inductive doubling architecture sacrifices some
169performance at the chosen field width, while offering a more
170general solution with frequently better performance at
171other field widths.
173The fourth section then moves on to consider the performance-critical
174and key task of conversion between serial byte streams and parallel
175bit streams.  A first algorithm that uses the existing SIMD
176operations common to SSE and Altivec is shown, requiring 72
177operations to transform 128 bytes of data using the three-register
178instruction form.  We then move on to consider how the task may
179be simplified using the inductive doubling architecture to
180require a mere 24 operations.  As well as providing a 3X speed-up,
181it is also argued that the version using the inductive doubling
182architecture is considerably simpler and easier to program.
184The fifth section then briefly considers the inverse transposition
185process, converting parallel bit stream data back into byte
186streams.  Again, an algorithm to carry out this task requires
18772 straight-line SIMD operations in the Altivec three-register
188form, but is reduced to a simpler 24 operations using the
189inductive doubling architecture.
191The sixth section then goes on to consider the problem of
192parallel bit deletion.  This operation is performance-critical
193to any applications that require filtering or
194editing operations on strings using the parallel bit stream
195algorithms.   For example, it is fundamental to the
196high-speed UTF-8 to UTF-16 transcoding algorithm that is
197often a critical component in XML parsing.  In this
198section, an inductive doubling algorithm based on the
199concept of a central deletion result is described and
200shown to have much better performance than a parallel-prefix
204The seventh section considers the issue of
205horizontal SIMD operations, that is, operations for combining
206sets of adjacent fields within individual SIMD registers rather than
207corresponding fields within sets of registers.  While existing
208SIMD instruction set architectures tend to only support a few
209ad hoc horizontal combinations, the inductive doubling
210architecture is shown to provide a systematic means for
211efficient horizontal combinations of any kind.
213Implementation issues are briefly considered in section 8 of the paper,
214while the ninth section concludes the paper with a summary of results
215and discussion of areas for future work.
218\section{Inductive Doubling Architecture}
220This section presents an idealized model for a single-instruction
221multiple-data (SIMD) instruction set architecture designed
222specifically to support inductive doubling algorithms in the
223domain of parallel bit stream programming.   The architecture is idealized
224in the sense that we concentrate on only the necessary features
225for our purpose, without enumerating the additional
226operations that would be required for
227SIMD applications in other domains.  The goal is to focus
228on the principles of inductive doubling support in a way
229that can accommodate a variety of realizations as other
230design constraints are brought to bear on the overall instruction set
231design.  First we introduce a simple model and notation for
232SIMD operations in general and then present the four key
233features of an idealized architecture in support of parallel
234bit stream programming.
236The idealized architecture supports typical SIMD integer
237operations common to existing commodity architectures such as SSE
238and Altivec.   The architecture is defined to support SIMD
239operations on registers of size $N=2^K$ bits, for some integer $K$.
240Typical values of $K$ for commodity processors include $K=6$ for
241the 64-bit registers of Intel MMX and Sun VIS technology, $K=7$ for
242the 128-bit registers of SSE and Altivec technology and $K=8$ for
243the upcoming Intel AVX technology.   The registers may be
244partitioned into $N/n$ fields of width $n=2^k$ bits for some values
245of $k \leq K$.   Typical values of $k$ used on commodity processors
246include $k = 3$ for SIMD operations on 8-bit fields (bytes),
247$k = 4$ for operations on 16-bit fields and $k = 5$ for operations
248on 32-bit fields.  Whenever a register $r$  is partitioned into $n$-bit
249fields, the fields are indexed $r_n[0]$ through $r_n[N/n-1]$
250Field $r_n[i]$ consists of bits $i \times n$ through $(i+1) \times n -1$ of
251register $r$, using big-endian numbering.
253Let \verb:simd<n>: represent the class of SIMD operations defined
254on fields of size $n$ using C++ template syntax.  Given a
255binary function $F_n$ on $n$-bit fields, we denote the SIMD
256version of this operation as \verb#simd<n>::F#.  Given two
257SIMD registers \verb:a: and \verb:b: holding values $a$ and $b$,
258respectively, the operation \verb#r=simd<n>::F(a,b)# stores
259the value $r$ in the register \verb:r: as determined by
260the simultaneous calculation of individual field values in
261accord with the following equation.
262\[ r_i = F_n(a_i, b_i) \]
264For example, addition(\verb:add:), subtraction (\verb:sub:) and shift left
265logical (\verb:sll:) may be defined as binary functions on $n$-bit unsigned
266integers as follows.
269\mbox{add}_n(a,b) & = & (a+b) \bmod 2^n \\
270\mbox{sub}_n(a,b) & = & (a-b+2^n) \bmod 2^n \\
271\mbox{sll}_n(a,b) & = & a \times 2^{b \bmod n} \bmod 2^n
274The SSE and Altivec instruction sets support
275each of the addition and subtraction operations in SIMD form
276for 8, 16 and 32-bit fields, while the SSE set also includes
277the 64-bit forms.  For example, \verb#simd<8>::add# in our
278nomenclature is provided by the operation \verb:paddb:
279on SSE and the operation \verb:vaddubm: on Altivec.
280The equivalents of \verb#simd<8>::sll#, \verb#simd<16>::sll#, and
281\verb#simd<32>::sll# are avilable on Altivec; the SSE facilities are
282more constrained, requiring that all field shifts by the same amount.
284Given these definitions and notation, we now present
285the four key elements of an inductive doubling architecture.
286The first is a definition of a core set of binary functions
287on $n$-bit fields for all field widths $n=2^k$ for $0 \leq k \leq K$.
288The second is a set of {\em half-operand modifiers} that allow
289the inductive processing of fields of size $2n$ in terms of
290combinations of $n$-bit values selected from the fields.
291The third is the definition of packing operations that compress
292two consecutive registers of $n$-bit values into a single
293register of $n/2$-bit values.  The fourth is the definition
294of merging operations that produce a set of $2n$ bit fields
295by concatenating corresponding $n$-bit fields from two
296parallel registers.  Each of these features is described below.
298For the purpose of direct and efficient support for inductive
299doubling algorithms on bit streams, the provision of
300a core set of operations at field widths of 2 and 4 as
301well as the more traditional field witdhs of 8, 16 and 32
302is critical for elegant and efficient expression of the
303algorithms.  In essence, inductive doubling algorithms
304work by establishing some base property at either single
305or 2-bit fields.  Each iteration of the algorithm then
306goes on to establish the property for the power-of-2
307field width.  In order for this inductive step to be
308most conveniently and efficiently expressed, the
309core operations needed for the step should be available
310at each field width.  In the case of work with parallel
311bit streams, the operations \verb:add:, \verb:sub:,
312\verb:sll:, \verb:srl: (shift right logical), and \verb:rotl:
313(rotate left) comprise the core.  In other domains,
314additional operations may be usefully included in the
315core depending on the work that needs to be performed
316at each inductive doubling level.
318Note that the definition of field widths $n=2^k$ for $0 \leq k \leq K$
319also includes fields of width 1.  These are included for
320logical consistency, but are easily implemented by mapping
321directly to appropriate bitwise logic operations, which
322we assume are also available.  For example,
323\verb#simd<1>::add# is equivalent to \verb:simd_xor:, the
324bitwise exclusive-or operation on SIMD registers.
326The second key facility of the inductive doubling architecture
327is the potential application of half-operand modifiers to
328the fields of either or both of the operands of a SIMD
329operation.  These modifiers select either the
330low $n/2$
331bits of each $n$-bit field (modifier ``\verb:l:'') or the
332high $n/2$ bits (modifier ``\verb:h:'').  When required,
333the modifier ``\verb:x:'' means that the full $n$ bits
334should be used, unmodified.  The semantics of these
335modifiers are given by the following equations.
338l(r_n) & = & r_n \bmod 2^{n/2} \\
339h(r_n) & = & r_n / 2^{n/2} \\
340x(r_n) & = & r_n
343In our notation, the half-operand modifiers are
344specified as optional template (compile-time) parameters
345for each of the binary functions.  Thus,
346\verb#simd<4>::add<l,h>(a,b)# is an operation which adds
347the 2-bit quantity found in the low 2-bits of each 4-bit field
348of its first operand (\verb:a:)
349together with the corresponding 2-bit quantity found in the
350high 2-bits of its second operand (\verb:b:).
351In general, the purpose of the half-operand modifiers
352in support of inductive doubling is to allow the processing
353of $n$-bit fields to easily expressed in terms of
354combination of the results determined by processing
355$n/2$ bit fields.
357The third facility of the inductive doubling architecture
358is a set of pack operations at each field width $n=2^k$ for $1 \leq k \leq K$.
359The field values of \verb#r=simd<n>::pack(a,b)# are
360defined by the following equations.
363r_{n/2}[i] & = & \mbox{conv}(a_n[i], n/2), \textrm{for } i < N/n \\
364r_{n/2}[i] & = & \mbox{conv}(b_n[i - N/n], n/2), \textrm{for } i \geq N/n
367Here conv is a function which performs conversion of an $n$-bit
368value to an $n/2$ bit value by signed saturation (although
369conversion by unsigned saturation would also suit our purpose).
371Half-operand modifiers may also be used with the pack
372operations.  Thus packing with conversion by masking off all
373but the low $n/2$ bits of each field may be
374be performed using the operation \verb#simd<n>::pack<l,l>#
377The final facility of the inductive doubling architecture is
378a set of merging operations \verb#simd<n>::mergeh# and
379\verb#simd<n>::mergel# that produce $2n$-bit fields
380by concatenating corresponding $n$-bit fields from the
381operand registers.   The respective
382operations \verb#r=simd<n>::mergeh(a,b)# and
384are defined by the following equations.
387r_{2n}[i] & = & a[i] \times 2^n + b[i] \\
388s_{2n}[i] & = & a[i+N/(2n)] \times 2^n + b[i+N/(2n)]
391Both SSE and Altivec provide versions of pack and merge operations
392for certain field widths.  The pack operations are provided
393with operands having 16-bit or 32-bit fields on each platform, although
394with some variation in how conversion is carried out.
395The merge operations are provided at 8-bit, 16-bit and 32-bit
396field widths on both architectures and also at the 64-bit level
397on SSE.
399This completes the description of the proposed inductive
400doubling architecture.  As described, many of the features
401are already available with the SIMD facilities of
402existing commodity processors.  The extensions enumerated
403here are relatively straightforward.  The innovation
404is to specifically tackle the design of facilities to
405offer systematic support for transitions between power-of-2 field widths.
406As we shall show in the remainder of this paper, these facilities
407can dramatically reduce instruction count in core parallel bit
408stream algorithms, with a factor of 3 reduction being typical.
409The next section goes on to illustrate a first simple example
410for the task of population count.
412\section{Population Count}
417c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
418c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
419c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
420c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
421c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
424\caption{Population Count Reference Algorithm}
431c = simd<2>::add<l,h>(x, x);
432c = simd<4>::add<l,h>(c, c);
433c = simd<8>::add<l,h>(c, c);
434c = simd<16>::add<l,h>(c, c);
435c = simd<32>::add<l,h>(c, c);
438\caption{Ideal SIMD Population Count}
442As an initial example to illustrate the principle of inductive doubling
443in practice, consider the problem of {\em population count}: determining
444the number of one bits within a particular bit field.  It is important
445enough for such operations as
446calculating Hamming distance to be included as a built-in instruction
447on some processors.  For example, the SPU of the Cell Broadband Engine
448has a SIMD population count instruction \verb:si_cntb: for simultaneously
449determining the
450number of 1 bits within each byte of a 16-byte register.
451In text processing with parallel bit streams, population count has direct
452application to keeping track of line numbers for error reporting, for example.
453Given a bit block identifying the positions of newline characters within
454a block of characters being processed, the population count of the
455bit block can be used to efficiently and conveniently be used to update
456the line number upon completion of block processing.
458Figure \ref{HD-pop} presents a traditional divide-and-conquer
459implementation for a 32-bit integer {\tt x} adapted from
460Warren \cite{HackersDelight}, while
461Figure \ref{inductivepopcount} shows the corresponding SWAR
462implementation for a vector of 32-bit fields using the inductive doubling
463instruction set architecture.  Each implementation employs
464five steps of inductive doubling to produce population counts
465within 32 bit fields.  The traditional implementation employs
466explicit masking and shifting operations, while these
467operations are implicit within the semantics of the inductive
468doubling instructions used in Figure \ref{inductivepopcount}.
469In each implementation, the first step determines the
470the population counts within 2-bit fields
471by adding the high bit of each such field to the low bit
472to produce a set of 2-bit counts in {\tt c}.
473In the second step, the counts within 4-bit fields of {\tt c} are determined
474by adding the counts of the corresponding high and low 2-bit subfields.
475Continuing in this fashion,
476the final population counts within 32-bit fields are determined in five steps.
478With the substitution of longer mask constants replicated for four
47932-bit fields, the implementation of Figure \ref{HD-pop} can be
480easily adapted to SWAR processing using 128-bit registers.
481Such an implementation requires 10 operations to load or generate
482mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
483total of 30 operations to complete the task in comparison
484to a mere   However, Warren further refines the
485implementation to an optimized version requiring only 20 operations,
4865 of which are required to generate mask constants.  At the cost
487of register pressure, it is possible that these constants
488could be kept preloaded.  In any case, the IDISA implementation
489offers a 3X to 4X improvement in instruction count as well as
490a reduction in register pressure.
495\section{Transposition to Parallel Bit Streams}
497In this section, we consider the first major
498application of the inductive doubling architecture:
499transposition of byte stream data to parallel bit stream
500form.   Of course, this operation is critical to the
501method of parallel bit streams and all applications
502of the method can benefit from a highly efficient
503transposition process.  Before considering how
504the inductive doubling architecture supports this
505transposition process, however, we first consider
506algorithms on existing architectures.  Two algorithms
507are presented; the best of these requires 72
508SIMD operations in the three-register model to perform
509transposition of eight serial registers of byte stream data into
510eight parallel registers of bit stream data.
512We then go on to show how the transposition problem
513can be solved using the inductive doubling architecture
514with a mere 24 three-register SIMD operations.  We also prove
515that this is optimal for any three-register instruction set model.
520\includegraphics[width=120mm, trim= 0 90 0 50]{S2P_IO.pdf}
521\caption{Input/Output Model for Serial to Parallel Transposition}
526Figure \ref{s2p-spec} illustrates the input-output requirements of
527the transposition problem.  We assume that inputs and
528outputs are each SIMD registers of size $N=2^K$ bits.
529The input consists of $N$ bytes of serial byte data,
530stored consecutively in eight SIMD registers each holding
531$N/8$ bytes.  The output consists of eight parallel
532registers, one each for the eight individual bit positions
533within a byte.  Upon completion of the transposition process,
534each output register is to hold the $N$ bits corresponding
535to the selected bit position in the sequence of $N$ input
538\subsection{Bit Gathering Algorithm}
542\includegraphics[width=150mm, trim= 0 150 0 50]{S2P.pdf}
543\caption{Serial to Parallel Transposition Using Bit-Gathering}
547One straightforward algorithm for implementing the transposition process
548takes advantage of SIMD bit gathering operations that exist
549on some architectures.  This operation gathers one bit per byte
550from a particular position within each byte of a SIMD register.
551For example, the {\tt pmovmskb} operation of the Intel MMX and
552SSE instruction sets forms an 8-bit (MMX) or 16-bit (SSE) mask
553consisting of the high bit of each byte.  Similarly, the
554{\tt \verb:si_gbb:} operation of the synergistic processing units of the
555Cell Broadband Engine gathers together the low bit of each byte
556from the SIMD register.  Figure \ref{gather} illustrates the
557bit gathering process.
559For each bit stream of output, the bit gather algorithm requires
560one gather operation for each of the 8 input registers,
561giving 64 bit gather operations in all.  In addition, for seven
562of the eight bit positions, it is necessary to shift the bits
563of each input register into the conventional position for
564gathering.  A total of 56 shift operations are required for this
565task.  Finally, the result of each bit gather operation must
566be properly inserted into the output stream.  If the first
567gather operation for a stream can be used to also initialize
568the output register, there will then need to be 7 insert
569operations for the results of the remaining gather operations
570for this stream, making 56 insert operations in all.
571The insert step may be more complex than a single operation
572in some cases, but we use one operation per insert as a lower bound.
573Thus, the bit gather algorithm requires
574at least 176 operations to perform the transposition task.
576\subsection{BytePack Algorithm}
578A much more efficient transposition algorithm on commodity
579SIMD architectures (SSE and Altivec) involves three
580stages of binary division transformation.  This is similar
581to the three stage bit matrix inversion described by
582Warren  \cite{HackersDelight}, although modified to use SIMD operations.
583In each stage, input streams are divided into two half-length output streams.
584The first stage separates the bits at even numbered positions from those
585at odd number positions.  The two output streams from the first
586stage are then further divided in the second stage.
587The stream comprising even numbered bits from the original byte stream
588divides into one stream consisting of bits from positions 0 and 4 of each
589byte in the original stream and a second stream consisting of bits
590from positions 2 and 6 of each original byte.  The stream of bits from
591odd positions is similarly divided into streams for bits from Each of the
592positions 1 and 5 and bits from positions 2 and 6.
593Finally, each of the four streams resulting from the second stage are
594divided into the desired individual bit streams in the third stage.
599t0 = simd<16>::pack<h,h>(s0, s1);
600t1 = simd<16>::pack<l,l>(s0, s1);
601p0 = simd_if(simd<8>::const(0xC0C0), t0, simd::<16>srli<1>(t1));
602p1 = simd_if(simd<8>::const(0xC0C0), simd::<16>slli<1>(t0), t1);
605\caption{Basic Stage 1 Transposition Step in the BytePack Algorithm}
609The binary division transformations are accomplished in each stage
610using byte packing, shifting and masking.  In each stage, a
611transposition step combines each pair of serial input registers to
612produce a pair of parallel output registers.  In essence,
613doublebytes from the input registers are packed into bytes
614in the output registers, with the bits from even positions stored
615in the bytes of one output stream and the bits from odd positions
616stored in the bytes of the second output stream.
617Figure \ref{s2pstep} shows a step in stage 1 of the algorithm
618producing two parallel registers \verb:p0: and \verb:p1: from
619two serial registers \verb:s0: and \verb:s1:.  This step is applied
620four times in stage 1; stages 2 and 3 also consist of four applications
621of a similar step with different shift and masking constants.
623Although we have used the idealized SIMD notation here, each of the
624operations maps to a single operation in the Altivec set and a small number
625of operations in the SSE set.  Using the Altivec set, there are
6266 operations per step for a total of 24 operations per stage. 
627The three stages combined required 72 operations to transpose 128 bytes
628to parallel bit stream form.  This is the best algorithm known to
629us for existing SIMD architectures.
631\subsection{Inductive Halving Algorithm}
633Using the inductive doubling architecture, it is possible to design
634a transposition algorithm that is both easier to understand and requires
635many fewer operations than the the bytepack algorithm described above.
636We call it the inductive halving algorithm for serial to parallel
637transposition, because it proceeds by reducing byte streams to
638two sets of nybble streams in a first stage, dividing the nybble
639streams into streams of bitpairs in a second stage and finally
640dividing the bitpair streams into bit streams in the third stage.
645p0 = simd<8>::pack<h,h>(s0, s1);
646p1 = simd<8>::pack<l,l>(s0, s1);
648\caption{Basic Stage 1 Transposition Step in the Inductive Halving Algorithm}
652Figure \ref{halvingstep} shows one step in stage 1 of the inductive
653halving algorithm, comprising just two SIMD operations.
654The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
655from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
656the low nybble of each byte.  As in the bytepack algorithm, this step is
657applied 4 times in stage 1, for a total of 8 operations.
659Stage 2 of the inductive halving algorithm reduces nybble streams
660to streams of bit pairs.  The basic step in this algorithm consists
661of one \verb#simd<4>::pack<h,h># operation to extract the high pair
662of each nybble and one \verb#simd<4>::pack<l,l># operation to extract the
663low pair of each nybble.  Four applications of this step complete stage 2.
665Stage 3 similarly uses four applications of a step that uses a
666\verb#simd<2>::pack<h,h># operation to extract the high bit of
667each pair and a \verb#simd<2>::pack<l,l># to extract the low bit of
668each pair.  The complete algorithm to transform eight serial
669byte registers s0 through s7 into the eight parallel bit stream
670registers bit0 through bit7 is shown in Figure \ref{halvingalgorithm}.
674hi_nybble0 = simd<8>::pack<h,h>(s0, s1);
675lo_nybble0 = simd<8>::pack<l,l>(s0, s1);
676hi_nybble1 = simd<8>::pack<h,h>(s2, s3);
677lo_nybble1 = simd<8>::pack<l,l>(s2, s3);
678hi_nybble2 = simd<8>::pack<h,h>(s4, s5);
679lo_nybble2 = simd<8>::pack<l,l>(s4, s5);
680hi_nybble3 = simd<8>::pack<h,h>(s6, s7);
681lo_nybble3 = simd<8>::pack<l,l>(s6, s7);
682hh_pair0 = simd<4>::pack<h,h>(hi_nybble0, hi_nybble1);
683hl_pair0 = simd<4>::pack<l,l>(hi_nybble0, hi_nybble1);
684lh_pair0 = simd<4>::pack<h,h>(lo_nybble0, lo_nybble1);
685ll_pair0 = simd<4>::pack<l,l>(lo_nybble0, lo_nybble1);
686hh_pair1 = simd<4>::pack<h,h>(hi_nybble2, hi_nybble3);
687hl_pair1 = simd<4>::pack<l,l>(hi_nybble2, hi_nybble3);
688lh_pair1 = simd<4>::pack<h,h>(lo_nybble2, lo_nybble3);
689ll_pair1 = simd<4>::pack<l,l>(lo_nybble2, lo_nybble3);
690bit0 = simd<2>::pack<h,h>(hh_pair0, hh_pair1);
691bit1 = simd<2>::pack<l,l>(hh_pair0, hh_pair1);
692bit2 = simd<2>::pack<h,h>(hl_pair0, hl_pair1);
693bit3 = simd<2>::pack<l,l>(hl_pair0, hl_pair1);
694bit4 = simd<2>::pack<h,h>(lh_pair0, lh_pair1);
695bit5 = simd<2>::pack<l,l>(lh_pair0, lh_pair1);
696bit6 = simd<2>::pack<h,h>(ll_pair0, ll_pair1);
697bit7 = simd<2>::pack<l,l>(ll_pair0, ll_pair1);
699\caption{Complete Inductive Halving Algorithm}
703\subsection{Optimality of the Inductive Halving Algorithm}
705Here we show that the inductive halving algorithm presented in
706the previous subsection is optimal in the following sense:
707no other algorithm on any 3-register SIMD architecture can use
708fewer than 24 operations to transform eight serial registers
709of byte stream data into eight parallel registers of bit stream data.
710By 3-register SIMD architecture, we refer to any architecture
711that uses SIMD instructions consistent with our overall model of
712binary operations using two input register operands to produce
713one output register value.
715Observe that the $N$ data bits from each input register must be
716distributed $N/8$ each to the 8 output registers by virtue of
717the problem definition.  Each output register can effectively
718be given a 3-bit address; the partitioning problem can be viewed
719as moving data to the correct address.  However, each
720operation can move results into at most one register. 
721At most this can result in the assignment of one correct address
722bit for each of the $N$ input bits.  As all $8N$ input bits
723need to be moved to a register with a correct 3-bit address,
724a minimum of 24 operations is required.
726\section{Parallel to Serial Conversion}
728Parallel bit stream applications may apply string editing
729operations in bit space to substitute, delete or insert
730parallel sets of bits at particular positions.  In such cases,
731the inverse transform that converts a set of parallel bit
732streams back into byte space is needed.  In the example of
733UTF-8 to UTF-16 transcoding, the inverse transform is
734actually used twice for each application of the forward
735transform, to separately compute the high and low byte
736streams of each UTF-16 code unit.  Those two byte streams
737are subsequentely merged to form the final result.
739Algorithms for performing the inverse transform mirror those
740of the forward transform, employing SIMD merge operations
741in place of pack operations.  The best algorithm known
742to us on the commodity SIMD architectures takes advantage
743of versions of the \verb#simd<8>::mergeh# and \verb#simd<8>::mergel#
744operations that are available with each of the SSE and Altivec instruction
745sets.  These algorithms take 72 operations to perform the
746inverse transposition of 8 parallel registers of bit stream
747data into 8 serial registers of byte stream data. 
752bit01_r0 = simd<1>::mergeh(bit0, bit1);
753bit01_r1 = simd<1>::mergel(bit0, bit1);
754bit23_r0 = simd<1>::mergeh(bit2, bit3);
755bit23_r1 = simd<1>::mergel(bit2, bit3);
756bit45_r0 = simd<1>::mergeh(bit4, bit5);
757bit45_r1 = simd<1>::mergel(bit4, bit5);
758bit67_r0 = simd<1>::mergeh(bit6, bit7);
759bit67_r1 = simd<1>::mergel(bit6, bit7);
760bit0123_r0 = simd<2>::mergeh(bit01_r0, bit23_r0);
761bit0123_r1 = simd<2>::mergel(bit01_r0, bit23_r0);
762bit0123_r2 = simd<2>::mergeh(bit01_r1, bit23_r1);
763bit0123_r3 = simd<2>::mergel(bit01_r1, bit23_r1);
764bit4567_r0 = simd<2>::mergeh(bit45_r0, bit67_r0);
765bit4567_r1 = simd<2>::mergel(bit45_r0, bit67_r0);
766bit4567_r2 = simd<2>::mergeh(bit45_r1, bit67_r1);
767bit4567_r3 = simd<2>::mergel(bit45_r1, bit67_r1);
768s0 = simd<4>::mergeh(bit0123_r0, bit4567_r0);
769s1 = simd<4>::mergel(bit0123_r0, bit4567_r0);
770s2 = simd<4>::mergeh(bit0123_r1, bit4567_r1);
771s3 = simd<4>::mergel(bit0123_r1, bit4567_r1);
772s4 = simd<4>::mergeh(bit0123_r2, bit4567_r2);
773s5 = simd<4>::mergel(bit0123_r2, bit4567_r2);
774s6 = simd<4>::mergeh(bit0123_r3, bit4567_r3);
775s7 = simd<4>::mergel(bit0123_r3, bit4567_r3);
779\caption{Parallel to Serial Transposition by Inductive Doubling}
782An algorithm employing only 24 operations using the
783inductive doubling instruction set architecture is relatively
784straightforward.. In stage 1, parallel registers for individual bit streams
785are first merged with bit-level interleaving
786using \verb#simd<1>::mergeh# and \verb#simd<8>::mergel#
787operations.  For each of the four pairs of consecutive
788even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
789two consecutive registers of bitpair data are produced.
790In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
791are then applied to merge to bitpair streams to produce streams
792of nybbles for the high and low nybble of each byte.  Finally,
793the nybble streams are merged in stage 3 to produce the
794desired byte stream data.  The full inductive doubling
795algorithm for parallel to serial transposition is shown in Figure
798This algorithm is again optimal, requiring the fewest operations
799of any possible algorithm using any 3-register instruction set
800model.  The proof directly follows that for the serial to parallel
803The existence of high-performance algorithms for transformation of
804character data between byte stream and parallel bit stream form
805in both directions makes it possible to consider applying these
806transformations multiple times during text processing applications.
807Just as the time domain and frequency domain each have their
808use in signal processing applications, the byte stream form and
809parallel bit stream form can then each be used at will in
810character stream applications.
816\section{Parallel Bit Deletion}
818Parallel bit deletion is an important operation that allows string
819editing operations to be carried out while in parallel bit stream
820form.  It is also fundamental to UTF-8 to UTF-16 transcoding
821using parallel bit streams, allowing the excess code unit
822positions for UTF-8 two-, three- and four-byte sequences to
823be deleted once the sixteen parallel bit streams of UTF-16 have
824been computed \cite{PPoPP08}.
826Parallel bit deletion is specified using a deletion mask.
827A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
828to be deleted and 0s at positions identifying bits to be retained.
829For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
830bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
831accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
833Bit deletion may be performed using
834the parallel-prefix compress algorithm documented by
835Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
836only logic and shifts with a constant parameter to carry
837out the deletion process.  However, it requires $k^2$
838preprocessing steps for a final field width parameter
839of size $2^k$, as well as 4 operations per deletion step
840per stream.  Using the inductive doubling instruction set architecture
841it is possible to carry out bit deletion much more efficiently.
843Deletion within fixed size fields or registers may produce results that are either
844left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
845eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
846care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
847left-justified result produces an important intermediate form known as a
848{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
849right-justified and left-justified results from the application of the
8504-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
851stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
852the central result \verb:xxbdefhx:, which may easily be converted to a either a
853left or a right justified 8-element result by an appropriate shift operation.
859\verb:delmask: & \verb:1001: & \verb:1100: & \verb:0100: & \verb:1111: & \verb:0111: & \verb:0010: & \verb:0011: & \verb:0010:  \\ \hline
860\verb:bits: & \verb:0bc0: & \verb:00gh: & \verb:i0kl: & \verb:0000: & \verb:q000: & \verb:uv0x: & \verb:yz00: & \verb:CD0F:  \\ \hline
861\verb:rslt_8: & \multicolumn{2}{c|}{\tt 00bcgh00} & \multicolumn{2}{c|}{\tt 0ikl0000} & \multicolumn{2}{c|}{\tt 000quvx0} & \multicolumn{2}{c|}{\tt 00yzCDF0} \\ \hline
862\verb:cts_4: & 2 & 2 & 1 & 4 & 3 & 1 & 2 & 1  \\ \hline
863\verb:rj: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{XX} \\ \hline
864\verb:lj: & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{XX} & \multicolumn{2}{c|}{2} \\ \hline
865\verb:rot_8: & \multicolumn{2}{c|}{6} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{7} & \multicolumn{2}{c|}{2} \\ \hline
866\verb:rslt_16: & \multicolumn{4}{c|}{\tt 0000bcghikl00000} & \multicolumn{4}{c|}{\tt 0000quvxyzCDF000} \\ \hline
870\caption{Example $8 \rightarrow 16$ Step in Deletion by Central Result Induction}
873The observation about how two $n$-bit central deletion results can
874combine to yield a $2n$ central deletion result provides the basis
875for an inductive doubling algorithm.  Figure \ref{centraldelstep}
876illustrates the inductive process for the transition from 8-bit central
877deletion results to 16-bit central deletion results.  The top row shows
878the original deletion mask, while the second row shows the original
879bit stream to which deletions are to be applied, with deleted bits zeroed out.
880The third row shows the central result for each 8-bit field as the
881result of the previous inductive step.
883To perform the $8 \rightarrow 16$ central deletion step, we first form
884the population counts of 4-bit fields of the original deletion mask as
885shown in row 4 of Figure \ref{centraldelstep}. Note that in right-justifying
886an 8-bit central result, we perform a right shift by the population count
887of the low half of the field.  Similarly,
888left-justification requires a left-shift by the population count in the
889high half of the field.
891The left and right shifts can be performed simultaneously using a rotate
892left instruction.  Right justification by shifting an $n$ bit field
893$i$ positions to the right is equivalent to a left rotate of $n-i$
894positions.  These rotation amounts are computed by the operation \newline
895\verb#rj=simd<8>::sub<x,l>(simd<8>::const(8), cts_4)# as shown in row 5,
896except that don't care fields (which won't be subsequently used)
897are marked \verb:XX:. 
899The left shift amounts are calculated by \verb#lj=simd<8>::srli<4>(cts_4)#
900as shown in row 6, and are combined with the right shift amounts
901by the selection operation \newline \verb#rot_8=simd_if(simd<16>::const(0xFF00), rj, lj)#
902as shown in row 7.  Using these computed values, the inductive step
903is completed by application of the operation \newline \verb#rslt_16=simd<8>::rotl(rslt_8, rot_8)#
904as shown in row 8.
906At each inductive doubling level, it requires 4 operations to compute the
907required deletion infomation and one operation per bit stream to perform deletion.
908Note that, if deletion is to be applied to a set of eight parallel bit streams,
909the computed deletion information is used for each stream without recomputation,
910thus requiring 12 operations per inductive level.
912In comparison to the parallel-prefix compress method, the method of central
913deletion results using the inductive doubling architecture has far fewer operations.
914The total preprocessing cost is $4k$ for $k$ steps of deletion by central result
915induction versus $4k^2$ for the parallel-prefix method.  Using the computed
916deletion operation, only a single SIMD rotate operation per bit stream
917per level is needed, in comparison with 4 operations per level for parallel-prefix
921\section{Systematic Support for Horizontal SIMD Operations}
923In SIMD parlance, {\em horizontal} operations are
924operations which combine values from two or more fields
925of the same register, in contrast to the normal
926{\em vertical} operations which combine corresponding
927fields of different registers.  Horizontal operations
928can be found that combine two (e.g., haddpd on SSE3),
929four (e.g, \verb:si_orx: on SPU), eight (e.g, psadbw on SSE)
930or sixteen values (e.g., vcmpequb on Altivec).  Some
931horizontal operations have a vertical component as well.
932For example, psadbw first forms the absolute value of
933the difference of eight corresponding byte fields before
934performing horizontal add of the eight values, while
935vsum4ubs on Altivec performs horizontal add of sets of
936four unsigned 8-bit fields within one register
937and then combines the result horizontally with
938corresponding 32-bit fields of a second registers.
940The space of potential horizontal operations thus has
941many dimensions, including not only the particular
942combining operation and the operand field width, but
943also the number of fields being combined, whether a
944vertical combination is applied and whether it is applied
945before or after the horizontal operation and what the
946nature of the vertical combining operation is.
947Within this space, commodity SIMD architectures tend
948to support only a very few combinations, without any
949particular attempt at systematic support for horizontal
950operations in general.
952By making use of \verb:<l,h>: half-operand modifier
953combinations, the inductive doubling architecture
954offers systematic support for horizontal operations
955on pairs of adjacent fields.
956For example, \verb#simd<16>::add<l,h># adds values
957in adjacent 8 bit fields to produce 16 bit results,
958while \verb#simd<32>::min<l,h># can produce the
959minimum value of adjacent 16-bit fields.  In general,
960\newline \verb#simd<n>::F<l,h># denotes the horizontal
961binary combination of adjacent fields for any
962operator $F$ and field width $n$.
964Horizontal combinations of larger numbers of fields
965makes use of the inductive doubling property.
966For example, consider the or-across operation \verb:si_orx:
967of the SPU, that performs a logical or operation
968on four 32-bit fields.  This four field combination
969involves two steps in the inductive doubling approach.
972t = simd<64>::or<l,h>(x, x)
973t = simd<128>::or<l,h>(t, t)
976This example is also interesting in showing a potential
977value for supporting bitwise logical operations at
978different field widths, i.e., specifically for use with
979half-operand modifiers.
981Similarly, to combine any eight fields simply requires
982three inductive doubling steps using the desired
983operator at successive power-of-two field widths, while
984combining sixteen fields requires four such operations.
985In this way, the inductive doubling architecture provides
986systematic support for horizontal operations well beyond
987the existing facilities of commodity architectures,
988although lacking some of the special features found in
989some cases.
994We have constructed libraries that provide
995simulated implementation of the inductive doubling architecture
996on each of the MMX, SSE, Altivec, and SPU platforms and have
997used these libraries in the implementation of each of the
998parallel bit stream algorithms discussed herein.
999This implementation work has been successful in validating
1000the basic concepts underlying the inductive doubling instruction
1001set architecture.
1003Implementation of the architecture on chip is beyond the
1004scope of our present resources and capabilities.  However,
1005the principal requirements are implementation of the various
1006operations at all power-of-2 field widths and implementation
1007of half-operand modifiers.  Implementation of SIMD operations
1008at additional field widths involves design trade-offs
1009with respect to transistor counts, available opcode space,
1010and the potential value of the new operations to SIMD
1011programmers.  From the perspective of parallel bit
1012stream programming, the primary need is for SIMD integer,
1013shift, pack and merge operations at field widths of 2, 4
1014and 8, as well as the field width of 1, where it makes
1015sense (e.g. with merge operations).  In support of the
1016general concept of inductive doubling architecture,
1017SIMD operations at large field widths (64, 128) are also
1018called for, but these operations cannot be justified on
1019the basis of parallel bit stream programming.
1021Implementation of half-operand modifiers can logically
1022be carried out with additional circuitry attached to the
1023register fetch units of a pipelined processor.  This
1024circuitry would require control signals from the
1025instruction decode unit to identify the field widths
1026of operands and the particular half-operand modifier to be applied,
1027if any.  The additional logic required for instruction
1028decode and that required for operand modification
1029as part of the operand fetch process is expected to be
1030reasonably modest.
1032Full assessment of implementation issues is an important
1033area for future work.
1038This paper has considered the issue of architectural support for
1039SIMD text processing using the method of parallel bit streams and has
1040argued that this architectural support can best be provided
1041through a SIMD instruction set architecture that implements
1042features for direct support of inductive doubling algorithms.
1043Four key features of the inductive doubling architecture have
1044been identified include support for operations at all
1045power-of-2 field widths, half-operand modifiers and
1046pack and merge operations.  The principle innovation is the
1047notion of half-operand modifiers to support efficient
1048transition between successive power-of-two field widths.
1050Several algorithms key to parallel bit stream methods
1051have been examined and shown to benefit from dramatic
1052reductions in instruction count compared to the best
1053known algorithms on existing architecture.  In the case
1054of transposition algorithms to and from parallel bit stream
1055form, the architecture has been shown to make possible
1056straightforward inductive doubling algorithms with the
1057lowest total number of operations that can be achieved by any
1058possible 3-register SIMD architecture.
1060The inductive doubling architecture also has considerable
1061benefits beyond its role in supporting SIMD programming
1062with parallel bit streams.  Notable among these is
1063that the architecture provides a framework for systematic
1064support of horizontal SIMD operations.
1069%\section{Appendix Title}
1071%This is the text of the appendix, if you need one.
1075This research was supported by a Discovery Grant from the
1076Natural Sciences and Engineering Research Council of Canada.
1084%Smith, P. Q. reference text
Note: See TracBrowser for help on using the repository browser.