Index: /docs/ASPLOS09/asplos094cameron.tex
===================================================================
 /docs/ASPLOS09/asplos094cameron.tex (revision 244)
+++ /docs/ASPLOS09/asplos094cameron.tex (revision 245)
@@ 151,79 +151,20 @@
The remainder of this paper is organized as follows.

The second section of this paper introduces IDISA and the
SIMD notation used throughout this paper. A brief comparison of
IDISA features with existing SIMD instruction
sets of commodity processors such as the Intel SSE
instruction set and the Power PC Altivec instruction set
is also made.

The third section then discusses the evaluation methodology
for IDISA. Two reference architectures are briefly described as
well as criteria for evaluation IDISA against these architectures.

+SIMD notation used throughout this paper. The third
+section moves on to discuss an evaluation methodology
+for IDISA in comparison to two reference architectures
+motivated by the SSE and Altivec instruction sets.
The fourth section provides a short first example of
the inductive doubling principle in action through
the case of population count. Although this operation
is not a strong determinant of performance for parallel bit
stream applications, it is nevertheless an operation needed
frequently enough in the general computing milieux to find
its way into some instruction set architectures, typically
at one particular field width.
%By way of comparison, the
%inductive doubling architecture sacrifices some
%performance at the chosen field width, while offering a more
%general solution with frequently better performance at
%other field widths.

The fifth section then moves on to consider the performancecritical
and key task of conversion between serial byte streams and parallel
bit streams. A first algorithm that uses the existing SIMD
operations common to SSE and Altivec is shown, requiring 72
operations to transform 128 bytes of data using the threeregister
instruction form. We then move on to consider how the task may
be simplified using IDISA to
require a mere 24 operations. As well as providing a 3X speedup,
it is also argued that the version using the inductive doubling
architecture is considerably simpler and easier to program.

The sixth section then briefly considers the inverse transposition
process, converting parallel bit stream data back into byte
streams. Again, an algorithm to carry out this task requires
72 straightline SIMD operations in the Altivec threeregister
form, but is reduced to a simpler 24 operations using IDISA.

The seventh section then goes on to consider the problem of
parallel bit deletion. This operation is performancecritical
to any applications that require filtering or
editing operations on strings using the parallel bit stream
algorithms. For example, it is fundamental to the
highspeed UTF8 to UTF16 transcoding algorithm that is
often a critical component in XML parsing. In this
section, an inductive doubling algorithm based on the
concept of a central deletion result is described and
shown to have much better performance than a parallelprefix
alternative.

+the case of population count. Sections 5 through 7
+then address the application of IDISA to core
+algorithms in text processing with parallel bit
+streams.
The eighth section then considers the potential role of
IDISA in supporting applications beyond parallel bit streams.
Additional examples from several domains are presented.
Perhaps most importantly, however, the role of IDISA
in supporting a fully systematic treatment of {\em horizontal}
SWAR operations for any application area is discussed.

An implementation model for IDISA is then considered
in section 9 of the paper, focusing on a pipelined
SIMD architecture featuring a modified operand fetch stage.
A gatecount analysis of one feasible implementation is
provided as well as a discussion of the implementation
of required extensions to handle 2bit and 4bit fields not
commonly supported on existing SIMD architectures. Design
tradeoffs are also considered focusing the potential removal of
various {\em ad hoc} instructions on existing processors in favor of
more general alternatives provided through IDISA.

The tenth section concludes the paper with a summary of results
and discussion of areas for future work.
+Section 9 addresses IDISA implementation while
+Section 10 concludes the paper with a summary of
+results and directions for further work.
@@ 1357,13 +1298,18 @@
\section{Implementation}
We have carried implementation work for IDISA in three
ways. First, we have constructed libraries that
implement the IDISA instructions by template and/or macro
expansion for each of MMX, SSE, Altivec, and SPU platforms.
Second, we have developed a model implementation
involving a modified operand fetch component
of a pipelined SIMD processor. Third, we have written
and evaluated Verilog HDL description of this model
implementation.
+IDISA may be implemented as a software
+abstraction on top of existing SWAR architectures or
+directly in hardware. In this section, we briefly
+discuss implementation of IDISA libraries before
+moving on to consider hardware design. Although
+a full realization of IDISA in hardware is beyond our
+current capabilities, our goal is to develop a sufficiently
+detailed design to assess the costs of IDISA implementation
+in terms of the incremental complexity over the RefA
+and RefB architectures. The cost factors we consider, then,
+are the implementation of the halfoperand modifiers, and
+the extension of core operations to the 2bit, 4bit and
+128bit field widths. In each case, we also discuss
+design tradeoffs.
\subsection{IDISA Libraries}
@@ 1507,4 +1453,79 @@
or 2558 gates in total.
+The gatelevel complexity of halfoperand modifiers as described is nontrivial,
+but modest. However, one possible design tradeoff is to
+differentiate the two operands, permitting a high halfoperand
+modifier to be used only with the first operand and a lowmodifier
+to be used only with the second operand. This would exclude
+\verb## and \verb## modifier combinations and also
+certain combinations for noncommutative core operations.
+The principal
+consequence for the applications considered here would be with
+respect to the pack operations in forward transposition, but it
+may be possible to address this through SIEU circuitry.
+If this approach were to be taken, the gate complexity of
+halfoperand modification would be reduced by slightly more than half.
+
+\subsection{2Bit and 4Bit Field Widths}
+
+Beyond the halfoperand modifiers, extension of core SWAR
+operations to 2bit and 4bit field widths is critical to
+inductive doubling support. The principal operations
+that need to be supported in this way are addition, pack, merge
+merge, and rotate.
+
+Addition for 4bit fields in a 128bit SWAR processor may be implemented
+as a modification to that for 8bit fields by incorporating logic to
+disable carry propagation at the 16 midfield boundaries. For 2bit
+fields, disabling carry propagation at 32 additional boundaries suffices,
+although it may be simpler to directly implement the simple logic
+of 2bit adders.
+
+Pack and merge require bit selection logic for each field width.
+A straightforward implementation model for each operation
+uses 128 2input and gates to select the desired bits from the
+operand registers and another 128 2input or gates to feed
+these results into the destination register.
+
+Rotation for 2bit fields involves simple logic for selecting
+between the 2 bits of each field of the operand being
+rotated on the basis of the low bit of each field of the
+rotation count. Rotation for 4bit fields is more complex,
+but can also be based on 1of4 selection circuitry
+involving the low 2 bits of the rotation count fields.
+
+\subsection{128Bit Field Widths}
+
+For completeness, the IDISA model requires core operations
+to be implemented at the full register width, as well
+as powerof2 partitions. This may be problematic for
+operations such as addition due to the inherent delays
+in 128bit carry propagation. However, the primary
+role of 128 bit operations in inductive doubling
+is to combine two 64bit fields using \verb##
+operand combinations. In view of this, it may be
+reasonable to define hardware support for such
+combinations to be based on 64bit logic, with support
+for 128bit logic implemented through firmware or software.
+
+\subsection{Final Notes and Further Tradeoffs}
+
+In order to present IDISA as a concept for design extension of
+any SWAR architecture, our discussion of gatelevel implementation
+is necessarily abstract. Additional circuitry is sure to be
+required, for example, in implementation of SIDU. However,
+in context of the 128bit reference architectures studied,
+our analysis suggests realistic IDISA implementation well
+within a 10,000 gate cost.
+
+However, the additional circuitry required may be offset
+by elimination of specialpurpose instructions found in
+existing processors that could instead be implemented through efficient
+IDISA sequences. These include examples such as population
+count, count leading and/or trailing zeroes and parity.
+They also include specialized horizontal SIMD operations.
+Thus, design tradeoffs can be made with the potential of
+reducing the chip area devoted to special purpose instructions
+in favor of more general IDISA features.
\section{Conclusions}
@@ 1515,5 +1536,5 @@
and a realization of that principle in the form of
IDISA, a modified SWAR instruction set architecture.
IDISA features support for SWAR operations at all powerof2
+IDISA offers support for SWAR operations at all powerof2
field widths, including 2bit and 4bit widths, in particular,
as well as halfoperand modifiers and pack and merge operations
@@ 1527,6 +1548,10 @@
have been examined and shown to benefit from dramatic
reductions in instruction count compared to the best
known algorithms on existing architectures. In the case
of transposition algorithms to and from parallel bit stream
+known algorithms on reference architectures. This
+includes both a reference architecture modeled on
+the SWAR capabilities of the SSE family as well as
+an architecture incorporating the powerful permute
+or shuffle capabilities found in Altivec or Cell BE processors.
+In the case of transposition algorithms to and from parallel bit stream
form, the architecture has been shown to make possible
straightforward inductive doubling algorithms with the
@@ 1548,24 +1573,10 @@
a customized operand fetch unit to implement the halfoperand
modifier logic. Gatelevel implementation of this unit
has been analyzed and showed to be quite reasonable.

Many specialpurpose operations that are found in existing
processors could instead be implemented through efficient
IDISA sequences. These include examples such as population
count, count leading and/or trailing zeroes and parity.
They also include specialized horizontal SIMD operations.
Thus, design tradeoffs can be made with the potential of
reducing the chip area devoted to special purpose instructions
in favor of more general IDISA features.

Other tradeoffs may be possible in IDISA implementation itself.
Full support of IDISA features to the largest field widths
are not necessary in many cases. For example, a given 128bit
SIMD facility may support IDISA features only up to 32bit
or 64bit fields, similar to the current Altivec and SSE
architectures, respectively. It may also be possible to
reduce the complexity of operand fetch circuitry by a factor
of two by dedicating one operand to a possible high halfoperand
modifier and the other to a possible low halfoperand modifier.
+and operations at the 2bit and 4bit field widths have
+been analyzed and found to be quite reasonable within
+a 10,000 gate bound. Design tradeoffs to reduce the cost
+have also been discussed, possibly even leading to a
+net complexity reduction through elimination of instructions
+that implement specialcase versions of inductive doubling.
Future research may consider the extension of inductive doubling