# Changeset 235 for docs/ASPLOS09

Ignore:
Timestamp:
Dec 15, 2008, 10:22:57 AM (10 years ago)
Message:

Strengthened Conclusions

Location:
docs/ASPLOS09
Files:
2 edited

### Legend:

Unmodified
 r234 \section{Parallel Bit Deletion} Parallel bit deletion is an important operation that allows string editing operations to be carried out while in parallel bit stream form.  It is also fundamental to UTF-8 to UTF-16 transcoding using parallel bit streams, allowing the excess code unit positions for UTF-8 two-, three- and four-byte sequences to be deleted once the sixteen parallel bit streams of UTF-16 have been computed \cite{PPoPP08}. Parallel bit deletion is specified using a deletion mask. A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits to be deleted and 0s at positions identifying bits to be retained. For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:. Bit deletion may be performed using the parallel-prefix compress algorithm documented by Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses only logic and shifts with a constant parameter to carry out the deletion process.  However, it requires $k^2$ preprocessing steps for a final field width parameter of size $2^k$, as well as 4 operations per deletion step per stream.  Using the inductive doubling instruction set architecture it is possible to carry out bit deletion much more efficiently. Deletion within fixed size fields or registers may produce results that are either left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't care positions marked \verb:x:'.  Concatenating an adjacent right-justified result with a left-justified result produces an important intermediate form known as a {\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective right-justified and left-justified results from the application of the 4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces the central result \verb:xxbdefhx:, which may easily be converted to a either a left or a right justified 8-element result by an appropriate shift operation. \begin{figure*} \begin{figure*}[tbh] \begin{center} \begin{tabular}{|c||c|c|c|c|c|c|c|c|} \end{figure*} Parallel bit deletion is an important operation that allows string editing operations to be carried out while in parallel bit stream form.  It is also fundamental to UTF-8 to UTF-16 transcoding using parallel bit streams, allowing the excess code unit positions for UTF-8 two-, three- and four-byte sequences to be deleted once the sixteen parallel bit streams of UTF-16 have been computed \cite{PPoPP08}. Parallel bit deletion is specified using a deletion mask. A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits to be deleted and 0s at positions identifying bits to be retained. For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:. Bit deletion may be performed using the parallel-prefix compress algorithm documented by Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses only logic and shifts with a constant parameter to carry out the deletion process.  However, it requires $k^2$ preprocessing steps for a final field width parameter of size $2^k$, as well as 4 operations per deletion step per stream.  Using the inductive doubling instruction set architecture it is possible to carry out bit deletion much more efficiently. Deletion within fixed size fields or registers may produce results that are either left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't care positions marked \verb:x:'.  Concatenating an adjacent right-justified result with a left-justified result produces an important intermediate form known as a {\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective right-justified and left-justified results from the application of the 4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces the central result \verb:xxbdefhx:, which may easily be converted to a either a left or a right justified 8-element result by an appropriate shift operation. The observation about how two $n$-bit central deletion results can combine to yield a $2n$ central deletion result provides the basis \begin{verbatim} x = x - 1; x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >>16); x = x + 1; \end{verbatim} \subsection{Operand Fetch Unit Logic} Discussion of gate-level implementation. \section{Conclusions} This paper has considered the issue of architectural support for SIMD text processing using the method of parallel bit streams and has argued that this architectural support can best be provided through a SIMD instruction set architecture that implements features for direct support of inductive doubling algorithms. Four key features of the inductive doubling architecture have been identified include support for operations at all power-of-2 field widths, half-operand modifiers and pack and merge operations.  The principle innovation is the notion of half-operand modifiers to support efficient transition between successive power-of-two field widths. In considering the architectural support for SIMD text processing using the method of parallel bit streams, this paper has presented the principle of inductive doubling and a realization of that principle in the form of IDISA, a modified SWAR instruction set architecture. IDISA features support for SWAR operations at all power-of-2 field widths, including 2-bit and 4-bit widths, in particular, as well as half-operand modifiers and pack and merge operations to support efficient transition between successive power-of-two field widths.  The principle innovation is the notion of half-operand modifiers that eliminate the cost associated with the explicit mask and shift operations required for such transitions. Several algorithms key to parallel bit stream methods have been examined and shown to benefit from dramatic reductions in instruction count compared to the best known algorithms on existing architecture.  In the case known algorithms on existing architectures.  In the case of transposition algorithms to and from parallel bit stream form, the architecture has been shown to make possible straightforward inductive doubling algorithms with the lowest total number of operations that can be achieved by any possible 3-register SIMD architecture. The inductive doubling architecture also has considerable benefits beyond its role in supporting SIMD programming with parallel bit streams.  Notable among these is that the architecture provides a framework for systematic support of horizontal SIMD operations. lowest total number of operations that can be achieved by any possible 3-register SWAR architecture. Applications of IDISA in other areas have also been examined.  The support for 2-bit and 4-bit field widths in SWAR processing is beneficial for packed DNA representations and packed decimal representations respectively.  Additional inductive doubling examples are presented and the phenomenon of power-of-2 transitions discussed more broadly. Most significantly, IDISA supports a fully general approach to horizontal SWAR operations, offering a considerable improvement over the {\em ad hoc} sets of special-purpose horizontal operations found in existing SWAR instruction sets. An IDISA implementation model has been presented employing a customized operand fetch unit to implement the half-operand modifier logic.  Gate-level implementation of this unit has been analyzed and showed to be quite reasonable. Many special-purpose operations that are found in existing processors could instead be implemented through efficient IDISA sequences.  These include examples such as population count, count leading and/or trailing zeroes and parity. They also include specialized horizontal SIMD operations. Thus, design tradeoffs can be made with the potential of reducing the chip area devoted to special purpose instructions in favor of more general IDISA features. Other tradeoffs may be possible in IDISA implementation itself. Full support of IDISA features to the largest field widths are not necessary in many cases.   For example, a given 128-bit SIMD facility may support IDISA features only up to 32-bit or 64-bit fields, similar to the current Altivec and SSE architectures, respectively.   It may also be possible to reduce the complexity of operand fetch circuitry by a factor of two by dedicating one operand to a possible high half-operand modifier and the other to a possible low half-operand modifier. Future research may consider the extension of inductive doubling support in further ways.   For example, it may be possible to develop a pipelined architecture supporting two or three steps of inductive doubling in a single operation. %\appendix