Changeset 235 for docs


Ignore:
Timestamp:
Dec 15, 2008, 10:22:57 AM (11 years ago)
Author:
cameron
Message:

Strengthened Conclusions

Location:
docs/ASPLOS09
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/ASPLOS09/asplos094-cameron.tex

    r234 r235  
    862862\section{Parallel Bit Deletion}
    863863
    864 Parallel bit deletion is an important operation that allows string
    865 editing operations to be carried out while in parallel bit stream
    866 form.  It is also fundamental to UTF-8 to UTF-16 transcoding
    867 using parallel bit streams, allowing the excess code unit
    868 positions for UTF-8 two-, three- and four-byte sequences to
    869 be deleted once the sixteen parallel bit streams of UTF-16 have
    870 been computed \cite{PPoPP08}.
    871 
    872 Parallel bit deletion is specified using a deletion mask.
    873 A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
    874 to be deleted and 0s at positions identifying bits to be retained.
    875 For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
    876 bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
    877 accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
    878 
    879 Bit deletion may be performed using
    880 the parallel-prefix compress algorithm documented by
    881 Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
    882 only logic and shifts with a constant parameter to carry
    883 out the deletion process.  However, it requires $k^2$
    884 preprocessing steps for a final field width parameter
    885 of size $2^k$, as well as 4 operations per deletion step
    886 per stream.  Using the inductive doubling instruction set architecture
    887 it is possible to carry out bit deletion much more efficiently.
    888 
    889 Deletion within fixed size fields or registers may produce results that are either
    890 left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
    891 eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
    892 care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
    893 left-justified result produces an important intermediate form known as a
    894 {\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
    895 right-justified and left-justified results from the application of the
    896 4-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
    897 stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
    898 the central result \verb:xxbdefhx:, which may easily be converted to a either a
    899 left or a right justified 8-element result by an appropriate shift operation.
    900 
    901 \begin{figure*}
     864\begin{figure*}[tbh]
    902865\begin{center}
    903866\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
     
    917880\end{figure*}
    918881
     882Parallel bit deletion is an important operation that allows string
     883editing operations to be carried out while in parallel bit stream
     884form.  It is also fundamental to UTF-8 to UTF-16 transcoding
     885using parallel bit streams, allowing the excess code unit
     886positions for UTF-8 two-, three- and four-byte sequences to
     887be deleted once the sixteen parallel bit streams of UTF-16 have
     888been computed \cite{PPoPP08}.
     889
     890Parallel bit deletion is specified using a deletion mask.
     891A deletion mask is defined as a bit stream consisting of 1s at positions identifying bits
     892to be deleted and 0s at positions identifying bits to be retained.
     893For example, consider an 8-bit deletion mask \verb:10100010: and two corresponding 8-element parallel
     894bit streams \verb:abcdefgh: and \verb:ABCDEFGH:.  Parallel deletion of elements from both bit streams in
     895accordance with the mask yields two five element streams, i.e., \verb:bdefh: and \verb:BDEFH:.
     896
     897Bit deletion may be performed using
     898the parallel-prefix compress algorithm documented by
     899Warren and attributed to Steele \cite{HackersDelight}.  This algorithm uses
     900only logic and shifts with a constant parameter to carry
     901out the deletion process.  However, it requires $k^2$
     902preprocessing steps for a final field width parameter
     903of size $2^k$, as well as 4 operations per deletion step
     904per stream.  Using the inductive doubling instruction set architecture
     905it is possible to carry out bit deletion much more efficiently.
     906
     907Deletion within fixed size fields or registers may produce results that are either
     908left justified or right-justified.  For example, a five-element stream \verb:bdefh: within an
     909eight-element field may be represented as either \verb:bdefhxxx: or \verb:xxxbdefh:, with don't
     910care positions marked `\verb:x:'.  Concatenating an adjacent right-justified result with a
     911left-justified result produces an important intermediate form known as a
     912{\em central deletion result}.  For example, \verb:xxbd: and \verb:efhx: may be respective
     913right-justified and left-justified results from the application of the
     9144-bit deletion masks \verb:1010: and \verb:0010: to the two consecutive 4-element
     915stream segments \verb:abcd: and \verb:efgh:.  Concatenation of \verb:xxbd: and \verb:efhx: produces
     916the central result \verb:xxbdefhx:, which may easily be converted to a either a
     917left or a right justified 8-element result by an appropriate shift operation.
     918
     919
    919920The observation about how two $n$-bit central deletion results can
    920921combine to yield a $2n$ central deletion result provides the basis
     
    10031004\begin{verbatim}
    10041005x = x - 1;
    1005 
    10061006x = x | (x >> 1);
    1007 
    10081007x = x | (x >> 2);
    1009 
    10101008x = x | (x >> 4);
    1011 
    10121009x = x | (x >> 8);
    1013 
    10141010x = x | (x >>16);
    1015 
    10161011x = x + 1;
    10171012\end{verbatim}
     
    13851380\subsection{Operand Fetch Unit Logic}
    13861381
    1387 
    1388 
    1389 
    1390 
     1382Discussion of gate-level implementation.
    13911383
    13921384
    13931385\section{Conclusions}
    13941386
    1395 This paper has considered the issue of architectural support for
    1396 SIMD text processing using the method of parallel bit streams and has
    1397 argued that this architectural support can best be provided
    1398 through a SIMD instruction set architecture that implements
    1399 features for direct support of inductive doubling algorithms.
    1400 Four key features of the inductive doubling architecture have
    1401 been identified include support for operations at all
    1402 power-of-2 field widths, half-operand modifiers and
    1403 pack and merge operations.  The principle innovation is the
    1404 notion of half-operand modifiers to support efficient
    1405 transition between successive power-of-two field widths.
     1387In considering the architectural support for
     1388SIMD text processing using the method of parallel bit streams,
     1389this paper has presented the principle of inductive doubling
     1390and a realization of that principle in the form of
     1391IDISA, a modified SWAR instruction set architecture.
     1392IDISA features support for SWAR operations at all power-of-2
     1393field widths, including 2-bit and 4-bit widths, in particular,
     1394as well as half-operand modifiers and pack and merge operations
     1395to support efficient transition between successive power-of-two
     1396field widths.  The principle innovation is the notion of
     1397half-operand modifiers that eliminate the cost associated
     1398with the explicit mask and shift operations required for
     1399such transitions.
    14061400
    14071401Several algorithms key to parallel bit stream methods
    14081402have been examined and shown to benefit from dramatic
    14091403reductions in instruction count compared to the best
    1410 known algorithms on existing architecture.  In the case
     1404known algorithms on existing architectures.  In the case
    14111405of transposition algorithms to and from parallel bit stream
    14121406form, the architecture has been shown to make possible
    14131407straightforward inductive doubling algorithms with the
    1414 lowest total number of operations that can be achieved by any
    1415 possible 3-register SIMD architecture.
    1416 
    1417 The inductive doubling architecture also has considerable
    1418 benefits beyond its role in supporting SIMD programming
    1419 with parallel bit streams.  Notable among these is
    1420 that the architecture provides a framework for systematic
    1421 support of horizontal SIMD operations.
    1422 
    1423 
     1408lowest total number of operations that can be achieved by
     1409any possible 3-register SWAR architecture.
     1410
     1411Applications of IDISA in other areas have also been
     1412examined.  The support for 2-bit and 4-bit field widths
     1413in SWAR processing is beneficial for packed DNA representations
     1414and packed decimal representations respectively.  Additional
     1415inductive doubling examples are presented and the phenomenon
     1416of power-of-2 transitions discussed more broadly.
     1417Most significantly, IDISA supports a fully general approach
     1418to horizontal SWAR operations, offering a considerable
     1419improvement over the {\em ad hoc} sets of special-purpose
     1420horizontal operations found in existing SWAR instruction sets.
     1421
     1422An IDISA implementation model has been presented employing
     1423a customized operand fetch unit to implement the half-operand
     1424modifier logic.  Gate-level implementation of this unit
     1425has been analyzed and showed to be quite reasonable.
     1426
     1427Many special-purpose operations that are found in existing
     1428processors could instead be implemented through efficient
     1429IDISA sequences.  These include examples such as population
     1430count, count leading and/or trailing zeroes and parity.
     1431They also include specialized horizontal SIMD operations.
     1432Thus, design tradeoffs can be made with the potential of
     1433reducing the chip area devoted to special purpose instructions
     1434in favor of more general IDISA features.
     1435
     1436Other tradeoffs may be possible in IDISA implementation itself.
     1437Full support of IDISA features to the largest field widths
     1438are not necessary in many cases.   For example, a given 128-bit
     1439SIMD facility may support IDISA features only up to 32-bit
     1440or 64-bit fields, similar to the current Altivec and SSE
     1441architectures, respectively.   It may also be possible to
     1442reduce the complexity of operand fetch circuitry by a factor
     1443of two by dedicating one operand to a possible high half-operand
     1444modifier and the other to a possible low half-operand modifier.
     1445
     1446Future research may consider the extension of inductive doubling
     1447support in further ways.   For example, it may be possible
     1448to develop a pipelined architecture supporting two or three
     1449steps of inductive doubling in a single operation.
    14241450
    14251451%\appendix
Note: See TracChangeset for help on using the changeset viewer.