Changeset 258 for docs/ASPLOS09


Ignore:
Timestamp:
Jan 4, 2009, 6:51:36 PM (11 years ago)
Author:
cameron
Message:

Spelling corrections

Location:
docs/ASPLOS09
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/ASPLOS09/asplos094-cameron.tex

    r257 r258  
    7373to parallelize, concluding that nothing seems to help.  Indeed,
    7474the study even speculates that applications in this area may simply be
    75 ``embarrasingly sequential,'' easy to tackle for traditional
     75``embarrassingly sequential,'' easy to tackle for traditional
    7676sequential processing approaches suitable for uniprocessors,
    7777but perhaps fundamentally unsuited to parallel methods.
     
    104104required a mere 15 cycles per byte, while a range of other technologies
    105105required from 25 to 200 cycles per byte \cite{Herdy}.
    106 Ongoing work is further applying the parallel bit stream methds to
     106Ongoing work is further applying the parallel bit stream methods to
    107107parallel hash value computation and parallel regular expression matching
    108108for the purpose of validating XML datatype declarations in
     
    409409primary concern for performance improvement of these applications.
    410410
    411 The additional circuity complexity to realize IDISA-A and
     411The additional circuit complexity to realize IDISA-A and
    412412IDISA-B designs over their reference models will be
    413413addressed in the penultimate section.  That discussion
     
    663663by the \verb#simd<16>::pack# with 4 additional operations to
    664664prepare operands.  Essentially, the RefB implementation
    665 uses single permuite instructions to implement the equivalent of
     665uses single permute instructions to implement the equivalent of
    666666\verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#.
    667667The RefA implementation also requires 3 logic operations to implement
     
    674674Using IDISA, it is possible to design
    675675a transposition algorithm that is both easier to understand and requires
    676 many fewer operations than the the bytepack algorithm described above.
     676many fewer operations than the the BytePack algorithm described above.
    677677We call it the inductive halving algorithm for serial to parallel
    678678transposition, because it proceeds by reducing byte streams to
    679679two sets of nybble streams in a first stage, dividing the nybble
    680 streams into streams of bitpairs in a second stage and finally
    681 dividing the bitpair streams into bit streams in the third stage.
     680streams into streams of bit-pairs in a second stage and finally
     681dividing the bit-pair streams into bit streams in the third stage.
    682682
    683683Figure \ref{halvingstep} shows one step in stage 1 of the inductive
     
    685685The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte
    686686from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts
    687 the low nybble of each byte.  As in the bytepack algorithm, this step is
     687the low nybble of each byte.  As in the BytePack algorithm, this step is
    688688applied 4 times in stage 1, for a total of 8 operations.
    689689
     
    794794transform, to separately compute the high and low byte
    795795streams of each UTF-16 code unit.  Those two byte streams
    796 are subsequentely merged to form the final result.
     796are subsequently merged to form the final result.
    797797
    798798Algorithms for performing the inverse transform mirror those
     
    847847operations.  For each of the four pairs of consecutive
    848848even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
    849 two consecutive registers of bitpair data are produced.
     849two consecutive registers of bit-pair data are produced.
    850850In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
    851 are then applied to merge to bitpair streams to produce streams
     851are then applied to merge to bit-pair streams to produce streams
    852852of nybbles for the high and low nybble of each byte.  Finally,
    853853the nybble streams are merged in stage 3 to produce the
     
    965965
    966966At each inductive doubling level, it requires 4 operations to compute the
    967 required deletion infomation and one operation per bit stream to perform deletion.
     967required deletion information and one operation per bit stream to perform deletion.
    968968Note that, if deletion is to be applied to a set of eight parallel bit streams,
    969969the computed deletion information is used for each stream without recomputation,
     
    10831083\begin{center}\small
    10841084\begin{verbatim}
    1085 t1=simd<8>:const(10)
    1086 t2=simd<16>:const(100)
    1087 t3=simd<32>:const(10000)
     1085t1=simd<8>:constant(10)
     1086t2=simd<16>:constant(100)
     1087t3=simd<32>:constant(10000)
    10881088b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d)
    10891089b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b)
     
    11311131Further applications of IDISA can often be found
    11321132by searching for algorithms employing the magic masks
    1133 s \verb:0x55555555:,  \verb:0x33333333:, and so on.
    1134 Examples include the bitslice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
     1133 \verb:0x55555555:,  \verb:0x33333333:, and so on.
     1134Examples include the bit-slice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
    11351135and integer contraction and dilation for quadtrees and
    1136 octtrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
     1136octrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
    11371137Pixel packing from 32 bit fields into a 5:5:5 representation
    11381138is a further application of parallel bit deletion.
     
    12031203be realized through $h$ inductive double steps
    12041204in a similar fashion.
    1205 Thus, IDISA esssentially offers systematic support
     1205Thus, IDISA essentially offers systematic support
    12061206for horizontal operations entirely through the
    12071207use of \verb:<h,l>: half-operand modifier
     
    13731373are required for combination with the $\neg h$  and $r[i]$ terms.  The terms for
    13741374the low and high half-operand modifiers are then combined with an
    1375 additional 127 2-input or gates.   Thus, the circuity complexity
     1375additional 127 2-input or gates.   Thus, the circuitry complexity
    13761376for the combinational logic implementation of half-operand
    13771377modifiers within the SOFU is 1279 2-input gates per operand,
     
    14021402Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
    14031403as a modification to that for 8-bit fields by incorporating logic to
    1404 disable carry propagation at the 16 midfield boundaries.  For 2-bit
     1404disable carry propagation at the 16 mid-field boundaries.  For 2-bit
    14051405fields, disabling carry propagation at 32 additional boundaries suffices,
    14061406although it may be simpler to directly implement the simple logic
     
    14651465as well as half-operand modifiers and pack and merge operations
    14661466to support efficient transition between successive power-of-two
    1467 field widths.  The principle innovation is the notion of
     1467field widths.  The principal innovation is the notion of
    14681468half-operand modifiers that eliminate the cost associated
    14691469with the explicit mask and shift operations required for
     
    15801580\bibitem{DBLP:conf/cans/RebeiroSD06}
    15811581Chester Rebeiro, A.~David Selvakumar, and A.~S.~L. Devi.
    1582 \newblock Bitslice implementation of aes.
     1582\newblock Bitslice implementation of AES.
    15831583\newblock In David Pointcheval, Yi~Mu, and Kefei Chen, editors, {\em CANS},
    15841584  volume 4301 of {\em Lecture Notes in Computer Science}, pages 203--212.
Note: See TracChangeset for help on using the changeset viewer.