Changeset 258 for docs/ASPLOS09/asplos094cameron.tex
 Timestamp:
 Jan 4, 2009, 6:51:36 PM (11 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r257 r258 73 73 to parallelize, concluding that nothing seems to help. Indeed, 74 74 the study even speculates that applications in this area may simply be 75 ``embarras ingly sequential,'' easy to tackle for traditional75 ``embarrassingly sequential,'' easy to tackle for traditional 76 76 sequential processing approaches suitable for uniprocessors, 77 77 but perhaps fundamentally unsuited to parallel methods. … … 104 104 required a mere 15 cycles per byte, while a range of other technologies 105 105 required from 25 to 200 cycles per byte \cite{Herdy}. 106 Ongoing work is further applying the parallel bit stream meth ds to106 Ongoing work is further applying the parallel bit stream methods to 107 107 parallel hash value computation and parallel regular expression matching 108 108 for the purpose of validating XML datatype declarations in … … 409 409 primary concern for performance improvement of these applications. 410 410 411 The additional circuit ycomplexity to realize IDISAA and411 The additional circuit complexity to realize IDISAA and 412 412 IDISAB designs over their reference models will be 413 413 addressed in the penultimate section. That discussion … … 663 663 by the \verb#simd<16>::pack# with 4 additional operations to 664 664 prepare operands. Essentially, the RefB implementation 665 uses single permu ite instructions to implement the equivalent of665 uses single permute instructions to implement the equivalent of 666 666 \verb#simd<16>::pack<h,h>(s0, s1)# and \verb#simd<16>::pack<l,l>(s0, s1)#. 667 667 The RefA implementation also requires 3 logic operations to implement … … 674 674 Using IDISA, it is possible to design 675 675 a transposition algorithm that is both easier to understand and requires 676 many fewer operations than the the bytepack algorithm described above.676 many fewer operations than the the BytePack algorithm described above. 677 677 We call it the inductive halving algorithm for serial to parallel 678 678 transposition, because it proceeds by reducing byte streams to 679 679 two sets of nybble streams in a first stage, dividing the nybble 680 streams into streams of bit pairs in a second stage and finally681 dividing the bit pair streams into bit streams in the third stage.680 streams into streams of bitpairs in a second stage and finally 681 dividing the bitpair streams into bit streams in the third stage. 682 682 683 683 Figure \ref{halvingstep} shows one step in stage 1 of the inductive … … 685 685 The \verb#simd<8>::pack<h,h># operation extracts the high nybble of each byte 686 686 from the input registers, while the \verb#simd<8>::pack<l,l># operation extracts 687 the low nybble of each byte. As in the bytepack algorithm, this step is687 the low nybble of each byte. As in the BytePack algorithm, this step is 688 688 applied 4 times in stage 1, for a total of 8 operations. 689 689 … … 794 794 transform, to separately compute the high and low byte 795 795 streams of each UTF16 code unit. Those two byte streams 796 are subsequent ely merged to form the final result.796 are subsequently merged to form the final result. 797 797 798 798 Algorithms for performing the inverse transform mirror those … … 847 847 operations. For each of the four pairs of consecutive 848 848 even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7), 849 two consecutive registers of bit pair data are produced.849 two consecutive registers of bitpair data are produced. 850 850 In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel# 851 are then applied to merge to bit pair streams to produce streams851 are then applied to merge to bitpair streams to produce streams 852 852 of nybbles for the high and low nybble of each byte. Finally, 853 853 the nybble streams are merged in stage 3 to produce the … … 965 965 966 966 At each inductive doubling level, it requires 4 operations to compute the 967 required deletion info mation and one operation per bit stream to perform deletion.967 required deletion information and one operation per bit stream to perform deletion. 968 968 Note that, if deletion is to be applied to a set of eight parallel bit streams, 969 969 the computed deletion information is used for each stream without recomputation, … … 1083 1083 \begin{center}\small 1084 1084 \begin{verbatim} 1085 t1=simd<8>:const (10)1086 t2=simd<16>:const (100)1087 t3=simd<32>:const (10000)1085 t1=simd<8>:constant(10) 1086 t2=simd<16>:constant(100) 1087 t3=simd<32>:constant(10000) 1088 1088 b=simd<8>::add<x,l>(simd<8>::mult<h,x>(d,t1), d) 1089 1089 b=simd<16>::add<x,l>(simd<16>::mult<h,x>(b,t2), b) … … 1131 1131 Further applications of IDISA can often be found 1132 1132 by searching for algorithms employing the magic masks 1133 s\verb:0x55555555:, \verb:0x33333333:, and so on.1134 Examples include the bit slice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}1133 \verb:0x55555555:, \verb:0x33333333:, and so on. 1134 Examples include the bitslice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06} 1135 1135 and integer contraction and dilation for quadtrees and 1136 oct trees\cite{Stocco95} and Mortonordered arrays \cite{Raman08}.1136 octrees\cite{Stocco95} and Mortonordered arrays \cite{Raman08}. 1137 1137 Pixel packing from 32 bit fields into a 5:5:5 representation 1138 1138 is a further application of parallel bit deletion. … … 1203 1203 be realized through $h$ inductive double steps 1204 1204 in a similar fashion. 1205 Thus, IDISA ess sentially offers systematic support1205 Thus, IDISA essentially offers systematic support 1206 1206 for horizontal operations entirely through the 1207 1207 use of \verb:<h,l>: halfoperand modifier … … 1373 1373 are required for combination with the $\neg h$ and $r[i]$ terms. The terms for 1374 1374 the low and high halfoperand modifiers are then combined with an 1375 additional 127 2input or gates. Thus, the circuit y complexity1375 additional 127 2input or gates. Thus, the circuitry complexity 1376 1376 for the combinational logic implementation of halfoperand 1377 1377 modifiers within the SOFU is 1279 2input gates per operand, … … 1402 1402 Addition for 4bit fields in a 128bit SWAR processor may be implemented 1403 1403 as a modification to that for 8bit fields by incorporating logic to 1404 disable carry propagation at the 16 mid field boundaries. For 2bit1404 disable carry propagation at the 16 midfield boundaries. For 2bit 1405 1405 fields, disabling carry propagation at 32 additional boundaries suffices, 1406 1406 although it may be simpler to directly implement the simple logic … … 1465 1465 as well as halfoperand modifiers and pack and merge operations 1466 1466 to support efficient transition between successive poweroftwo 1467 field widths. The princip leinnovation is the notion of1467 field widths. The principal innovation is the notion of 1468 1468 halfoperand modifiers that eliminate the cost associated 1469 1469 with the explicit mask and shift operations required for … … 1580 1580 \bibitem{DBLP:conf/cans/RebeiroSD06} 1581 1581 Chester Rebeiro, A.~David Selvakumar, and A.~S.~L. Devi. 1582 \newblock Bitslice implementation of aes.1582 \newblock Bitslice implementation of AES. 1583 1583 \newblock In David Pointcheval, Yi~Mu, and Kefei Chen, editors, {\em CANS}, 1584 1584 volume 4301 of {\em Lecture Notes in Computer Science}, pages 203212.
Note: See TracChangeset
for help on using the changeset viewer.