# Changeset 258

Ignore:
Timestamp:
Jan 4, 2009, 6:51:36 PM (10 years ago)
Message:

Spelling corrections

Location:
docs/ASPLOS09
Files:
2 edited

### Legend:

Unmodified
 r257 to parallelize, concluding that nothing seems to help.  Indeed, the study even speculates that applications in this area may simply be embarrasingly sequential,'' easy to tackle for traditional embarrassingly sequential,'' easy to tackle for traditional sequential processing approaches suitable for uniprocessors, but perhaps fundamentally unsuited to parallel methods. required a mere 15 cycles per byte, while a range of other technologies required from 25 to 200 cycles per byte \cite{Herdy}. Ongoing work is further applying the parallel bit stream methds to Ongoing work is further applying the parallel bit stream methods to parallel hash value computation and parallel regular expression matching for the purpose of validating XML datatype declarations in primary concern for performance improvement of these applications. The additional circuity complexity to realize IDISA-A and The additional circuit complexity to realize IDISA-A and IDISA-B designs over their reference models will be addressed in the penultimate section.  That discussion by the \verb#simd<16>::pack# with 4 additional operations to prepare operands.  Essentially, the RefB implementation uses single permuite instructions to implement the equivalent of uses single permute instructions to implement the equivalent of \verb#simd<16>::pack(s0, s1)# and \verb#simd<16>::pack(s0, s1)#. The RefA implementation also requires 3 logic operations to implement Using IDISA, it is possible to design a transposition algorithm that is both easier to understand and requires many fewer operations than the the bytepack algorithm described above. many fewer operations than the the BytePack algorithm described above. We call it the inductive halving algorithm for serial to parallel transposition, because it proceeds by reducing byte streams to two sets of nybble streams in a first stage, dividing the nybble streams into streams of bitpairs in a second stage and finally dividing the bitpair streams into bit streams in the third stage. streams into streams of bit-pairs in a second stage and finally dividing the bit-pair streams into bit streams in the third stage. Figure \ref{halvingstep} shows one step in stage 1 of the inductive The \verb#simd<8>::pack# operation extracts the high nybble of each byte from the input registers, while the \verb#simd<8>::pack# operation extracts the low nybble of each byte.  As in the bytepack algorithm, this step is the low nybble of each byte.  As in the BytePack algorithm, this step is applied 4 times in stage 1, for a total of 8 operations. transform, to separately compute the high and low byte streams of each UTF-16 code unit.  Those two byte streams are subsequentely merged to form the final result. are subsequently merged to form the final result. Algorithms for performing the inverse transform mirror those operations.  For each of the four pairs of consecutive even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7), two consecutive registers of bitpair data are produced. two consecutive registers of bit-pair data are produced. In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel# are then applied to merge to bitpair streams to produce streams are then applied to merge to bit-pair streams to produce streams of nybbles for the high and low nybble of each byte.  Finally, the nybble streams are merged in stage 3 to produce the At each inductive doubling level, it requires 4 operations to compute the required deletion infomation and one operation per bit stream to perform deletion. required deletion information and one operation per bit stream to perform deletion. Note that, if deletion is to be applied to a set of eight parallel bit streams, the computed deletion information is used for each stream without recomputation, \begin{center}\small \begin{verbatim} t1=simd<8>:const(10) t2=simd<16>:const(100) t3=simd<32>:const(10000) t1=simd<8>:constant(10) t2=simd<16>:constant(100) t3=simd<32>:constant(10000) b=simd<8>::add(simd<8>::mult(d,t1), d) b=simd<16>::add(simd<16>::mult(b,t2), b) Further applications of IDISA can often be found by searching for algorithms employing the magic masks s \verb:0x55555555:,  \verb:0x33333333:, and so on. Examples include the bitslice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06} \verb:0x55555555:,  \verb:0x33333333:, and so on. Examples include the bit-slice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06} and integer contraction and dilation for quadtrees and octtrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}. octrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}. Pixel packing from 32 bit fields into a 5:5:5 representation is a further application of parallel bit deletion. be realized through $h$ inductive double steps in a similar fashion. Thus, IDISA esssentially offers systematic support Thus, IDISA essentially offers systematic support for horizontal operations entirely through the use of \verb:: half-operand modifier are required for combination with the $\neg h$  and $r[i]$ terms.  The terms for the low and high half-operand modifiers are then combined with an additional 127 2-input or gates.   Thus, the circuity complexity additional 127 2-input or gates.   Thus, the circuitry complexity for the combinational logic implementation of half-operand modifiers within the SOFU is 1279 2-input gates per operand, Addition for 4-bit fields in a 128-bit SWAR processor may be implemented as a modification to that for 8-bit fields by incorporating logic to disable carry propagation at the 16 midfield boundaries.  For 2-bit disable carry propagation at the 16 mid-field boundaries.  For 2-bit fields, disabling carry propagation at 32 additional boundaries suffices, although it may be simpler to directly implement the simple logic as well as half-operand modifiers and pack and merge operations to support efficient transition between successive power-of-two field widths.  The principle innovation is the notion of field widths.  The principal innovation is the notion of half-operand modifiers that eliminate the cost associated with the explicit mask and shift operations required for \bibitem{DBLP:conf/cans/RebeiroSD06} Chester Rebeiro, A.~David Selvakumar, and A.~S.~L. Devi. \newblock Bitslice implementation of aes. \newblock Bitslice implementation of AES. \newblock In David Pointcheval, Yi~Mu, and Kefei Chen, editors, {\em CANS}, volume 4301 of {\em Lecture Notes in Computer Science}, pages 203--212.