Changes between Version 3 and Version 4 of ParabixTransform

Apr 22, 2014, 3:23:43 PM (4 years ago)



  • ParabixTransform

    v3 v4  
    1010Transform may be implemented.  Although the transformation can be performed in a serial
    1111fashion extracting one bit at a time from each byte, the overhead of the serial approach greatly limits its usefulness.   In this section, we concentrate instead on parallel methods that impose a relatively small overhead on the processing of character data streams.
     13=== Ideal Three-Stage Parallel Transposition for Byte Streams ===
     15The Parabix Transform to transform byte-oriented character stream data to bit-parallel data streams
     16can be implemented in a parallel fashion using the following
     17three-stage transposition strategy.
     19  1.  Transform the byte stream data into two parallel nybble streams, one for the high nybble of each byte, and one for the low nybble of each byte.   (A nybble is one-half of a byte, i.e., 4 bits).
     20  2.  Transform each of the nybble streams into two parallel bit-pair streams.  Each bit-pair stream consists of a stream of 2-bit units in one-to-one correspondence with the input byte stream.   The bit-pairs in the first such stream consist of bits 0 and 1 of each byte, the bit-pairs in the second stream consist of bits 2 and 3 of each byte, the bit-pairs in the third stream consist of bits 4 and 5 of each byte and the bit-pairs in the fourth and final bit-pair stream consist of the bits 6 and 7 of each byte.
     21  3.  Transform each of the bit-pair streams into two individual bit streams.
    1324=== Ideal Implementation Using SIMD Packing ===
    3546||{{{hsimd<8>::packh(e1, e2)}}}||{{{CdDdGgHh}}}||{{{LlMmQqRr}}}||
    37 Using these operations it is possible to perform transposition in a
     48Using these operations it is possible to implement the ideal transposition
     49strategy in a
    3850straightforward fashion.   Given a 2^k^-byte sequence held consecutively
    3951in 8 registers {{{s0}}}, {{{s1}}}, … {{{s7}}}, the following
    40523-step transformation process performs transposition to parallel bit streams.
     55// Transpose to 2 nybble streams
     56lo_nybble0 = hsimd::packl(s0, s1);
     57lo_nybble1 = hsimd::packl(s2, s3);
     58lo_nybble3 = hsimd::packl(s4, s5);
     59lo_nybble4 = hsimd::packl(s6, s7);
     60hi_nybble0 = hsimd::packh(s0, s1);
     61hi_nybble1 = hsimd::packh(s2, s3);
     62hi_nybble3 = hsimd::packh(s4, s5);
     63hi_nybble4 = hsimd::packh(s6, s7);
     64// Transpose 2 nybble streams to 4 bit-pair streams.
     65bit01pair_0 = hsimd::packl(lo_nybble0, lo_nybble1);
     66bit01pair_1 = hsimd::packl(lo_nybble2, lo_nybble3);
     67bit23pair_0 = hsimd::packh(lo_nybble0, lo_nybble1);
     68bit23pair_1 = hsimd::packh(lo_nybble2, lo_nybble3);
     69bit45pair_0 = hsimd::packl(hi_nybble0, hi_nybble1);
     70bit45pair_1 = hsimd::packl(hi_nybble2, hi_nybble3);
     71bit67pair_0 = hsimd::packh(hi_nybble0, hi_nybble1);
     72bit67pair_1 = hsimd::packh(hi_nybble2, hi_nybble3);
     73// Transpose 4 bit-pairs streams to 8 bit streams.
     74bit0 = hsimd::packl(bit01pair_0, bit01pair_1);
     75bit1 = hsimd::packh(bit01pair_0, bit01pair_1);
     76bit2 = hsimd::packl(bit23pair_0, bit23pair_1);
     77bit3 = hsimd::packh(bit23pair_0, bit23pair_1);
     78bit4 = hsimd::packl(bit45pair_0, bit45pair_1);
     79bit5 = hsimd::packh(bit45pair_0, bit45pair_1);
     80bit6 = hsimd::packl(bit67pair_0, bit67pair_1);
     81bit7 = hsimd::packh(bit67pair_0, bit67pair_1);
     84Overall, transposition requires 8 pack operations for each of the three transposition steps, for a total
     85of 24 operations for the entire process.