Timestamp:
Aug 21, 2011, 4:20:30 PM (8 years ago)
Author:
ashriram
Message:

Working on evaluation. Fixed Figure sizes

File:
1 edited

  • docs/HPCA2012/07-avx.tex

    r1302 r1335  
    4141\subsection{256-bit AVX Operations}
    4242
    43 With the introduction of 256-bit SIMD registers, and under ideal conditions, one would anticipate a corresponding
    44 50\% reduction in the SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX
    45 implementation, Intel has focused primarily on floating point operations
    46 as opposed to integer-based operations. 
    47 256-bit SIMD is available for loads, stores, bitwise logic and
    48 floating point operations, whereas SIMD integer operations and shifts are
    49 only available in the 128-bit form.  Nevertheless, with loads, stores
    50 and bitwise logic comprising a major portion of the Parabix2
    51 SIMD instruction mix, a substantial reduction in instruction count
    52 and consequent performance improvement was anticipated but not achieved.
     43With the introduction of 256-bit SIMD registers, and under ideal
     44conditions, one would anticipate a corresponding 50\% reduction in the
     45SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX
     46implementation, Intel has focused primarily on floating point
     47operations as opposed to integer-based operations.  256-bit SIMD
     48is available for loads, stores, bitwise logic and floating point operations,
     49whereas SIMD integer operations and shifts are only available in the
     50128-bit form.  Nevertheless, with loads, stores and bitwise logic
     51comprising a major portion of the Parabix2 SIMD instruction mix, a
     52substantial reduction in instruction count and consequent performance
     53improvement was anticipated but not achieved.
    5354
    5455\subsection{Performance Results}
     
    7879256-bit AVX technology.
    7980
    80 Note that, in each workload, the number of non-SIMD instructions
    81 remains relatively constant across implementations.  As may be
    82 expected, however, the number of ``bitwise SIMD'' operations
    83 remains the same for both SSE and 128-bit AVX while dropping
    84 dramatically when operating on 256 bits at a time.   Ideally
    85 one may expect up to a 50\% reduction in these instructions versus
    86 the 128-bit AVX.  The actual reduction measured was 32\%--39\%
    87 depending on workload.   Because some bitwise logic is needed
    88 in the implementation of simulated 256-bit operations, the full 50\%
    89 reduction in bitwise logic was not achieved.
     81Note that, in each workload, the number of non-SIMD instructions
     82remains relatively constant across implementations.  As may be
     83expected, however, the number of ``bitwise SIMD'' operations remains
     84the same for both SSE and 128-bit AVX while dropping dramatically when
     85operating on 256 bits at a time.  Ideally one may expect up to a 50\%
     86reduction in these instructions versus the 128-bit AVX.  The actual
     87reduction measured was 32\%--39\% depending on workload.  Because some
     88bitwise logic is needed in the implementation of simulated 256-bit
     89operations, the full 50\% reduction in bitwise logic was not achieved.
    9090
    9191The ``other SIMD'' class shows a substantial 30\%--35\% reduction
     
    9898While the successive reductions in SIMD instruction counts are quite
    9999dramatic with the two AVX implementations of Parabix2, the performance
    100 benefits are another story.   As shown in Figure \ref{avx}, the
    101 benefits of the reduced SIMD instruction count are achieved only
    102 in the AVX 128-bit version.  In this case, the benefits of 3-operand
    103 form seem to fully translate to performance benefits. 
    104 Based on the reduction of overall Bitwise-SIMD instructions we expected an 11\% improvement in performance.
    105 Instead, perhaps bizarrely, the performance of Parabix2 in the 256-bit AVX implementation
    106 does not improve significantly and actually degrades for files with
    107 higher markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD instructions were reduced by 39\%,  saw a performance improvement of 8\%.
    108 We believe that this is primarily due to the intricacies of the first generation AVX implementation in \SB{},
    109 with significant latency in many of the 256-bit instructions in comparison to their
    110 128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput.   If these latency issues can be addressed
    111 in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix2.
     100benefits are another story.  As shown in Figure \ref{avx}, the
     101benefits of the reduced SIMD instruction count are achieved only in
     102the AVX 128-bit version.  In this case, the benefits of 3-operand form
     103seem to fully translate to performance benefits.  Based on the
     104reduction of overall Bitwise-SIMD instructions we expected an 11\%
     105improvement in performance.  Instead, perhaps bizarrely, the
     106performance of Parabix2 in the 256-bit AVX implementation does not
     107improve significantly and actually degrades for files with higher
     108markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD
     109instructions were reduced by 39\%, saw a performance improvement of 8\%.
     110We believe that this is primarily due to the intricacies of the first
     111generation AVX implementation in \SB{}, with significant latency in many
     112of the 256-bit instructions in comparison to their 128-bit
     113counterparts. The 256-bit instructions also have different scheduling
     114constraints that seem to reduce overall SIMD throughput.  If these
     115latency issues can be addressed in future AVX implementations, further
     116substantial performance and energy benefits could be realized in XML
     117parsing with Parabix2.
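
The edited text says that integer operations and shifts exist only in 128-bit form, so 256-bit versions must be simulated. The following is a minimal sketch of that idea in portable C, not the Parabix2 implementation, which this changeset does not show. The types `vec128` and `vec256` and all function names are hypothetical stand-ins for SIMD registers and instructions; no real intrinsics are used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model types: plain structs standing in for SIMD registers. */
typedef struct { uint64_t lane[2]; } vec128;  /* two 64-bit lanes  */
typedef struct { uint64_t lane[4]; } vec256;  /* four 64-bit lanes */

/* Stands in for an integer add that is available only at 128 bits. */
vec128 add64_128(vec128 a, vec128 b) {
    vec128 r = {{ a.lane[0] + b.lane[0], a.lane[1] + b.lane[1] }};
    return r;
}

/* Splitting and rejoining model the extra work a simulation needs. */
vec128 low_half(vec256 v) {
    vec128 r = {{ v.lane[0], v.lane[1] }};
    return r;
}

vec128 high_half(vec256 v) {
    vec128 r = {{ v.lane[2], v.lane[3] }};
    return r;
}

vec256 join_halves(vec128 lo, vec128 hi) {
    vec256 r = {{ lo.lane[0], lo.lane[1], hi.lane[0], hi.lane[1] }};
    return r;
}

/* Simulated 256-bit add: two 128-bit adds plus split and join steps. */
vec256 add64_256_simulated(vec256 a, vec256 b) {
    vec128 lo = add64_128(low_half(a), low_half(b));
    vec128 hi = add64_128(high_half(a), high_half(b));
    return join_halves(lo, hi);
}

/* Bitwise logic modeled as a single full-width operation. */
vec256 and_256(vec256 a, vec256 b) {
    vec256 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] & b.lane[i];
    }
    return r;
}

/* Self-check: lanes add independently, with unsigned wraparound. */
void check_simulated_add(void) {
    vec256 a = {{ 1, 2, 3, UINT64_MAX }};
    vec256 b = {{ 10, 20, 30, 1 }};
    vec256 r = add64_256_simulated(a, b);
    assert(r.lane[0] == 11 && r.lane[1] == 22 && r.lane[2] == 33);
    assert(r.lane[3] == 0);
}
```

In this model the simulated add still performs two 128-bit adds and pays for the split and join steps, while `and_256` counts as one operation. That is in line with the text's explanation that simulated 256-bit operations keep the instruction count from reaching the ideal 50\% reduction, although the real cost depends on the instruction sequence actually used.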