# Changeset 1335 for docs/HPCA2012/07-avx.tex

Timestamp:
Aug 21, 2011, 4:20:30 PM (8 years ago)
Message:

Working on evaluation. Fixed Figure sizes

File:
1 edited

• ## docs/HPCA2012/07-avx.tex

r1302

\subsection{256-bit AVX Operations}

With the introduction of 256-bit SIMD registers, and under ideal conditions, one would anticipate a corresponding 50\% reduction in the SIMD instruction count of Parabix2 on AVX. However, in the \SB\ AVX implementation, Intel has focused primarily on floating point operations as opposed to integer-based operations. 256-bit SIMD is available for loads, stores, bitwise logic and floating point operations, whereas SIMD integer operations and shifts are only available in the 128-bit form. Nevertheless, with loads, stores and bitwise logic comprising a major portion of the Parabix2 SIMD instruction mix, a substantial reduction in instruction count and consequent performance improvement was anticipated but not achieved.

\subsection{Performance Results}

256-bit AVX technology. Note that, in each workload, the number of non-SIMD instructions remains relatively constant. As may be expected, however, the number of ``bitwise SIMD'' operations remains the same for both SSE and 128-bit AVX while dropping dramatically when operating on 256 bits at a time. Ideally, one might expect up to a 50\% reduction in these instructions versus the 128-bit AVX.
The actual reduction measured was 32\%--39\% depending on workload. Because some bitwise logic is needed in the implementation of simulated 256-bit operations, the full 50\% reduction in bitwise logic was not achieved. The ``other SIMD'' class shows a substantial 30\%--35\% reduction.

While the successive reductions in SIMD instruction counts are quite dramatic with the two AVX implementations of Parabix2, the performance benefits are another story. As shown in Figure \ref{avx}, the benefits of the reduced SIMD instruction count are achieved only in the AVX 128-bit version. In this case, the benefits of 3-operand form seem to fully translate to performance benefits. Based on the reduction of overall bitwise-SIMD instructions we expected an 11\% improvement in performance. Instead, perhaps bizarrely, the performance of Parabix2 in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD instructions reduced by 39\%, saw a performance improvement of 8\%. We believe that this is primarily due to the intricacies of the first-generation AVX implementation in \SB{}, with significant latency in many of the 256-bit instructions in comparison to their 128-bit counterparts.
The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput. If these latency issues can be addressed in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix2.
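
The text above notes that bitwise logic is available at the full 256-bit width, while integer operations exist only in 128-bit form and must therefore be simulated. The paper does not show how Parabix2 performs that simulation, so the following is only a minimal sketch of the general idea, written as a Python model over plain integers rather than real SIMD intrinsics. All names (`pack`, `unpack`, `and256`, `add64_128`, `add64_256_simulated`) are hypothetical and chosen for illustration; the choice of 64-bit fields is likewise an assumption.

```python
# Illustrative model only: registers are Python ints, not real SIMD registers.
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def pack(fields):
    """Pack 64-bit fields into one integer; fields[0] is least significant."""
    value = 0
    for i, f in enumerate(fields):
        value |= (f & MASK64) << (64 * i)
    return value


def unpack(value, count=4):
    """Split an integer back into `count` 64-bit fields."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def and256(a, b):
    """Bitwise AND needs no splitting: one operation covers all 256 bits."""
    return a & b & MASK256


def add64_128(a, b):
    """Model of a 128-bit integer op: add two 64-bit fields independently."""
    lo = ((a & MASK64) + (b & MASK64)) & MASK64
    hi = (((a >> 64) & MASK64) + ((b >> 64) & MASK64)) & MASK64
    return (hi << 64) | lo


def add64_256_simulated(a, b):
    """Simulated 256-bit add: split into halves, add each, recombine."""
    a_lo, a_hi = a & MASK128, (a >> 128) & MASK128
    b_lo, b_hi = b & MASK128, (b >> 128) & MASK128
    return (add64_128(a_hi, b_hi) << 128) | add64_128(a_lo, b_lo)
```

In this model the AND touches the whole value in one step, whereas the add requires extra split and recombine steps around two half-width operations. This is one plausible reading of why the measured reduction falls short of the ideal 50\%; it is not a statement about the actual Parabix2 instruction sequences or their cost on \SB{}.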