Timestamp:
Dec 14, 2011, 2:27:41 PM (8 years ago)
Author:
ashriram
Message:

Final pass

File:
1 edited

  • docs/HPCA2012/final_ieee/07-avx.tex

    r1778 r1783  
    63 63
    64 64
    65 Note that, in each workload, the number of non-SIMD instructions
    66 remains relatively constant with each implementation.  As expected,
    67 the number of bitwise SIMD operations remains the same
    68 for both SSE and 128-bit AVX while dropping dramatically when operating
    69 256-bits at a time. The reduction was measured at 32\%--39\% depending
    70 on markup density of the workload. The ``other SIMD'' class
    71 shows a substantial 30\%--35\% reduction with AVX 128-bit technology
    72 compared to SSE. This reduction is due to elimination of register
    73 unloading and reloading when SIMD operations are compiled using
    74 3-operand AVX form versus 2-operand SSE form.  A further 10\%--20\%
    75 reduction is also observed when Parabix-XML utilized the AVX runtime
    76 library.
     65 The number of non-SIMD instructions remains relatively constant with
     66 each implementation.  The number of bitwise SIMD
     67 operations remains the same for both SSE and 128-bit AVX while
     68 dropping dramatically when operating 256 bits at a time. The reduction
     69 was measured at 32\%--39\% depending on the markup density of the
     70 workload. The ``other SIMD'' class shows a substantial 30\%--35\%
     71 reduction with AVX 128-bit technology compared to SSE. This reduction
     72 is due to elimination of register unloading and reloading when SIMD
     73 operations are compiled using 3-operand AVX form versus 2-operand SSE
     74 form.  A further 10\%--20\% reduction is also observed when
     75 Parabix-XML uses the AVX runtime library.
    77 76
    78 77
    79 78 %[AS] Check numbers.
    80 The reductions in instruction counts are quite dramatic with the AVX
    81 extensions in Parabix demonstrating the ability of our runtime
    82 framework to exploit the available hardware resources. As shown in
    83 Figure \ref{avx}, the benefits of the reduced SIMD instruction count
    84 are achieved only in the AVX 128-bit version.  In this case, the
    85 benefits of 3-operand form seem to fully translate to performance
    86 benefits.  Based on the reduction of overall Bitwise-SIMD instructions
    87 we expected a 11\% improvement in performance. 
    88 Surprisingly, the performance of Parabix in the 256-bit AVX
    89 implementation does not improve significantly and actually degrades
    90 for files with higher markup density ($\sim11\%$). dew.xml, on
    91 which bitwise-SIMD instructions were reduced by 39\%, saw a performance
    92 improvement of 8\%.  We believe that this is primarily due to the
    93 intricacies of the first generation AVX implementation in \SB{}, with
    94 significant latency in many of the 256-bit instructions in comparison
    95 to their 128-bit counterparts. The 256-bit instructions also have
    96 different scheduling constraints that seem to reduce overall
    97 throughput.  If these latency issues can be addressed in future AVX
    98 implementations, further performance and energy benefits
    99 could be realized in Parabix-XML.
     79 The reductions in instruction counts are significant with the AVX
     80 extensions, demonstrating the ability of Parabix to
     81 exploit wider SIMD extensions. Figure
     82 \ref{avx} shows that the benefits of the reduced SIMD instruction count are
     83 achieved only in the AVX 128-bit version; the 3-operand form seems to translate fully into performance benefits.
     84 Based on the reduction of overall bitwise SIMD instructions we
     85 expected an 11\% improvement in performance.  Surprisingly, the
     86 performance of Parabix in the 256-bit AVX implementation does not
     87 improve significantly and actually degrades for files with higher
     88 markup density ($\sim11\%$). For dew.xml, on which bitwise SIMD
     89 instructions were reduced by 39\%, performance improved by
     90 8\%.  We believe that this is primarily due to the intricacies of the
     91 first-generation AVX implementation in \SB{}, with significant latency
     92 in many of the 256-bit instructions in comparison to their 128-bit
     93 counterparts. The 256-bit instructions also have different scheduling
     94 constraints that seem to reduce overall throughput.  If these latency
     95 issues can be addressed in future AVX implementations, further
     96 performance and energy benefits could be realized by Parabix.
    100 97
    101 98