Changeset 993


Timestamp:
Mar 24, 2011, 11:31:48 PM (8 years ago)
Author:
lindanl
Message:

section 7

Location:
docs/PACT2011
Files:
1 added
1 edited

  • docs/PACT2011/07-avx.tex

    r983 r993  
    11\section{AVX}
    22
    3 In this section, we briefly highlight the improvements made in the Advanced Vector Extensions (AVX) extension to the x86 instruction set architecture and discuss the impact of these improvements on Parabix2. As neither Expat nor Xerces-C benefit from AVX, we do not discuss them in this section.
     3In this section, we briefly highlight the improvements made in the Advanced Vector Extensions (AVX) extension to the x86 instruction set architecture
     4and discuss the impact of these improvements on Parabix2. As neither Expat nor Xerces-C benefits from AVX, we do not discuss them in this section.
    45%The results of our experiments with the AVX and Sandy Bridge architecture can be seen in Figure \ref{avx}.
    56
     
    1415\end{figure}
    1516
     17\begin{figure}
     18\begin{center}
     19\includegraphics[width=85mm]{plots/InsMix.pdf}
     20\end{center}
     21\caption{Instructions per byte on Sandy Bridge}
     22\label{insmix}
     23\end{figure}
     24
    1625\subsection{Three Operand Form}
    1726
    18 Originally, SIMD SSE instructions operated using a two-operand form. This meant that given any SIMD instruction $a~\texttt{[op]}~b$ the result of that instruction would replace the value of $a$ or $b$ with the result. Thus whenever the subsequent instructions used the value of both $a$ and $b$, one of them had to be either reconstructed, or an additional store and load operation was required to recover that value. Utilizing the new VEX instruction coding scheme \textbf{[citation needed]}, Intel now allows the use of non-destructive three-operand operations in their SSE and AVX instruction sets. As shown in Figure \ref{avx}, simply enabling three-operand form on the existing 128-bit SSE instructions reduced the overall cycle count by between 11.7\% and 13.5\%. While this is a one-time savings, it provided a significant performance improvement that traditional parsers cannot leverage.
     27Originally, SIMD SSE instructions operated using a two-operand form.
     28This meant that, for any SIMD instruction $a~\texttt{[op]}~b$, the result would overwrite the value of $a$ or $b$.
     29Thus whenever the subsequent instructions used the value of both $a$ and $b$, one of them had to be either reconstructed,
     30or an additional store and load operation was required to recover that value.
     31Utilizing the new VEX instruction coding scheme \textbf{[citation needed]},
     32Intel now allows the use of non-destructive three-operand operations in their SSE and AVX instruction sets.
     33As shown in Figure \ref{insmix}, the total number of non-bitwise-logic SIMD operations, which involve many memory movements, is reduced by 32\% to 34\%.
     34As shown in Figure \ref{avx}, simply enabling three-operand form on the existing 128-bit SSE instructions reduced the overall cycle count by 11.7\% to 13.5\%.
     35While this is a one-time saving, it is a significant performance improvement that traditional parsers cannot leverage: they do not benefit from the three-operand form, which is designed for the SIMD instruction set, and, as shown in Figure \ref{insmix}, the total number of non-vector instructions does not change.
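The cost of the destructive two-operand form can be illustrated with a small model. The following Python sketch is not Parabix2 code: the \texttt{Machine} class, its method names, and the register names are hypothetical and introduced only for illustration. It computes the bitwise sum ($a \oplus b$) and carry ($a \wedge b$) of two values while keeping both inputs live, and counts the instructions executed in each form.

```python
import operator


class Machine:
    """Toy register machine that counts executed instructions (illustrative only)."""

    def __init__(self, **initial):
        self.regs = dict(initial)
        self.executed = 0

    def mov(self, dst, src):
        # Register-to-register copy.
        self.regs[dst] = self.regs[src]
        self.executed += 1

    def op2(self, op, dst, src):
        # Two-operand (destructive) form: dst = dst op src.
        self.regs[dst] = op(self.regs[dst], self.regs[src])
        self.executed += 1

    def op3(self, op, dst, src1, src2):
        # Three-operand (non-destructive) form: dst = src1 op src2.
        self.regs[dst] = op(self.regs[src1], self.regs[src2])
        self.executed += 1


def sum_carry_two_operand(a, b):
    """Sum and carry with destructive ops; a and b must stay intact."""
    m = Machine(a=a, b=b)
    m.mov("s", "a")                   # extra copy to preserve a
    m.op2(operator.xor, "s", "b")     # s = s ^ b
    m.mov("c", "a")                   # second extra copy
    m.op2(operator.and_, "c", "b")    # c = c & b
    return m.regs["s"], m.regs["c"], m.executed


def sum_carry_three_operand(a, b):
    """Same computation with non-destructive ops; no copies are needed."""
    m = Machine(a=a, b=b)
    m.op3(operator.xor, "s", "a", "b")   # s = a ^ b
    m.op3(operator.and_, "c", "a", "b")  # c = a & b
    return m.regs["s"], m.regs["c"], m.executed
```

In this model the two-operand version executes four instructions and the three-operand version two. The model only counts instructions; it says nothing about cycle counts on real hardware.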
    1936
    2037\subsection{256-bit Operations}
    2138
    22 Although the AVX instruction set provided on the Sandy Bridge allows the use of 256-bit SIMD registers, Intel focused on implementing floating point operations as opposed to the integer based operations. This proved to be a significant challenge when porting Parabix2 from the 128-bit SSE to the 256-bit AVX instruction set. Even though we forsaw a gain in terms of memory throughput, many of the 128-bit SSE instructions used in Parabix2 did not have a corresponding 256-bit AVX instruction. Bitwise logic, which represented $30\%$ of the executed instructions in our test cases \textbf{[need more accurate figures here]}, was directly ported into pure AVX. The remaining $70\%$ of the instructions had to be simulated by breaking the 256-bit register into two 128-bit registers, performing the SSE version of the operation on both registers then combining the results back into the 256-bit register. As Figure \ref{avx} shows, this resulted in only a 0.4\% improvement in the case of dew.xml---which had the lowest markup density and therefore executed the fewest simulated 256-bit instructions---over the three-operand SSE implementation but incurred a performance penalty in the other four test cases. We expect that we could gain a significant performance improvement if future implementations of AVX incorporated integer-based shift and arithmetic operations. %Additionally, if we could efficiently switch between two- and three-operand form
     39The AVX instruction set provided on the Sandy Bridge allows the use of 256-bit SIMD registers.
     40Ideally, we would need only half as many SIMD instructions as the version that uses the SSE instruction set (three-operand form).
     41Therefore, Parabix2 should be able to achieve a 50\% performance improvement on SIMD operations, which means a 26\% to 38\% improvement in total processing time simply by using the AVX instruction set instead of the SSE instruction set.
     42However, because Intel focused on implementing floating point operations as opposed to integer-based operations, we only gain from bitwise logic operations and SIMD load operations.
     43As shown in Figure \ref{insmix}, the total number of SIMD instructions executed with the AVX instruction set is 71\% to 79\% of the number executed with the SSE instruction set.
     44The number of bitwise logic operations, which is expected to be 50\% lower, only goes down by 33\% to 39\% because bitwise logic is also used to simulate 256-bit versions of operations that exist in SSE but are not provided by the AVX instruction set.
     45As the total number of instructions goes down by 11\% to 23\%, we would expect shorter processing time and better performance.
     46However, as shown in Figure \ref{avx}, the processing time is longer in every case except the one with 23\% fewer instructions.
     47The reason is that AVX instructions have longer latency. (cite Agner Fog?)
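The 26\% to 38\% estimate can be reproduced with a short calculation. As an assumption not stated in the text, let $s$ denote the fraction of total processing time attributable to SIMD operations in the three-operand SSE version, and let $\Delta T$ denote the fraction of total time saved when the SIMD instruction count is halved at equal cost per instruction.

```latex
\begin{align*}
\Delta T &= \tfrac{1}{2}\, s, \\
0.26 \le \Delta T \le 0.38 &\iff 0.52 \le s \le 0.76 .
\end{align*}
```

Under this simplified model, the stated range corresponds to SIMD operations accounting for roughly 52\% to 76\% of processing time, depending on the test file. The model ignores instruction latency.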
     48
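Simulating a 256-bit integer operation that AVX does not provide can be sketched as follows. This is a minimal Python model, not Parabix2 code: 256-bit registers are represented as Python integers, all function names are hypothetical, and lane-wise 64-bit addition is chosen only as an example of an integer operation that SSE2 provides at 128 bits. The 256-bit value is split into two 128-bit halves, the 128-bit operation is applied to each half, and the results are recombined.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def pack_lanes(lanes):
    """Pack 64-bit lanes into one integer, lanes[0] in the lowest bits."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & MASK64) << (64 * i)
    return value


def add64_128(x, y):
    """Lane-wise 64-bit add on 128-bit values (models a 128-bit packed add)."""
    lo = ((x & MASK64) + (y & MASK64)) & MASK64
    hi = (((x >> 64) & MASK64) + ((y >> 64) & MASK64)) & MASK64
    return (hi << 64) | lo


def split256(v):
    """Split a 256-bit value into (low, high) 128-bit halves."""
    return v & MASK128, (v >> 128) & MASK128


def combine256(lo, hi):
    """Recombine two 128-bit halves into a 256-bit value."""
    return (hi << 128) | lo


def add64_256_simulated(x, y):
    """Simulate a 256-bit lane-wise add using two 128-bit adds."""
    x_lo, x_hi = split256(x)
    y_lo, y_hi = split256(y)
    return combine256(add64_128(x_lo, y_lo), add64_128(x_hi, y_hi))
```

In this model, one simulated 256-bit operation costs two 128-bit operations plus the split and recombine steps, which is why it gives no instruction-count advantage over the SSE version.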