source: docs/PACT2011/07-avx.tex @ 991

Last change on this file since 991 was 983, checked in by lindanl, 9 years ago

\section{AVX}

In this section, we briefly highlight the improvements introduced by the Advanced Vector Extensions (AVX) to the x86 instruction set architecture and discuss their impact on Parabix2. As neither Expat nor Xerces-C benefits from AVX, we do not discuss them in this section.
%The results of our experiments with the AVX and Sandy Bridge architecture can be seen in Figure \ref{avx}.

% Following AMD's announcement of their SSE5 architecture, Intel announced their intention to develop the AVX

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/avx.pdf}
\end{center}
\caption{Total CPU cycles per KB on AVX}
\label{avx}
\end{figure}

\subsection{Three Operand Form}

Originally, SSE SIMD instructions used a two-operand form: for any SIMD instruction $a~\texttt{[op]}~b$, the result overwrote the value of either $a$ or $b$. Thus, whenever subsequent instructions needed the values of both $a$ and $b$, the overwritten operand had to be either reconstructed or recovered with an additional store and load operation. Using the new VEX instruction encoding scheme \textbf{[citation needed]}, Intel now supports non-destructive three-operand operations in both the SSE and AVX instruction sets. As shown in Figure \ref{avx}, simply enabling three-operand form for the existing 128-bit SSE instructions reduced the overall cycle count by between 11.7\% and 13.5\%. Although this is a one-time saving, it is a significant performance improvement that traditional parsers cannot leverage.
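
The difference between the two forms can be illustrated with a minimal sketch. The helper \texttt{xor\_via\_and\_or} below is hypothetical and is not taken from the Parabix2 sources; it is chosen only because both inputs are needed twice. The instruction sequences in the comment are what a compiler might plausibly emit for each encoding, not measured output, and the scalar branch is a fallback we assume for targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not from Parabix2): computes a XOR b as
 * (a | b) & ~(a & b), so that both inputs are needed twice.
 *
 * With the legacy two-operand SSE encoding, a compiler might emit:
 *     movdqa xmm2, xmm0        ; extra copy, because pand overwrites its destination
 *     pand   xmm2, xmm1        ; xmm2 = a & b
 *     por    xmm0, xmm1        ; xmm0 = a | b   (a is overwritten)
 *     pandn  xmm2, xmm0        ; xmm2 = ~(a & b) & (a | b)
 * With the three-operand VEX encoding, no copy is needed:
 *     vpand  xmm2, xmm0, xmm1
 *     vpor   xmm3, xmm0, xmm1
 *     vpandn xmm2, xmm2, xmm3
 * Actual register allocation and scheduling will vary by compiler.
 */
void xor_via_and_or(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i and_ab = _mm_and_si128(va, vb);
    __m128i or_ab  = _mm_or_si128(va, vb);
    /* _mm_andnot_si128(x, y) computes (~x) & y. */
    _mm_storeu_si128((__m128i *)out, _mm_andnot_si128(and_ab, or_ab));
#else
    /* Scalar fallback (assumption: used only where SSE2 is unavailable). */
    for (int i = 0; i < 2; ++i)
        out[i] = (a[i] | b[i]) & ~(a[i] & b[i]);
#endif
}
```

The C source is identical for both encodings; the choice is made by the compiler. Compiling the same file with \texttt{gcc -O2 -S}, once with \texttt{-msse2} and once with \texttt{-mavx}, and comparing the generated assembly is one way to observe the effect.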

\subsection{256-bit Operations}

Although the AVX instruction set provided on Sandy Bridge supports 256-bit SIMD registers, Intel focused on implementing floating-point operations rather than integer operations. This proved to be a significant challenge when porting Parabix2 from the 128-bit SSE to the 256-bit AVX instruction set. Even though we foresaw a gain in memory throughput, many of the 128-bit SSE instructions used in Parabix2 have no corresponding 256-bit AVX instruction. Bitwise logic, which represented $30\%$ of the executed instructions in our test cases \textbf{[need more accurate figures here]}, was ported directly to pure AVX. The remaining $70\%$ of the instructions had to be simulated by splitting each 256-bit register into two 128-bit registers, performing the SSE version of the operation on each half, and then combining the results back into a 256-bit register. As Figure \ref{avx} shows, relative to the three-operand SSE implementation, this yielded only a 0.4\% improvement for dew.xml, which had the lowest markup density and therefore executed the fewest simulated 256-bit instructions, and incurred a performance penalty in the other four test cases. We expect that a significant performance improvement would be possible if future implementations of AVX incorporated integer shift and arithmetic operations. %Additionally, if we could efficiently switch between two- and three-operand form
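
The two porting strategies described above can be sketched as follows. Both helpers, \texttt{and\_256} and \texttt{add\_epi64\_256}, are hypothetical names introduced for illustration and do not come from the Parabix2 sources; routing bitwise logic through the floating-point AND instruction is one possible way to express a 256-bit AND in AVX, and we assume rather than know that Parabix2 does something similar. A 64-bit lane-wise add stands in for the simulated integer operations, and the scalar branches are assumed fallbacks for builds without AVX enabled.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: 256-bit bitwise AND. The first AVX generation has no
 * 256-bit integer AND, but the floating-point logic instruction operates on
 * the same bits, so the operation ports directly. */
void and_256(const uint64_t a[4], const uint64_t b[4], uint64_t out[4])
{
    assert(a && b && out);
#if defined(__AVX__)
    __m256 va = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)a));
    __m256 vb = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)b));
    _mm256_storeu_si256((__m256i *)out,
                        _mm256_castps_si256(_mm256_and_ps(va, vb)));
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] & b[i];
#endif
}

/* Hypothetical helper: lane-wise 64-bit add on 256 bits, simulated with two
 * 128-bit SSE2 adds because AVX offers no 256-bit integer add. */
void add_epi64_256(const uint64_t a[4], const uint64_t b[4], uint64_t out[4])
{
    assert(a && b && out);
#if defined(__AVX__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Split each 256-bit register into its low and high 128-bit halves. */
    __m128i a_lo = _mm256_castsi256_si128(va);
    __m128i a_hi = _mm256_extractf128_si256(va, 1);
    __m128i b_lo = _mm256_castsi256_si128(vb);
    __m128i b_hi = _mm256_extractf128_si256(vb, 1);
    /* Perform the SSE2 version of the operation on each half. */
    __m128i r_lo = _mm_add_epi64(a_lo, b_lo);
    __m128i r_hi = _mm_add_epi64(a_hi, b_hi);
    /* Combine the two results back into one 256-bit register. */
    __m256i r = _mm256_insertf128_si256(_mm256_castsi128_si256(r_lo), r_hi, 1);
    _mm256_storeu_si256((__m256i *)out, r);
#else
    /* Scalar fallback: same lane-wise semantics, no carry between lanes. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

In this sketch the simulated add needs two extractions and one insertion around the two 128-bit adds, whereas the AND is a single 256-bit instruction. That extra work per simulated operation is consistent with the overhead described above, although the sketch itself makes no claim about cycle counts.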