# Changeset 999

Timestamp:
Mar 25, 2011, 3:16:33 PM
Message:

Section 7

File:
1 edited

r993
\section{AVX}
\section{Scaling Parabix2 for AVX Technology}

In this section, we briefly highlight the improvements made in the Advanced Vector Extensions (AVX) extension to the x86 instruction set architecture and discuss the impact of these improvements on Parabix2. As neither Expat nor Xerces-C benefits from AVX, we do not discuss them in this section.
%The results of our experiments with the AVX and Sandy Bridge architecture can be seen in Figure \ref{avx}.

Parabix2 was originally developed for 128-bit SSE2 technology, which is widely available on all 64-bit Intel and AMD processors. In this section, we discuss how Parabix2 scales, and how it performs, when taking advantage of the new 256-bit AVX (Advanced Vector Extensions) technology that has just become commercially available in the latest Intel processors based on the Sandy Bridge microarchitecture.
% Following AMD's announcement of their SSE5 architecture, Intel announced their intention to develop the AVX

\begin{figure*}
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix2 Instruction Counts (y-axis: Instructions per Byte)}
\label{insmix}
\end{figure*}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/avx.pdf}
\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
\end{center}
\caption{Total CPU cycles /KB on AVX}
\caption{Parabix2 Performance (y-axis: CPU cycles per KB)}
\label{avx}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/InsMix.pdf}
\end{center}
\caption{Instructions per byte on Sandy Bridge}
\label{insmix}
\end{figure}

\subsection{Three Operand Form}

Originally, SIMD SSE instructions operated using a two-operand form. This meant that, given any SIMD instruction $a~\texttt{[op]}~b$, the result of that instruction would overwrite the value of $a$ or $b$.
Thus, whenever subsequent instructions used the values of both $a$ and $b$, one of them either had to be reconstructed or had to be recovered with an additional store and load operation. Utilizing the new VEX instruction coding scheme \textbf{[citation needed]}, Intel now allows the use of non-destructive three-operand operations in their SSE and AVX instruction sets. As shown in Figure \ref{insmix}, the total number of non-bitwise logic SIMD operations, which involve many memory movements, is 32\% to 34\% lower. Simply enabling three-operand form on the existing 128-bit SSE instructions reduced the overall cycle count by between 11.7\% and 13.5\%, as shown in Figure \ref{avx}. Although this is a one-time saving, it provides a significant performance improvement that traditional parsers cannot leverage: they gain nothing from the three-operand form, which was designed for the SIMD instruction set, and, as shown in Figure \ref{insmix}, the total number of non-vector instructions does not change.

In addition to the introduction of 256-bit operations, AVX technology also changes the structure of the base SSE instructions, moving from the destructive 2-operand form long used with SSE technologies to a nondestructive 3-operand form. In the 2-operand form, one register is used as both a source and a destination register, equivalent to the assignment $a = a~\texttt{[op]}~b$. Thus, whenever subsequent instructions use the values of both $a$ and $b$, one of them has to be copied beforehand, or reconstituted or reloaded afterwards, in order to recover the value. With 3-operand form, output may be directed to a third register independent of the source operands, as reflected by the assignment $c = a~\texttt{[op]}~b$. By avoiding the copying or reconstituting of operand values, a considerable reduction in instruction count may be possible.
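The effect of the two instruction forms on instruction count can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: the type `reg_t`, the helpers `mov2`, `and2`, `or2`, `and3`, `or3`, and the two counting functions are hypothetical names introduced here, a 64-bit integer stands in for a SIMD register, and no real SSE or AVX instructions are issued. It computes both $a~\texttt{AND}~b$ and $a~\texttt{OR}~b$ and counts the modeled instructions needed in each form.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register: 64 bits stand in for a wider SIMD register (assumption). */
typedef uint64_t reg_t;

/* Number of modeled instructions issued since the last reset. */
static int issued = 0;

/* Two-operand (destructive) forms: the first operand is also the destination. */
static void mov2(reg_t *dst, const reg_t *src) { *dst = *src; issued++; }
static void and2(reg_t *a, const reg_t *b)     { *a &= *b;    issued++; }
static void or2(reg_t *a, const reg_t *b)      { *a |= *b;    issued++; }

/* Three-operand (non-destructive) forms: the destination is independent. */
static void and3(reg_t *c, const reg_t *a, const reg_t *b) { *c = *a & *b; issued++; }
static void or3(reg_t *c, const reg_t *a, const reg_t *b)  { *c = *a | *b; issued++; }

/* Compute a&b and a|b with destructive instructions; returns the count. */
int count_two_operand(reg_t a, reg_t b, reg_t *and_out, reg_t *or_out) {
    assert(and_out && or_out);
    issued = 0;
    reg_t t;
    mov2(&t, &a);   /* copy a first, because the AND would overwrite it */
    and2(&t, &b);   /* t = a & b, a is still intact */
    or2(&a, &b);    /* a = a | b, last use of a so overwriting is fine */
    *and_out = t;
    *or_out = a;
    return issued;
}

/* Same computation with non-destructive instructions; returns the count. */
int count_three_operand(reg_t a, reg_t b, reg_t *and_out, reg_t *or_out) {
    assert(and_out && or_out);
    issued = 0;
    and3(and_out, &a, &b);  /* no copy needed, a and b both survive */
    or3(or_out, &a, &b);
    return issued;
}
```

In this model the destructive form needs one extra copy, so the same pair of results costs three modeled instructions rather than two. The actual savings in a compiled program depend on how often both operands remain live, which this sketch does not attempt to measure.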
AVX technology makes the 3-operand form available both with the new 256-bit operations and with the base 128-bit operations of SSE.

\subsection{256-bit Operations}

The AVX instruction set provided on Sandy Bridge allows the use of 256-bit SIMD registers. Ideally, we need only half as many SIMD instructions as the version that uses the SSE instruction set (three-operand form). Therefore, Parabix2 should be able to achieve a 50\% performance improvement on SIMD operations, which corresponds to a 26\% to 38\% improvement in total processing time, simply by using the AVX instruction set instead of the SSE instruction set. However, because Intel focused on implementing floating-point operations rather than integer-based operations, we gain only from bitwise logic operations and SIMD load operations. As shown in Figure \ref{insmix}, the total number of SIMD instructions executed with the AVX instruction set is 71\% to 79\% of the number executed with the SSE instruction set. The number of bitwise logic operations, which is expected to be 50\% lower, goes down by only 33\% to 39\%, because bitwise logic is also used to simulate 256-bit versions of operations that exist in SSE but are not provided by the AVX instruction set. As the total number of instructions goes down by 11\% to 23\%, we should be able to see less processing time and better performance. However, as shown in Figure \ref{avx}, the processing time is longer in every case except the one with 23\% fewer instructions. The reason is that AVX instructions have longer latency. (cite Agner Fog?)

With the introduction of 256-bit SIMD registers with AVX technology, one might ideally expect up to a 50\% reduction in the instruction count for the SIMD workload of Parabix2. However, in the Sandy Bridge implementation, Intel has focused on implementing floating-point operations as opposed to integer-based operations.
That is, 256-bit SIMD is available for loads, stores, bitwise logic and floating-point operations, while SIMD integer operations and shifts are available only in 128-bit form. Nevertheless, with loads, stores and bitwise logic comprising a major portion of the Parabix2 SIMD instruction mix, a substantial reduction in instruction count and a consequent performance improvement were anticipated.

\subsection{Performance Results}

We implemented two versions of Parabix2 using AVX technology. The first was simply a recompilation of the existing Parabix2 source code to take advantage of the 3-operand form of AVX instructions while retaining a uniform 128-bit SIMD processing width. The second involved rewriting core library functions for Parabix2 to use 256-bit AVX operations wherever possible and to simulate the remaining operations using pairs of 128-bit operations.

Figure \ref{insmix} shows the reduction in instruction counts achieved in these two versions. For each workload, the base instruction count of the Parabix2 binary compiled in SSE-only mode is shown with the label ``sse,'' the version obtained by simple recompilation with AVX mode enabled is labeled ``avx 128-bit,'' and the version reimplemented to use 256-bit operations wherever possible is labeled ``avx 256-bit.'' The instruction counts are divided into three classes. The ``non-SIMD'' operations are the general-purpose instructions that use neither SSE nor AVX technology. The ``bitwise SIMD'' class comprises the bitwise logic operations, which are available in both 128-bit form and 256-bit form. The ``other SIMD'' class comprises all other SIMD operations, primarily the integer SIMD operations that are available only at 128-bit widths even with 256-bit AVX technology.

Note that, in each workload, the number of non-SIMD instructions remains relatively constant across the three versions.
As may be expected, however, the number of ``bitwise SIMD'' operations remains the same for both SSE and 128-bit AVX while dropping dramatically when operating on 256 bits at a time. Ideally, one may expect up to a 50\% reduction in these instructions versus the 128-bit AVX version. The actual reduction measured was 32\%--39\%, depending on workload. Because some bitwise logic is needed in the implementation of simulated 256-bit operations, the full 50\% reduction in bitwise logic was not achieved.

The ``other SIMD'' class shows a substantial 30\%--35\% reduction with AVX 128-bit technology compared to SSE. This reduction is due to the elimination of copies or reloads when SIMD operations are compiled using the 3-operand AVX form rather than the 2-operand SSE form. A further 10\%--20\% reduction is observed with the Parabix2 version rewritten to use 256-bit operations.

While the successive reductions in SIMD instruction counts are quite dramatic with the two AVX implementations of Parabix2, the performance benefits are another story. As shown in Figure \ref{avx}, the benefits of the reduced SIMD instruction count are achieved only in the AVX 128-bit version. In this case, the benefits of the 3-operand form seem to translate fully into performance benefits. Perhaps surprisingly, the performance of Parabix2 in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density. We believe that this is primarily due to the current AVX implementation in Sandy Bridge, in which many of the 256-bit instructions have significant latency in comparison to their 128-bit counterparts. If these latency issues can be addressed in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix2.
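To make concrete why simulating a 256-bit operation with pairs of 128-bit operations can consume extra bitwise logic, the following sketch models one such case: a left shift of a whole 256-bit value built from two 128-bit halves. It is a portable illustration rather than the Parabix2 library code. The types `u128` and `u256` and the functions `shl128`, `shr128`, `or128` and `shl256` are hypothetical names introduced here, the 128-bit shifts are treated as if they were single primitive operations (an assumption of the model, not a claim about SSE), and plain 64-bit integers stand in for SIMD registers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one 128-bit register: two 64-bit limbs (assumption). */
typedef struct { uint64_t lo, hi; } u128;
/* A 256-bit value represented as a pair of 128-bit halves. */
typedef struct { u128 lo, hi; } u256;

/* Modeled 128-bit left shift, valid for 0 < n < 128. */
u128 shl128(u128 x, unsigned n) {
    assert(n > 0 && n < 128);
    u128 r;
    if (n < 64) {
        r.hi = (x.hi << n) | (x.lo >> (64 - n));
        r.lo = x.lo << n;
    } else {
        r.hi = x.lo << (n - 64);
        r.lo = 0;
    }
    return r;
}

/* Modeled 128-bit logical right shift, valid for 0 < n < 128. */
u128 shr128(u128 x, unsigned n) {
    assert(n > 0 && n < 128);
    u128 r;
    if (n < 64) {
        r.lo = (x.lo >> n) | (x.hi << (64 - n));
        r.hi = x.hi >> n;
    } else {
        r.lo = x.hi >> (n - 64);
        r.hi = 0;
    }
    return r;
}

/* Modeled 128-bit bitwise OR. */
u128 or128(u128 a, u128 b) {
    u128 r = { a.lo | b.lo, a.hi | b.hi };
    return r;
}

/* Simulated 256-bit left shift for 0 < n < 128: each half is shifted
 * separately, and the bits that cross from the low half into the high
 * half are merged with an extra OR. */
u256 shl256(u256 x, unsigned n) {
    assert(n > 0 && n < 128);
    u256 r;
    r.lo = shl128(x.lo, n);
    r.hi = or128(shl128(x.hi, n), shr128(x.lo, 128 - n));
    return r;
}
```

In this model one simulated 256-bit shift costs three 128-bit shifts plus one bitwise OR. Bitwise operations spent this way offset part of the saving obtained from performing ordinary bitwise logic at full 256-bit width.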