Changeset 1120 for docs

Timestamp:
Apr 11, 2011, 4:45:23 PM (9 years ago)
Message:

Minor edits.

File:
1 edited

r1116
\section{Scaling Parabix2 for AVX}

In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) port of Parabix2. Parabix2 originally targeted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors, but has recently been ported to AVX. AVX technology is commercially available on the latest Intel processors based on the \SB\ microarchitecture.

\begin{figure*}

\subsection{Three Operand Form}

In addition to widening 128-bit operations to 256-bit operations, AVX technology uses a nondestructive 3-operand instruction format. Previous SSE implementations used a destructive 2-operand instruction format. In the 2-operand format, a single register is used as both a source and destination register, as in the assignment $a = a~\texttt{[op]}~b$.
As such, when subsequent instructions require the values of both $a$ and $b$, one of them must either be copied to an additional register beforehand, or be reconstituted or reloaded afterwards. With the 3-operand format, output may instead be directed to a third register independent of the source operands, as in the assignment $c = a~\texttt{[op]}~b$. By avoiding the copying or reconstituting of operand values, a considerable reduction in instruction count, in the form of fewer load and store instructions, is possible. AVX technology makes the 3-operand form available for both the new 256-bit operations and the base 128-bit SSE operations.

\subsection{256-bit AVX Operations}

With the introduction of 256-bit SIMD registers, and under ideal conditions, one would anticipate a corresponding 50\% reduction in the SIMD instruction count of Parabix2 on AVX. However, in the \SB\ AVX implementation, Intel has focused primarily on floating point operations as opposed to integer operations. 256-bit SIMD is available for loads, stores, bitwise logic and floating point operations, whereas SIMD integer operations and shifts are only available in 128-bit form.
Nevertheless, with loads, stores and bitwise logic comprising a major portion of the Parabix2 SIMD instruction mix, a substantial reduction in instruction count and a consequent performance improvement were anticipated but not achieved.

\subsection{Performance Results}

We implemented two versions of Parabix2 using AVX technology. The first was simply a recompilation of the existing Parabix2 source code to take advantage of the 3-operand form of AVX instructions while retaining a uniform 128-bit SIMD processing width. The second involved rewriting the core library functions of Parabix2 to leverage the 256-bit AVX operations wherever possible and to simulate the remaining operations using pairs of 128-bit operations.
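
The idea behind the second version can be illustrated with a minimal, portable C sketch. It is not the Parabix2 library code and uses no SIMD intrinsics: the types \texttt{block128} and \texttt{block256} and the functions \texttt{shl64\_128}, \texttt{and\_256} and \texttt{shl64\_256} are hypothetical names introduced here only for illustration. The sketch assumes a 256-bit block is modelled as two 128-bit halves, with bitwise logic applied across the full width and a per-lane shift forwarded to each half in turn.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit SIMD register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } block128;

/* Hypothetical 256-bit block, modelled as a low and a high 128-bit half. */
typedef struct { block128 half[2]; } block256;

/* 128-bit operation: shift each 64-bit lane left by n bits.
   Lanes are independent; bits do not carry from one lane to the next. */
static block128 shl64_128(block128 a, unsigned n) {
    block128 r;
    assert(n < 64); /* shifting a 64-bit value by 64 or more is undefined in C */
    r.lane[0] = a.lane[0] << n;
    r.lane[1] = a.lane[1] << n;
    return r;
}

/* Bitwise AND over the full 256 bits: the kind of operation for which
   the text says a native 256-bit form is available. */
static block256 and_256(block256 a, block256 b) {
    block256 r;
    for (int h = 0; h < 2; h++)
        for (int l = 0; l < 2; l++)
            r.half[h].lane[l] = a.half[h].lane[l] & b.half[h].lane[l];
    return r;
}

/* Per-lane shift over 256 bits, simulated with a pair of 128-bit
   operations, one for each half of the block. */
static block256 shl64_256(block256 a, unsigned n) {
    block256 r;
    r.half[0] = shl64_128(a.half[0], n);
    r.half[1] = shl64_128(a.half[1], n);
    return r;
}
```

In this model the simulated shift costs two 128-bit operations, the same as processing two 128-bit blocks separately, so under these assumptions only the operations with a genuine 256-bit form can contribute to a lower instruction count.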