# Changeset 1410

Timestamp:
Aug 31, 2011, 5:57:04 PM (8 years ago)
Message:

edits

File:
1 edited

\label{section:avx}
In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) Parabix-XML port. The Parabix runtime libraries originally targeted the 128-bit SSE2 SIMD technology, available on all modern 64-bit Intel and AMD processors. They were recently ported to AVX, which is commercially available on the latest \SB\ microarchitecture Intel processors. Although the Parabix runtime framework had to be ported to the new ISA, no modifications to Parabix-XML itself were needed.

\subsection{3-Operand Form}
In addition to widening the 128-bit operations to 256-bit, AVX technology uses a nondestructive 3-operand instruction format. Previous SSE implementations used a destructive 2-operand instruction format. In the 2-operand format, a single register is used as both a source and destination register. For example, $a = a~\texttt{[op]}~b$. As such, 2-operand instructions that require the value of both $a$ and $b$ must either copy an additional register value beforehand, or reconstitute or reload a register value afterwards to recover the value. With the 3-operand format, output may now be directed to a third register independently of the source operands. For example, $c = a~\texttt{[op]}~b$.
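To make the difference between the two formats concrete, the following Python sketch models a toy register machine. It is not Parabix code: the helper names (`emit_two_operand`, `emit_three_operand`, `lower`), the register names, and the three-step program are hypothetical, and the mnemonics serve only as labels. Under these assumptions, it counts the instructions emitted when the source operands must stay intact.

```python
# Toy model of destructive 2-operand vs. nondestructive 3-operand lowering.
# All names here are hypothetical; mnemonics serve only as labels.

def emit_two_operand(dst, src_a, src_b, op):
    """Lower dst = src_a op src_b when the ISA computes dst = dst op src."""
    if dst == src_b:
        raise ValueError("toy model assumes dst differs from the second source")
    if dst == src_a:
        # The first source may be overwritten, so no copy is needed.
        return [f"{op} {dst}, {src_b}"]
    # Preserve src_a by copying it into dst before the destructive op.
    return [f"movdqa {dst}, {src_a}", f"{op} {dst}, {src_b}"]


def emit_three_operand(dst, src_a, src_b, op):
    """Lower dst = src_a op src_b when the destination is a separate operand."""
    return [f"v{op} {dst}, {src_a}, {src_b}"]


def lower(program, emit):
    """Concatenate the instructions emitted for each (dst, a, op, b) step."""
    instructions = []
    for dst, src_a, op, src_b in program:
        instructions.extend(emit(dst, src_a, src_b, op))
    return instructions


# t1 = a & b; t2 = a ^ c; t1 = t1 | t2, with a, b, and c kept live.
PROGRAM = [
    ("t1", "a", "pand", "b"),
    ("t2", "a", "pxor", "c"),
    ("t1", "t1", "por", "t2"),
]

two_operand = lower(PROGRAM, emit_two_operand)
three_operand = lower(PROGRAM, emit_three_operand)
print(len(two_operand), len(three_operand))
```

In this toy program the 2-operand lowering needs two extra register copies (5 instructions versus 3). Savings in real code depend on the compiler's register allocation and on which values remain live.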
By avoiding the need to copy or reconstitute operand values, a considerable reduction is achieved in the instructions required for unloading from and loading into registers. AVX technology makes available the 3-operand form for both \ldots

\subsection{Performance Results}
We implemented two versions of Parabix-XML using AVX technology. The first was simply a recompilation of the existing Parabix-XML source code to take advantage of the 3-operand form of AVX instructions while retaining a uniform 128-bit SIMD processing width. The second involved rewriting the Parabix runtime library to leverage the 256-bit AVX instructions wherever possible and to simulate the remaining operations using pairs of 128-bit operations. Figure \ref{insmix} shows the reduction in instruction counts achieved in these two versions. For each workload, the base instruction count of the Parabix binary compiled in 2-operand SSE-only mode is indicated by ``sse;'' the version that only takes advantage of the AVX 3-operand mode is labeled ``128-bit avx;'' and the version that uses the 256-bit operations wherever possible is labeled ``256-bit avx.''

The instruction counts are divided into three classes. The ``non-SIMD'' operations are the general purpose instructions. The ``bitwise SIMD'' class comprises the bitwise logic operations that are available in both 128-bit form and 256-bit form (excluding bitwise shifts, which are only available in 128-bit form). The ``other SIMD'' class comprises all other SIMD operations, primarily the integer SIMD operations that are available only at 128-bit widths even under \ldots

Note that, for each workload, the number of non-SIMD instructions remains relatively constant. As expected, the number of bitwise SIMD operations remains the same for both SSE and 128-bit AVX while dropping dramatically when operating 256 bits at a time. The reduction was measured at 32\%--39\%, depending on the markup density of the workload. The ``other SIMD'' class shows a substantial 30\%--35\% reduction with AVX 128-bit technology compared to SSE. This reduction is due to the elimination of register unloading and reloading when SIMD operations are compiled using 3-operand AVX form versus 2-operand SSE form. A further 10\%--20\% reduction is also observed when Parabix-XML uses the AVX runtime library.
%[AS] Check numbers.

\ldots bizarrely, the performance of Parabix in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density ($\sim$11\%). The file dew.xml, on which bitwise SIMD instructions were reduced by 39\%, saw a performance improvement of 8\%. We believe that this is primarily due to the different scheduling constraints that seem to reduce overall throughput. If these latency issues can be addressed in future AVX implementations, further performance and energy benefits could be realized in Parabix-XML.
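The idea of simulating a 256-bit operation with a pair of 128-bit operations can be sketched in Python, using arbitrary-precision integers as stand-ins for SIMD registers. This is only an illustration under stated assumptions: the function names (`split256`, `join256`, `and256_via_128`, `shl256_via_128`) are hypothetical, and the carry handling in the shift is one plausible scheme, not necessarily what the Parabix runtime library does.

```python
# Model 256-bit values as Python ints and 128-bit "registers" as their halves.
# Hypothetical helper names; this is not the Parabix runtime library.

MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def split256(x):
    """Return the (low, high) 128-bit halves of a 256-bit value."""
    return x & MASK128, (x >> 128) & MASK128


def join256(lo, hi):
    """Reassemble a 256-bit value from its 128-bit halves."""
    return (hi << 128) | lo


def and256_via_128(x, y):
    """Bitwise AND needs no cross-half communication: two independent ops."""
    x_lo, x_hi = split256(x)
    y_lo, y_hi = split256(y)
    return join256(x_lo & y_lo, x_hi & y_hi)


def shl256_via_128(x, n):
    """Shift left by n bits (0 < n < 128), truncated to 256 bits.

    Bits leaving the low half must be carried into the high half, so the
    simulation costs more than two 128-bit operations.
    """
    if not 0 < n < 128:
        raise ValueError("sketch only handles 0 < n < 128")
    lo, hi = split256(x)
    carry = lo >> (128 - n)                  # bits crossing the half boundary
    new_lo = (lo << n) & MASK128
    new_hi = ((hi << n) & MASK128) | carry
    return join256(new_lo, new_hi)


x = (0x0123456789ABCDEF << 192) | (0xF << 124) | 0xDEADBEEF
y = MASK256 ^ (0xFF << 120)
print(and256_via_128(x, y) == (x & y))
print(shl256_via_128(x, 5) == ((x << 5) & MASK256))
```

The bitwise AND splits cleanly into two independent half-width operations, whereas the shift needs an extra extract-and-merge step for the bits that cross the 128-bit boundary. This is consistent with the text's observation that operations available only in 128-bit form limit the instruction-count reduction.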