# Changeset 1408

Timestamp:
Aug 31, 2011, 4:40:04 PM (8 years ago)
Message:

edits and corrects to performance subsection

File:
1 edited

• ## docs/HPCA2012/06-scalability.tex

 r1407 \section{Parabix on various hardware} \section{Evaluating Parabix on Hardware} \label{section:scalability} \subsection{Performance} \label{section:scalability:intel} In this section, we study the performance of the XML parsers across three generations of Intel architectures.  Figure \ref{Scalability} (a) shows the average execution time of Parabix.  We analyze the execution time in terms of SIMD operations that operate on bitstreams (\textit{bit-space}) and scalar operations that perform post processing on the original character bytes.  In Parabix a significant fraction of the overall execution time is spent in SIMD operations. three generations of Intel architectures.  Figure \ref{ScalabilityA} shows the average execution time of Parabix-XML (over all workloads).  We analyze the execution time in terms of SIMD operations that operate on ``bit streams'' (\textit{bit-space}) and scalar operations that perform ``post processing'' on the original source bytes.  In Parabix-XML, a significant fraction of the overall execution time is spent on SIMD operations. Our results demonstrate that Parabix's optimizations are complementary to hardware improvements and seem to further improve the efficiency of newer microarchitectures.  For Parabix's bit-stream processing, \CITHREE{} results in an 40\% performance improvement over \CO{}, whereas \SB{} results in a 20\% improvement compared to \CITHREE{}. The improvements in the bit-space SIMD operations is stable across the different input files. Postprocessing operations demonstrate data dependent variance. \CITHREE{} gains between 27\%---40\% compared to \CO{} and \SB{} gains between 16\%---39\% Our results demonstrate that Parabix-XML's optimizations complement newer hardware improvements. For bit-stream processing, \CITHREE{} has a 40\% performance increase over \CO{}; similarly, \SB{} has a 20\% improvement compared to \CITHREE{}. These gains appear to be independent of the markup density of the input file.
Postprocessing operations demonstrate data dependent variance. Performance on the \CITHREE{} increases by 27\%--40\% compared to \CO{} whereas \SB{} increases by 16\%--29\% compared to \CITHREE{}. For the purpose of comparison, Figure \ref{Scalability} (b) shows the performance of the Expat parser; \CITHREE\ improves performance only by 5\% over \CO\ while \SB\ improves performance by less than 10\% over\CITHREE{}. Not that the \ref{ScalabilityB} shows the performance of the Expat parser. \CITHREE\ improves performance only by 29\% over \CO\ while \SB\ improves performance by less than 6\% over \CITHREE{}. Note that the gains of \CITHREE\ over \CO\ includes an improvement both in the clock frequency and microarchitecture improvements while \SB{}'s gains can Figure \ref{power_Parabix2} shows the average power consumption of Parabix over each workload and as executed on each of the processor cores --- \CO{}, \CITHREE\ and \SB{}.  Overall the last three generation of processors seem to bring with them 25---30\% improvement in power consumption with every generation. Parabix on \SB\ consumes less than 15W.  Overall, Parabix on \SB\ consumes 72\% to 75\% less energy than \CO{}. Parabix-XML over each workload and as executed on each of the processor cores: \CO{}, \CITHREE\ and \SB{}.  Each generation of processor seem to bring with them 25--30\% improvement in power consumption over the previous generation. Overall, Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}. \subfigure[Parabix]{ \includegraphics[width=0.40\textwidth]{plots/P2_scalability.pdf} \label{ScalabilityA} } \subfigure[Expat]{ \includegraphics[width=0.40\textwidth]{plots/Expat_scalability.pdf} \label{ScalabilityB} } \caption{Average Performance Parabix vs. 
Expat (y-axis: ns per kB)} \subsection{Parabix on Mobile processors} \label{section:neon} \label{section:scalability:\NEON{}} Our experience with the generation of Intel processors led us to contemplate about mobile processors such as the ARM \CORTEXA8\ which also includes SIMD units.  ARM NEON makes available a 128-bit SIMD also includes SIMD units.  ARM \NEON{} makes available a 128-bit SIMD instruction set similar in functionality to Intel SSE3 instruction set. In this section, we present our performance comparison of a NEON-based port of Parabix versus the Expat parser. Xerces is excluded \NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded from this portion of our study due to the complexity of the cross-platform build process for C++ applications. 512kB L2 shared cache.  Migration of Parabix to the Android platform began with the retargeting of a subset of the Parabix SIMD library for ARM NEON.  The majority of the Parabix SIMD functionality ported for ARM \NEON{}.  The majority of the Parabix SIMD functionality ported directly. However, for a small subset of the SIMD functions (e.g., bit packing) of NEON equivalents did not exist. In such cases we simply packing) of \NEON{} equivalents did not exist. In such cases we simply emulated logical equivalent instructions using the available the scalar instruction set. This library code was cross-compiled for moderately better than Expat.  Investigating causes for this performance degradation for Parabix led us to investigate the latency of Neon SIMD operations. of \NEON{} SIMD operations. \begin{figure}[!h] instruction latency for low markup density files; it appears that the latency of SIMD operations is relatively higher on the \CORTEXA8{} processor.  This is possibly because the Neon SIMD extensions are processor.  
This is possibly because the \NEON{} SIMD extensions are implemented as a coprocessor on \CORTEXA8{} which imposes higher overhead for applications that frequently inter-operate between scalar and SIMD registers. Future performance enhancement to ARM NEON that implement the Neon within the core microarchitecture could and SIMD registers. Future performance enhancement to ARM \NEON{} that implement the \NEON{} within the core microarchitecture could substantially improve the efficiency of Parabix.