Changeset 1408 for docs


Timestamp:
Aug 31, 2011, 4:40:04 PM (8 years ago)
Author:
ksherdy
Message:

edits and corrects to performance subsection

File:
1 edited

  • docs/HPCA2012/06-scalability.tex

    r1407 r1408  
    1 \section{Parabix on various hardware}
     1\section{Evaluating Parabix on Hardware}
    22\label{section:scalability}
    33\subsection{Performance}
     4\label{section:scalability:intel}
    45In this section, we study the performance of the XML parsers across
    5 three generations of Intel architectures.  Figure \ref{Scalability}
    6 (a) shows the average execution time of Parabix.  We analyze the
    7 execution time in terms of SIMD operations that operate on bitstreams
    8 (\textit{bit-space}) and scalar operations that perform post
    9 processing on the original character bytes.  In Parabix a significant
    10 fraction of the overall execution time is spent in SIMD operations. 
     6three generations of Intel architectures.  Figure \ref{ScalabilityA}
     7shows the average execution time of Parabix-XML (over all workloads).  We analyze the
     8execution time in terms of SIMD operations that operate on ``bit streams''
     9(\textit{bit-space}) and scalar operations that perform ``post
     10processing'' on the original source bytes.  In Parabix-XML, a significant
     11fraction of the overall execution time is spent on SIMD operations. 
    1112
    12 Our results demonstrate that Parabix's optimizations are complementary
    13 to hardware improvements and seem to further improve the efficiency of
    14 newer microarchitectures.  For Parabix's bit-stream processing,
    15 \CITHREE{} results in an 40\% performance improvement over \CO{},
    16 whereas \SB{} results in a 20\% improvement compared to
    17 \CITHREE{}. The improvements in the bit-space SIMD operations is
    18 stable across the different input files. Postprocessing operations
    19 demonstrate data dependent variance. \CITHREE{} gains between
    20 27\%---40\% compared to \CO{} and \SB{} gains between 16\%---39\%
     13Our results demonstrate that Parabix-XML's optimizations complement
     14newer hardware improvements. For bit-stream processing,
     15\CITHREE{} has a 40\% performance increase over \CO{};
     16similarly, \SB{} has a 20\% improvement compared to
     17\CITHREE{}. These gains appear to be independent of the markup
     18density of the input file.
     19Postprocessing operations
     20demonstrate data dependent variance. Performance on the \CITHREE{} increases by
     2127\%--40\% compared to \CO{} whereas \SB{} increases by 16\%--29\%
    2122compared to \CITHREE{}. For the purpose of comparison, Figure
    22 \ref{Scalability} (b) shows the performance of the Expat parser;
    23 \CITHREE\ improves performance only by 5\% over \CO\ while \SB\
    24 improves performance by less than 10\% over\CITHREE{}. Not that the
     23\ref{ScalabilityB} shows the performance of the Expat parser.
     24\CITHREE\ improves performance only by 29\% over \CO\ while \SB\
     25improves performance by less than 6\% over \CITHREE{}. Note that the
    2526gains of \CITHREE\ over \CO\ include improvements in both clock
    2627frequency and microarchitecture, while \SB{}'s gains can
     
    2829
    2930Figure \ref{power_Parabix2} shows the average power consumption of
    30 Parabix over each workload and as executed on each of the processor
    31 cores --- \CO{}, \CITHREE\ and \SB{}.  Overall the last three
    32 generation of processors seem to bring with them 25---30\% improvement
    33 in power consumption with every generation. Parabix on \SB\ consumes
    34 less than 15W.  Overall, Parabix on \SB\ consumes 72\% to 75\% less
    35 energy than \CO{}.
     31Parabix-XML over each workload and as executed on each of the processor
     32cores: \CO{}, \CITHREE\ and \SB{}.  Each
     33generation of processors seems to bring with it a 25--30\% improvement
     34in power consumption over the previous generation. Overall,
     35Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}.
    3636
    3737
     
    4040\subfigure[Parabix]{
    4141\includegraphics[width=0.40\textwidth]{plots/P2_scalability.pdf}
     42\label{ScalabilityA}
    4243}
    4344\subfigure[Expat]{
    4445\includegraphics[width=0.40\textwidth]{plots/Expat_scalability.pdf}
     46\label{ScalabilityB}
    4547}
    4648\caption{Average Performance Parabix vs. Expat (y-axis: ns per kB)}
     
    6769
    6870\subsection{Parabix on Mobile processors}
    69 \label{section:neon}
     71\label{section:scalability:\NEON{}}
    7072Our experience with successive generations of Intel processors led us to
    7173consider mobile processors such as the ARM \CORTEXA8\ which
    72 also includes SIMD units.  ARM NEON makes available a 128-bit SIMD
     74also includes SIMD units.  ARM \NEON{} makes available a 128-bit SIMD
    7375instruction set similar in functionality to the Intel SSE3 instruction
    7476set. In this section, we present our performance comparison of a
    75 NEON-based port of Parabix versus the Expat parser. Xerces is excluded
     77\NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded
    7678from this portion of our study due to the complexity of the
    7779cross-platform build process for C++ applications.
     
    8284512kB L2 shared cache.  Migration of Parabix to the Android platform
    8385began with the retargeting of a subset of the Parabix SIMD library
    84 for ARM NEON.  The majority of the Parabix SIMD functionality ported
     86for ARM \NEON{}.  The majority of the Parabix SIMD functionality ported
    8587directly. However, for a small subset of the SIMD functions (e.g., bit
    86 packing) of NEON equivalents did not exist. In such cases we simply
     88packing) \NEON{} equivalents did not exist. In such cases we simply
    8789emulated logically equivalent instructions using the available
    8890scalar instruction set. This library code was cross-compiled for
     
    98100moderately better than Expat.  Investigating causes for this
    99101performance degradation for Parabix led us to examine the latency
    100 of Neon SIMD operations.
     102of \NEON{} SIMD operations.
    101103
    102104\begin{figure}[!h]
     
    134136instruction latency for low markup density files; it appears that the
    135137latency of SIMD operations is relatively higher on the \CORTEXA8{}
    136 processor.  This is possibly because the Neon SIMD extensions are
     138processor.  This is possibly because the \NEON{} SIMD extensions are
    137139implemented as a coprocessor on \CORTEXA8{} which imposes higher
    138140overhead for applications that frequently inter-operate between scalar
    139 and SIMD registers. Future performance enhancement to ARM NEON that
    140 implement the Neon within the core microarchitecture could
     141and SIMD registers. Future enhancements to ARM \NEON{} that
     142implement \NEON{} within the core microarchitecture could
    141143substantially improve the efficiency of Parabix.
    142144