Timestamp: Aug 25, 2011, 1:56:51 PM (8 years ago)
Author: ashriram
Message: Done evaluation
File: docs/HPCA2012/06-scalability.tex (1 edited)

Changes from r1370 to r1380 (removed lines are prefixed with '-', added lines with '+'):
- \section{Scalability}
+ \section{Parabix on various hardware}
  \label{section:scalability}
  \subsection{Performance}
- Figure \ref{Scalability} (a) demonstrates the average XML
- well-formedness checking performance of Parabix2 for each of the
- workloads as executed on each of the processor cores --- \CO{},
- \CITHREE{} and \SB{}.  Processing time is shown in terms of bit-stream
- based operations executed in `bit-space' and postprocessing operations
- executed in `byte-space'.  In the Parabix2 parser, bit-space parallel
- bit stream parser operations consist primarily of SIMD instructions;
- byte-space operations consist of byte comparisons across arrays of
- values. Executing Parabix2 on \CITHREE{} rather than \CO{} results in
- an average performance improvement of 17\% in bit stream processing,
- whereas migrating Parabix2 from \CITHREE{} to \SB{} results in a 22\%
- average performance gain. Bit-space measurements are stable and
- consistent across each of the source inputs and cores. Postprocessing
- operations demonstrate data-dependent variance. Performance gains of
- 18\% to 31\% are observed in migrating Parabix2 from \CO{} to
- \CITHREE{}, and of 0\% to 17\% from \CITHREE{} to \SB{}. For the
- purpose of comparison, Figure \ref{Scalability} (b) shows the
- performance of the Expat parser on each of the processor cores.  A
- performance improvement of less than 5\% is observed when executing
- Expat on \CITHREE\ over \CO\ and less than 10\% on \SB\ over
- \CITHREE{}.
+ In this section, we study the performance of the XML parsers across
+ three generations of Intel architectures.  Figure \ref{Scalability}
+ (a) shows the average execution time of Parabix.  We analyze the
+ execution time in terms of SIMD operations that operate on bitstreams
+ (\textit{bit-space}) and scalar operations that perform
+ post-processing on the original character bytes.  In Parabix, a
+ significant fraction of the overall execution time is spent in SIMD
+ operations.
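
As a concrete illustration of the bit-space/byte-space split described above, the sketch below builds a one-bit-per-byte stream marking occurrences of '<' in a 16-byte block using SSE2 intrinsics; scalar byte-space code can then walk the resulting mask. The character class, function name, and structure are invented for illustration and are not taken from the Parabix code base.

// Minimal sketch of a "bit-space" operation (illustrative only; not
// Parabix library code): mark every '<' in a 16-byte block with one
// bit per input byte.
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

static inline uint16_t left_angle_stream(const uint8_t * block16) {
    __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i *>(block16));
    __m128i hits = _mm_cmpeq_epi8(data, _mm_set1_epi8('<'));  // 0xFF where byte == '<'
    // One bit per byte; scalar ("byte-space") code can then walk the set bits.
    return static_cast<uint16_t>(_mm_movemask_epi8(hits));
}
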
  
- Overall, Parabix2 scales better than Expat. Simply executing identical
- Parabix2 object code on \SB\ results in an overall performance
- improvement of up to 26\%. Additional performance aspects of Parabix2
- on \SB\ with AVX instructions are discussed in the following sections.
+ Our results demonstrate that Parabix's optimizations are complementary
+ to hardware improvements and seem to further improve the efficiency of
+ newer microarchitectures.  For Parabix's bit-stream processing,
+ \CITHREE{} results in a 40\% performance improvement over \CO{},
+ whereas \SB{} results in a 20\% improvement compared to
+ \CITHREE{}. The improvements in the bit-space SIMD operations are
+ stable across the different input files. Postprocessing operations
+ demonstrate data-dependent variance. \CITHREE{} gains between
+ 27\%--40\% compared to \CO{} and \SB{} gains between 16\%--39\%
+ compared to \CITHREE{}. For the purpose of comparison, Figure
+ \ref{Scalability} (b) shows the performance of the Expat parser;
+ \CITHREE\ improves performance only by 5\% over \CO\ while \SB\
+ improves performance by less than 10\% over \CITHREE{}. Note that the
+ gains of \CITHREE\ over \CO\ include improvements in both clock
+ frequency and microarchitecture, while \SB{}'s gains can be mainly
+ attributed to microarchitectural improvements.
+ 
+ Figure \ref{power_Parabix2} shows the average power consumption of
+ Parabix over each workload as executed on each of the processor
+ cores --- \CO{}, \CITHREE\ and \SB{}.  Overall, each of the last three
+ processor generations brings a 25--30\% improvement in power
+ consumption. Parabix on \SB\ consumes less than 15W.  Overall, Parabix
+ on \SB\ consumes 72\% to 75\% less energy than on \CO{}.
+ 
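Energy here is simply the product of average power and execution time, so the reported saving can be sanity-checked against the other figures in this section. Writing C for \CO{} and SB for \SB{}, and treating the roughly halved power draw and the reported speedups (roughly 20% to 40% per generational step) as multiplicative, a rough estimate is

\[
  \frac{E_{\mathrm{SB}}}{E_{\mathrm{C}}}
  = \frac{P_{\mathrm{SB}}}{P_{\mathrm{C}}} \cdot \frac{t_{\mathrm{SB}}}{t_{\mathrm{C}}}
  \approx 0.5 \times \frac{1}{1.4 \times 1.2} \approx 0.3,
\]

i.e., roughly a 70% reduction in energy, broadly consistent with the measured 72% to 75% saving. This is an illustrative back-of-the-envelope check, not an additional measurement.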
  
  \begin{figure}
  [...]
  \end{figure}
  
- 
- \subsection{Power and Energy}
- 
- Figure \ref{power_Parabix2} shows the average power consumption of
- Parabix2 over each workload as executed on each of the processor
- cores --- \CO{}, \CITHREE\ and \SB{}.  Average power consumption on
- \CO{} is 32 watts. Execution on \CITHREE\ results in a 30\% power
- saving over \CO{}.  \SB\ saves 25\% of the power compared with
- \CITHREE\ and consumes only 15 watts.
- 
- In XML parsing we observe that energy consumption is dependent on
- processing time; that is, a reduction in processing time results in a
- directly proportional reduction in energy consumption.  With newer
- processor cores come improvements in application performance. As a
- result, Parabix2 executed on \SB\ consumes 72\% to 75\% less energy
- than Parabix2 on \CO{}.
  \begin{figure}
  \centering
  [...]
  \label{energy_Parabix2}
  }
+ \caption{Energy Profile of Parabix on various hardware platforms}
  \end{figure}
+ 
+ 
+ \def\CORTEXA8{Cortex-A8}
+ 
+ \subsection{Parabix on mobile processors}
+ \label{section:neon}
+ Our experience with successive generations of Intel processors led us
+ to consider mobile processors such as the ARM \CORTEXA8{}, which also
+ includes SIMD units.  ARM NEON provides a 128-bit SIMD instruction set
+ similar in functionality to the Intel SSE3 instruction set. In this
+ section, we present a performance comparison of a NEON-based port of
+ Parabix against the Expat parser. Xerces is excluded from this portion
+ of our study due to the complexity of the cross-platform build process
+ for C++ applications.
+ 
+ The platform we use is the Samsung Galaxy Android tablet, which houses
+ a Samsung S5PC110 ARM \CORTEXA8{} 1GHz single-core, dual-issue,
+ superscalar microprocessor with a 32kB L1 data cache and a shared
+ 512kB L2 cache.  Migration of Parabix to the Android platform began
+ with the retargeting of a subset of the Parabix SIMD library for ARM
+ NEON.  The majority of the Parabix SIMD functionality ported directly.
+ However, for a small subset of the SIMD functions (e.g., bit packing),
+ NEON equivalents did not exist; in such cases we emulated logically
+ equivalent operations using the available scalar instruction set. This
+ library code was cross-compiled for Android using the Android NDK.
+ 
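To give a flavour of the scalar fallback approach mentioned above, the sketch below emulates one operation of this kind: packing the most significant bit of each byte of a 128-bit value into a 16-bit scalar mask (the effect of the single SSE instruction _mm_movemask_epi8, which has no one-instruction NEON counterpart). The helper name and bit ordering are assumptions for illustration; this is not the actual Parabix library routine.

// Illustrative scalar emulation of a "bit packing" style SIMD operation:
// collect the most significant bit of each of the 16 bytes of a 128-bit
// value into a single 16-bit mask (what _mm_movemask_epi8 does in one
// SSE instruction).  Hypothetical helper; not from the Parabix code base.
#include <cstdint>

static inline uint16_t scalar_movemask_epi8(const uint8_t bytes[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        mask |= static_cast<uint16_t>((bytes[i] >> 7) & 1u) << i;
    }
    return mask;
}

On NEON, the same effect is usually obtained with a short sequence of vector operations followed by a transfer into a general-purpose register, rather than a single instruction.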
+ A comparison of Figure \ref{arm_processing_time} and Figure
+ \ref{corei3_TOT} demonstrates that the performance of both Parabix and
+ Expat degrades substantially on \CORTEXA8{} (?$\times$---?$\times$).
+ This result was expected given the comparatively limited performance
+ of the \CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms
+ Parabix on each of the lower markup density workloads, dew.xml and
+ jaw.xml. On the remaining higher-density workloads, Parabix performs
+ only moderately better than Expat.  Investigating the causes of this
+ performance degradation for Parabix led us to examine the latency of
+ NEON SIMD operations.
+ 
+ Figure \ref{relative_performance_arm} investigates the performance of
+ Expat and Parabix for the various input workloads on the \CORTEXA8{};
+ Figure~\ref{relative_performance_intel} plots the performance for
+ \CITHREE{}. The results demonstrate that the execution time of each
+ parser varies in a linear fashion with respect to the markup density
+ of the file. On both the \CORTEXA8{} and the \CITHREE{}, the two
+ parsers demonstrate the same trend. For lower markup density files,
+ for which the fraction of SIMD operations and hence the potential for
+ parallelism is limited, the overhead of SIMD instructions affects
+ overall execution time. Figure~\ref{relative_performance_arm} provides
+ insight into the problem: Parabix's performance is hindered by SIMD
+ instruction latency for low markup density files; it appears that the
+ latency of SIMD operations is relatively higher on the \CORTEXA8{}
+ processor.  This is possibly because the NEON SIMD extensions are
+ implemented as a coprocessor on the \CORTEXA8{}, which imposes higher
+ overhead on applications that frequently inter-operate between scalar
+ and SIMD registers. Future ARM processors that implement NEON within
+ the core microarchitecture could substantially improve the efficiency
+ of Parabix.
+ 
+ \begin{figure}
+ \subfigure[ARM Neon Performance]{
+ \includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
+ \label{arm_processing_time}
+ }
+ \hfill
+ \subfigure[ARM Neon]{
+ \includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
+ \label{relative_performance_arm}
+ }
+ \hfill
+ \subfigure[Core i3]{
+ \includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
+ \label{relative_performance_intel}
+ }
+ \caption{Parabix performance on mobile platforms}
+ \end{figure}