# Changeset 1380 for docs/HPCA2012

Timestamp:
Aug 25, 2011, 1:56:51 PM
Message:

Done evaluation

Location:
docs/HPCA2012
Files:
6 edited


• ## docs/HPCA2012/05-corei3.tex

 r1378 requires less than a single cycle per byte.

\begin{figure}[htbp]
\begin{minipage}{0.5\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/corei3_INS_p2.pdf}
\caption{Instruction Breakdown (\% SIMD Instructions)}
\label{corei3_INS_p2}
\end{minipage}%
\hfill
\begin{minipage}{0.5\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/corei3_TOT.pdf}
\caption{Performance (CPU Cycles per kB)}
\label{corei3_TOT}
\end{minipage}
\end{figure}

\subsection{Power and Energy}

In this section, we study the power and energy consumption of Parabix in comparison with Expat and Xerces on \CITHREE{}. The average power of \CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the average power consumed by each parser. Parabix, dominated by SIMD instructions, uses approximately 5\% additional power. More importantly, by using data-parallel operations Parabix amortizes the instruction fetch and data access overheads, which results in only a minimal power increase compared to the conventional parsers.

The energy trends shown in Figure \ref{corei3_energy} reveal an interesting result: Parabix consumes substantially less energy than the other parsers. Parabix consumes 50 to 75 nJ per byte, while Expat and Xerces consume 80 to 320 nJ and 140 to 370 nJ per byte respectively. Although Parabix requires slightly more power (per instruction), the processing time of Parabix is significantly lower.

\begin{figure}
\centering
% power and energy subplots elided in this changeset excerpt
\caption{Power profile of Parabix on \CITHREE{}}
\label{corei3_power}
\label{corei3_energy}
\end{figure}
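The energy comparison follows directly from energy = power × time. As a sketch, using symbols of our own choosing rather than notation from the paper:

```latex
\[
E_{\mathrm{byte}} \;=\; P \cdot t_{\mathrm{byte}} \;=\; P \cdot \frac{C_{\mathrm{byte}}}{f},
\]
% where $P$ is average power, $C_{\mathrm{byte}}$ is cycles per byte,
% and $f$ is the clock frequency.
```

Because all three parsers draw nearly the same power (Parabix about 5\% more), the spread in energy per byte tracks the spread in cycles per byte almost directly.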
• ## docs/HPCA2012/06-scalability.tex

 r1370 \section{Parabix on various hardware} \label{section:scalability} \subsection{Performance} In this section, we study the performance of the XML parsers across three generations of Intel architectures. Figure \ref{Scalability}(a) shows the average execution time of Parabix. We analyze the execution time in terms of SIMD operations that operate on bitstreams (\textit{bit-space}) and scalar operations that perform post-processing on the original character bytes (\textit{byte-space}). In Parabix, a significant fraction of the overall execution time is spent in SIMD operations. Overall, Parabix2 scales better than Expat.
Simply executing identical Parabix2 object code on \SB\ results in an overall performance improvement of up to 26\%. Additional performance aspects of Parabix2 on \SB\ with AVX instructions are discussed in the following sections. Our results demonstrate that Parabix's optimizations are complementary to hardware improvements and further improve the efficiency of newer microarchitectures. For Parabix's bit-stream processing, \CITHREE{} delivers a 40\% performance improvement over \CO{}, whereas \SB{} delivers a further 20\% improvement over \CITHREE{}. The improvement in the bit-space SIMD operations is stable across the different input files. Postprocessing operations demonstrate data-dependent variance: \CITHREE{} gains between 27\% and 40\% compared to \CO{}, and \SB{} gains between 16\% and 39\% compared to \CITHREE{}. For comparison, Figure \ref{Scalability}(b) shows the performance of the Expat parser; \CITHREE\ improves performance by only 5\% over \CO{}, while \SB\ improves performance by less than 10\% over \CITHREE{}. Note that the gains of \CITHREE\ over \CO\ include improvements in both clock frequency and microarchitecture, while \SB{}'s gains can be attributed mainly to the microarchitecture.

\subsection{Power and Energy}
Figure \ref{power_Parabix2} shows the average power consumption of Parabix2 over each workload and as executed on each of the processor cores --- \CO{}, \CITHREE\ and \SB{}. Overall, each of the last three processor generations brings a 25--30\% improvement in power consumption. Average power consumption on \CO{} is 32 watts.
Execution on \CITHREE\ results in a 30\% power saving over \CO{}. \SB\ saves a further 25\% compared with \CITHREE\ and consumes only 15 watts. In XML parsing we observe that energy consumption is dependent on processing time: a reduction in processing time results in a directly proportional reduction in energy consumption. With newer processor cores come improvements in application performance. As a result, Parabix2 executed on \SB\ consumes 72\% to 75\% less energy than Parabix2 on \CO{}.

\begin{figure}
\centering
% energy plot elided in this changeset excerpt
\caption{Energy Profile of Parabix on various hardware platforms}
\label{energy_Parabix2}
\end{figure}

\def\CORTEXA8{Cortex-A8}
\subsection{Parabix on Mobile processors}
\label{section:neon}
Our experience with recent generations of Intel processors led us to consider mobile processors such as the ARM \CORTEXA8{}, which also includes SIMD units. ARM NEON provides a 128-bit SIMD instruction set similar in functionality to the Intel SSE3 instruction set. In this section, we present a performance comparison of a NEON-based port of Parabix versus the Expat parser. Xerces is excluded from this portion of our study due to the complexity of the cross-platform build process for C++ applications. The platform we use is a Samsung Galaxy Android tablet that houses a Samsung S5PC110: an ARM \CORTEXA8{} 1 GHz single-core, dual-issue, superscalar microprocessor with a 32 kB L1 data cache and a 512 kB shared L2 cache. Migration of Parabix to the Android platform began with the retargeting of a subset of the Parabix SIMD library for ARM NEON. The majority of the Parabix SIMD functionality ported directly; however, for a small subset of the SIMD functions (e.g., bit packing), NEON equivalents did not exist. In such cases we emulated the logically equivalent operations using the available scalar instruction set. This library code was cross-compiled for Android using the Android NDK.
A comparison of Figure \ref{arm_processing_time} and Figure \ref{corei3_TOT} demonstrates that the performance of both Parabix and Expat degrades substantially on \CORTEXA8{} (?$\times$---?$\times$). This result was expected given the comparatively performance-limited \CORTEXA8{}. Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix on each of the lower markup density workloads, dew.xml and jaw.xml. On the remaining higher-density workloads, Parabix performs only moderately better than Expat. Investigating the causes of this performance degradation led us to examine the latency of NEON SIMD operations. Figure \ref{relative_performance_arm} plots the performance of Expat and Parabix for the various input workloads on \CORTEXA8{}; Figure~\ref{relative_performance_intel} plots the same for \CITHREE{}. The results demonstrate that the execution time of each parser varies linearly with the markup density of the file, and both parsers exhibit the same trend on \CORTEXA8{} and \CITHREE{}. For lower markup density files, for which the fraction of SIMD operations and hence the potential for parallelism is limited, the overhead of SIMD instructions affects overall execution time. Figure~\ref{relative_performance_arm} provides insight into the problem: Parabix's performance is hindered by SIMD instruction latency for low markup density files, and the latency of SIMD operations appears to be relatively higher on the \CORTEXA8{} processor. This is possibly because the NEON SIMD extensions are implemented as a coprocessor on \CORTEXA8{}, which imposes higher overhead for applications that frequently move data between scalar and SIMD registers. Future ARM processors that implement NEON within the core microarchitecture could substantially improve the efficiency of Parabix.
\begin{figure}
\subfigure[ARM NEON Performance]{
  \includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
  \label{arm_processing_time}
}
\hfill
\subfigure[ARM NEON]{
  \includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
  \label{relative_performance_arm}
}
\hfill
\subfigure[Core i3]{
  \includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
  \label{relative_performance_intel}
}
\caption{Parabix performance on mobile platforms}
\end{figure}
• ## docs/HPCA2012/09-pipeline.tex

 r1362 We adopt a contrasting approach to parallelizing the Parabix XML parser. As described in Section~\ref{section:parser}, Parabix consists of multiple passes that operate on every chunk of input data, and each of these stages interacts in sequence with no data movement from later to earlier stages.
• ## docs/HPCA2012/main.tex

 r1363 \input{06-scalability.tex}
\input{07-avx.tex}
%\input{08-arm.tex}
\input{09-pipeline.tex}
\input{10-related.tex}