\section{Evaluation of Parabix across Different Hardware}
\label{section:scalability}
\subsection{Performance}
\label{section:scalability:intel}
In this section, we study the performance of the XML parsers across
three generations of Intel architectures.  Figure~\ref{ScalabilityA}
shows the average execution time of Parabix-XML over all workloads.  We break the
execution time down into SIMD operations that operate on ``bit streams''
(\textit{bit-space}) and scalar operations that perform
``postprocessing'' on the original source bytes.  In Parabix-XML, a significant
fraction of the overall execution time is spent on SIMD operations.

Our results demonstrate that Parabix-XML's optimizations complement
newer hardware improvements. For bit stream processing,
\CITHREE{} has a 40\% performance increase over \CO{};
similarly, \SB{} has a 20\% improvement compared to
\CITHREE{}. These gains appear to be independent of the markup
density of the input file.
Postprocessing operations
demonstrate data-dependent variance. Performance on \CITHREE{} increases by
27\%--40\% compared to \CO{}, whereas performance on \SB{} increases by 16\%--29\%
compared to \CITHREE{}. For the purpose of comparison,
Figure~\ref{ScalabilityB} shows the performance of the Expat parser.
\CITHREE\ improves performance by only 29\% over \CO\ while \SB\
improves performance by less than 6\% over \CITHREE{}. Note that the
gains of \CITHREE\ over \CO\ include improvements in both clock
frequency and microarchitecture, while \SB{}'s gains can
be mainly attributed to the architecture.

Figure~\ref{power_Parabix2} shows the average power consumption of
Parabix-XML for each workload on each of the processor
cores: \CO{}, \CITHREE\ and \SB{}.  Each
processor generation appears to bring a 25\%--30\% improvement
in power consumption over the previous one. Parabix-XML on \SB\ consumes 72\%--75\% less energy than on \CO{}.
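The relationship between the power and energy figures can be made explicit with a short worked calculation. The symbols $P$ (average power), $t$ (execution time per kB) and $E$ (energy per kB) are our own notation, and the sketch assumes that the 25\%--30\% improvement denotes a reduction in average power:
\[
E = P \cdot t, \qquad
\frac{E_{\mathrm{new}}}{E_{\mathrm{old}}} =
\frac{P_{\mathrm{new}}}{P_{\mathrm{old}}} \cdot
\frac{t_{\mathrm{new}}}{t_{\mathrm{old}}}.
\]
Under this reading, two successive generations with a 25\%--30\% power reduction each give $P_{\mathrm{new}}/P_{\mathrm{old}}$ between $0.70^2 = 0.49$ and $0.75^2 \approx 0.56$, whereas a 72\%--75\% energy reduction corresponds to $E_{\mathrm{new}}/E_{\mathrm{old}}$ between $0.25$ and $0.28$. Lower power alone would therefore not account for the reported energy savings; the remainder would have to come from the shorter execution time.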

\begin{figure}
\centering
\subfigure[Parabix]{
\includegraphics[width=0.40\textwidth]{plots/P2_scalability.pdf}
\label{ScalabilityA}
}
\subfigure[Expat]{
\includegraphics[width=0.40\textwidth]{plots/Expat_scalability.pdf}
\label{ScalabilityB}
}
\caption{Average Performance Parabix vs. Expat (y-axis: ns per kB)}
\label{Scalability}
\end{figure}

\begin{figure}
\centering
\subfigure[Avg. Power of Parabix on various hardware (Watts)]{
\includegraphics[width=85mm]{plots/power_Parabix2.pdf}
\label{power_Parabix2}
}
\hfill
\centering
\subfigure[Avg. Energy Consumption on various hardware (nJ per kB)]{
\includegraphics[width=85mm]{plots/energy_Parabix2.pdf}
\label{energy_Parabix2}
}
\caption{Energy Profile of Parabix on various hardware platforms}
\end{figure}

\def\CORTEXA8{Cortex-A8}

\subsection{Parabix on Mobile Processors}
\label{section:scalability:\NEON{}}
Our experience with Intel processors led us to
question whether mobile processors with SIMD support, such as the ARM \CORTEXA8{},
could benefit from Parabix technology. ARM \NEON{} provides a 128-bit SIMD
instruction set similar in functionality to the Intel SSE3 instruction
set. In this section, we present our performance comparison of a
\NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded
from this portion of our study due to the complexity of the
cross-platform build process for C++ applications.

The platform we use is the Samsung Galaxy Android tablet, which houses a
Samsung S5PC110 ARM \CORTEXA8{} 1\,GHz single-core, dual-issue,
superscalar microprocessor. It includes a 32\,kB L1 data cache and a
512\,kB L2 shared cache.  Migration of Parabix-XML to the Android platform
only required developing a Parabix runtime library for ARM \NEON{}.
The majority of the runtime functionality was ported
directly. However, a small subset of key SIMD instructions (e.g., bit
packing) did not exist on \NEON{}. In such cases, the
logical equivalent of those instructions was emulated using the available
ISA. The resulting application was cross-compiled for
Android using the Android NDK.

A comparison of Figure~\ref{arm_processing_time} and
Figure~\ref{corei3_TOT} demonstrates that the performance of both Parabix and
Expat degrades substantially on \CORTEXA8{} (5--17$\times$).
This result was expected given the comparatively performance-limited
\CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix
on each of the lower markup density workloads, dew.xml and jaw.xml. On
the remaining higher-density workloads, Parabix performs only
moderately better than Expat.  Seeking the cause of this
performance degradation for Parabix led us to examine the latency
of \NEON{} SIMD operations.
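To illustrate what emulating a missing SIMD instruction involves, the following minimal sketch gives a scalar reference model of one such operation: gathering the most significant bit of each byte of a 128-bit register into a 16-bit mask. SSE2 provides this as a single instruction (\texttt{PMOVMSKB}), whereas ARMv7 \NEON{} has no single-instruction counterpart. Whether this is the bit-packing operation that was emulated in the port is an assumption, and the function name \texttt{gather\_sign\_bits} is hypothetical; the model is written in Python for readability rather than with \NEON{} intrinsics.

```python
def gather_sign_bits(byte_fields):
    """Scalar model: collect the top bit of sixteen 8-bit fields.

    byte_fields models one 128-bit register as sixteen unsigned bytes.
    Bit i of the returned 16-bit mask is the most significant bit of
    byte_fields[i]. The name and interface are hypothetical.
    """
    if len(byte_fields) != 16:
        raise ValueError("expected sixteen 8-bit fields")
    mask = 0
    for i, field in enumerate(byte_fields):
        if not 0 <= field <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
        # Shift the field's top bit down to bit 0, then up to position i.
        mask |= (field >> 7) << i
    return mask


# Every even-indexed byte has its top bit set, so the even mask bits are set.
example_mask = gather_sign_bits([0x80, 0x00] * 8)  # 0x5555
```

A \NEON{} implementation would need a short sequence of instructions to produce the same mask rather than a single one, which is one way in which emulation can add latency.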

\begin{figure*}[htbp]
\subfigure[ARM Neon Performance (cycles per kB)]{
\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
\label{arm_processing_time}
}
\hfill
\subfigure[ARM Neon]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
\label{relative_performance_arm}
}
\hfill
\subfigure[Core i3]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
\label{relative_performance_intel}
}
\caption{Comparison of Parabix-XML on ARM vs. Intel.}
\end{figure*}

Figure~\ref{relative_performance_arm} shows the performance of
Expat and Parabix for the various input workloads on the \CORTEXA8{};
Figure~\ref{relative_performance_intel} plots the performance for
\CITHREE{}. The results demonstrate that the execution time of
each parser varies linearly with the markup
density of the file. On both \CORTEXA8{} and \CITHREE{}, the two
parsers demonstrate the same trend: files with a lower markup density
exhibit higher levels of parallelism; consequently, the overhead of SIMD
instructions has a greater impact on the overall execution time for
those files.
The contrast between Figures~\ref{relative_performance_arm} and~\ref{relative_performance_intel} provides
insight into the problem: Parabix-XML's performance is hindered by SIMD
instruction latency.  This is possibly because the \NEON{} SIMD extensions are
implemented as a coprocessor on the \CORTEXA8{}, which imposes a higher
overhead for applications that frequently inter-operate between scalar
and SIMD registers. Future enhancements to ARM \NEON{} that
implement it within the core microarchitecture could
substantially improve the efficiency of Parabix-XML.
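The linear relationship between execution time and markup density noted above can be checked with an ordinary least-squares fit. The sketch below is one way to do so; the helper name \texttt{fit\_line} is hypothetical, and the sample values are synthetic points placed exactly on a line, not measurements from this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    xs: markup densities; ys: execution times (e.g., cycles per kB).
    Returns (slope, intercept). Hypothetical helper for illustration.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and the x-y co-deviation.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic example: points generated from time = 1000 + 4000 * density.
densities = [0.0, 0.25, 0.5, 0.75, 1.0]
times = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
slope, intercept = fit_line(densities, times)
```

In such a fit, the intercept estimates the cost per kB that is independent of markup and the slope estimates the additional cost attributable to markup, so comparing the two across platforms gives a rough view of where each parser spends its time.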