% Source: docs/HPCA2012/06-scalability.tex, revision 1407 (checked in by ashriram: ``Minor bug fixes'')
\section{Parabix on various hardware}
\label{section:scalability}
\subsection{Performance}
In this section, we study the performance of the XML parsers across
three generations of Intel architectures.  Figure~\ref{Scalability}(a)
shows the average execution time of Parabix.  We analyze the
execution time in terms of SIMD operations that operate on bitstreams
(\textit{bit-space}) and scalar operations that perform
postprocessing on the original character bytes.  In Parabix, a
significant fraction of the overall execution time is spent in SIMD
operations.

Our results demonstrate that Parabix's optimizations are complementary
to hardware improvements and seem to further improve the efficiency of
newer microarchitectures.  For Parabix's bitstream processing,
\CITHREE{} results in a 40\% performance improvement over \CO{},
whereas \SB{} results in a 20\% improvement compared to
\CITHREE{}. The improvement in the bit-space SIMD operations is
stable across the different input files. Postprocessing operations
demonstrate data-dependent variance: \CITHREE{} gains between
27\% and 40\% compared to \CO{}, and \SB{} gains between 16\% and 39\%
compared to \CITHREE{}. For the purpose of comparison,
Figure~\ref{Scalability}(b) shows the performance of the Expat parser;
\CITHREE{} improves performance by only 5\% over \CO{}, while \SB{}
improves performance by less than 10\% over \CITHREE{}. Note that the
gains of \CITHREE{} over \CO{} include improvements in both clock
frequency and microarchitecture, while \SB{}'s gains can
be mainly attributed to the architecture.

Figure~\ref{power_Parabix2} shows the average power consumption of
Parabix over each workload as executed on each of the processor
cores: \CO{}, \CITHREE{}, and \SB{}.  Overall, each of the last three
processor generations seems to bring a 25--30\% improvement
in power consumption. Parabix on \SB{} consumes
less than 15W.  Overall, Parabix on \SB{} consumes 72\% to 75\% less
energy than on \CO{}.
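
As a rough consistency check on these figures, energy is the product
of average power and execution time, $E = P \cdot t$.  The sketch
below rests on assumptions that the measurements themselves do not
state: each quoted ``improvement'' is read as a proportional
reduction, the 40\% and 20\% bit-space gains are taken as an
approximation of the reduction in total execution time, and the
symbols $P$, $t$, and $E$ are introduced here only for illustration.
\[
\frac{t_{\mathrm{SB}}}{t_{\mathrm{CO}}} \approx (1-0.40)(1-0.20) = 0.48,
\qquad
\frac{P_{\mathrm{SB}}}{P_{\mathrm{CO}}} \approx (1-0.30)^2 \ldots (1-0.25)^2
= 0.49 \ldots 0.56,
\]
\[
\frac{E_{\mathrm{SB}}}{E_{\mathrm{CO}}}
= \frac{P_{\mathrm{SB}}}{P_{\mathrm{CO}}} \cdot
  \frac{t_{\mathrm{SB}}}{t_{\mathrm{CO}}}
\approx 0.24 \ldots 0.27 .
\]
Under these assumptions, \SB{} would use roughly 73\%--76\% less energy
than \CO{}, which is in line with the measured 72\% to 75\%.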

\begin{figure}
\centering
\subfigure[Parabix]{
\includegraphics[width=0.40\textwidth]{plots/P2_scalability.pdf}
}
\subfigure[Expat]{
\includegraphics[width=0.40\textwidth]{plots/Expat_scalability.pdf}
}
\caption{Average Performance Parabix vs. Expat (y-axis: ns per kB)}
\label{Scalability}
\end{figure}

\begin{figure}
\centering
\subfigure[Avg. Power of Parabix on various hardware (Watts)]{
\includegraphics[width=85mm]{plots/power_Parabix2.pdf}
\label{power_Parabix2}
}
\hfill
\centering
\subfigure[Avg. Energy Consumption on various hardware (nJ per kB)]{
\includegraphics[width=85mm]{plots/energy_Parabix2.pdf}
\label{energy_Parabix2}
}
\caption{Energy Profile of Parabix on various hardware platforms}
\end{figure}

\def\CORTEXA8{Cortex-A8}

\subsection{Parabix on Mobile processors}
\label{section:neon}
Our experience with successive generations of Intel processors led us
to consider mobile processors such as the ARM \CORTEXA8{}, which
also includes SIMD units.  ARM NEON provides a 128-bit SIMD
instruction set similar in functionality to the Intel SSE3 instruction
set. In this section, we present a performance comparison of a
NEON-based port of Parabix versus the Expat parser. Xerces is excluded
from this portion of our study due to the complexity of the
cross-platform build process for C++ applications.

The platform we use is the Samsung Galaxy Android tablet, which houses
a Samsung S5PC110 ARM \CORTEXA8{} 1\,GHz single-core, dual-issue,
superscalar microprocessor. It includes a 32kB L1 data cache and a
512kB shared L2 cache.  Migration of Parabix to the Android platform
began with the retargeting of a subset of the Parabix SIMD library
for ARM NEON.  The majority of the Parabix SIMD functionality ported
directly. However, for a small subset of the SIMD functions (e.g., bit
packing), NEON equivalents did not exist. In such cases, we simply
emulated logically equivalent operations using the available
scalar instruction set. This library code was cross-compiled for
Android using the Android NDK.

A comparison of Figure~\ref{arm_processing_time} and
Figure~\ref{corei3_TOT} demonstrates that the performance of both
Parabix and Expat degrades substantially on \CORTEXA8{}
(5$\times$--17$\times$).
This result was expected given the comparatively limited performance
of the \CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms
Parabix on each of the lower markup density workloads, dew.xml and
jaw.xml. On the remaining higher-density workloads, Parabix performs
only moderately better than Expat.  This performance degradation for
Parabix led us to investigate the latency of NEON SIMD operations.

\begin{figure}[!h]
\subfigure[ARM NEON Performance]{
\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
\label{arm_processing_time}
}
\hfill
\subfigure[ARM NEON]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
\label{relative_performance_arm}
}
\hfill
\subfigure[Core i3]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
\label{relative_performance_intel}
}
\caption{Comparing Parabix on ARM and Intel.}
\end{figure}

Figure~\ref{relative_performance_arm} plots the performance of
Expat and Parabix for the various input workloads on the \CORTEXA8{};
Figure~\ref{relative_performance_intel} plots the performance for
\CITHREE{}. The results demonstrate that the execution time of
each parser varies linearly with the markup
density of the file. On both \CORTEXA8{} and \CITHREE{}, both
parsers demonstrate the same trend. For lower markup density files,
for which the fraction of SIMD operations and hence the potential for
parallelism is limited, the overheads of SIMD instructions affect
overall execution time. Figure~\ref{relative_performance_arm} provides
insight into the problem: Parabix's performance is hindered by SIMD
instruction latency for low markup density files, and it appears that
the latency of SIMD operations is relatively higher on the \CORTEXA8{}
processor.  This is possibly because the NEON SIMD extensions are
implemented as a coprocessor on \CORTEXA8{}, which imposes higher
overhead for applications that frequently interoperate between scalar
and SIMD registers. Future enhancements to ARM NEON that
implement the SIMD unit within the core microarchitecture could
substantially improve the efficiency of Parabix.
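
One way to read this trend is as a simple linear cost model.  The
symbols below are introduced here for illustration only and are not
fitted to the measurements.  Let $d$ be the markup density of a file
and $T(d)$ the execution time per kB of a parser:
\[
T_{\mathrm{Parabix}}(d) \approx \alpha_{P} + \beta_{P}\, d,
\qquad
T_{\mathrm{Expat}}(d) \approx \alpha_{E} + \beta_{E}\, d .
\]
If Parabix carries the larger fixed cost ($\alpha_{P} > \alpha_{E}$,
for example SIMD overhead that is paid regardless of markup) but the
smaller per-markup cost ($\beta_{P} < \beta_{E}$), the two lines cross
at
\[
d^{*} = \frac{\alpha_{P} - \alpha_{E}}{\beta_{E} - \beta_{P}},
\]
and Expat is faster for $d < d^{*}$.  Under this reading, a higher SIMD
latency on the \CORTEXA8{} raises $\alpha_{P}$ and moves $d^{*}$ toward
higher densities, which would be consistent with Expat outperforming
Parabix on dew.xml and jaw.xml.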