% source: docs/HPCA2012/06-scalability.tex @ 1408
% Last change on this file since 1408 was 1408, checked in by ksherdy, 8 years ago
% edits and corrects to performance subsection
\section{Evaluating Parabix on Hardware}
\label{section:scalability}
\subsection{Performance}
\label{section:scalability:intel}
In this section, we study the performance of the XML parsers across
three generations of Intel architectures.  Figure~\ref{ScalabilityA}
shows the average execution time of Parabix-XML (over all workloads).  We analyze the
execution time in terms of SIMD operations that operate on ``bit streams''
(\textit{bit-space}) and scalar operations that perform
``postprocessing'' on the original source bytes.  In Parabix-XML, a significant
fraction of the overall execution time is spent on SIMD operations.

Our results demonstrate that Parabix-XML's optimizations complement
newer hardware improvements. For bit-stream processing,
\CITHREE{} has a 40\% performance increase over \CO{};
similarly, \SB{} has a 20\% improvement compared to
\CITHREE{}. These gains appear to be independent of the markup
density of the input file.
Postprocessing operations, in contrast,
demonstrate data-dependent variance: performance on \CITHREE{} increases by
27\%--40\% compared to \CO{}, whereas \SB{} increases by 16\%--29\%
compared to \CITHREE{}. For the purpose of comparison,
Figure~\ref{ScalabilityB} shows the performance of the Expat parser.
\CITHREE{} improves performance by only 29\% over \CO{}, while \SB{}
improves performance by less than 6\% over \CITHREE{}. Note that the
gains of \CITHREE{} over \CO{} include improvements in both clock
frequency and microarchitecture, while \SB{}'s gains can
be attributed mainly to the architecture.

Figure~\ref{power_Parabix2} shows the average power consumption of
Parabix-XML for each workload on each of the processor
cores: \CO{}, \CITHREE{}, and \SB{}.  Each
processor generation appears to bring a 25\%--30\% improvement
in power consumption over the previous generation. Overall,
Parabix-XML on \SB{} consumes 72\%--75\% less energy than it did on \CO{}.
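
As a rough consistency check on these figures (an illustrative calculation, not an additional measurement), the energy per kB can be related to the average power and the execution time per kB. The symbols $E$, $P$ and $t$ and the subscripts SB and CO (for \SB{} and \CO{}) are introduced here for illustration only, and we assume that the quoted 25\%--30\% improvement denotes a 25\%--30\% reduction in average power per generation:
\[
E = P \cdot t, \qquad
\frac{E_{\mathrm{SB}}}{E_{\mathrm{CO}}}
  = \frac{P_{\mathrm{SB}}}{P_{\mathrm{CO}}} \cdot \frac{t_{\mathrm{SB}}}{t_{\mathrm{CO}}}.
\]
Under this assumption, two generations give a power ratio between $0.70^2 = 0.49$ and $0.75^2 = 0.5625$, while 72\%--75\% less energy corresponds to an energy ratio between $0.25$ and $0.28$. The implied execution-time ratio therefore lies between $0.25/0.5625 \approx 0.44$ and $0.28/0.49 \approx 0.57$. For comparison, if the 40\% and 20\% bit-stream gains are read as reductions in execution time, they compound to $(1-0.40)(1-0.20) = 0.48$, which falls within this range.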

\begin{figure}
\centering
\subfigure[Parabix]{
\includegraphics[width=0.40\textwidth]{plots/P2_scalability.pdf}
\label{ScalabilityA}
}
\subfigure[Expat]{
\includegraphics[width=0.40\textwidth]{plots/Expat_scalability.pdf}
\label{ScalabilityB}
}
\caption{Average Performance Parabix vs. Expat (y-axis: ns per kB)}
\label{Scalability}
\end{figure}

\begin{figure}
\centering
\subfigure[Avg. Power of Parabix on various hardware (Watts)]{
\includegraphics[width=85mm]{plots/power_Parabix2.pdf}
\label{power_Parabix2}
}
\hfill
\centering
\subfigure[Avg. Energy Consumption on various hardware (nJ per kB)]{
\includegraphics[width=85mm]{plots/energy_Parabix2.pdf}
\label{energy_Parabix2}
}
\caption{Energy Profile of Parabix on various hardware platforms}
\end{figure}

\def\CORTEXA8{Cortex-A8}

\subsection{Parabix on Mobile Processors}
\label{section:scalability:\NEON{}}
Our experience with successive generations of Intel processors led us to
consider mobile processors such as the ARM \CORTEXA8{}, which
also includes SIMD units.  ARM \NEON{} provides a 128-bit SIMD
instruction set similar in functionality to the Intel SSE3 instruction
set. In this section, we present a performance comparison of a
\NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded
from this portion of our study due to the complexity of the
cross-platform build process for C++ applications.

The platform we use is the Samsung Galaxy Android Tablet, which houses a
Samsung S5PC110 ARM \CORTEXA8{} 1\,GHz single-core, dual-issue,
superscalar microprocessor. It includes a 32\,kB L1 data cache and a
512\,kB L2 shared cache.  Migration of Parabix to the Android platform
began with the retargeting of a subset of the Parabix SIMD library
for ARM \NEON{}.  The majority of the Parabix SIMD functionality ported
directly. However, for a small subset of the SIMD functions (e.g., bit
packing), \NEON{} equivalents did not exist. In such cases we simply
emulated logically equivalent operations using the available
scalar instruction set. This library code was cross-compiled for
Android using the Android NDK.

A comparison of Figure~\ref{arm_processing_time} and
Figure~\ref{corei3_TOT} demonstrates that the performance of both Parabix and
Expat degrades substantially on \CORTEXA8{} (5$\times$--17$\times$).
This result was expected given the comparatively performance-limited
\CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix
on each of the lower markup density workloads, dew.xml and jaw.xml. On
the remaining higher-density workloads, Parabix performs only
moderately better than Expat.  Searching for the cause of this
performance degradation led us to examine the latency
of \NEON{} SIMD operations.

\begin{figure}[!h]
\subfigure[ARM Neon Performance]{
\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
\label{arm_processing_time}
}
\hfill
\subfigure[ARM Neon]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
\label{relative_performance_arm}
}
\hfill
\subfigure[Core i3]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
\label{relative_performance_intel}
}
\caption{Comparing Parabix on ARM and Intel.}
\end{figure}

Figure~\ref{relative_performance_arm} shows the performance of
Expat and Parabix for the various input workloads on the \CORTEXA8{};
Figure~\ref{relative_performance_intel} plots the performance for
\CITHREE{}. The results demonstrate that the execution time of
each parser varies linearly with the markup
density of the file. Both parsers demonstrate the same trend on
both \CORTEXA8{} and \CITHREE{}. For lower markup density files,
for which the fraction of SIMD operations and hence the potential for
parallelism is limited, the overheads of SIMD instructions affect
overall execution time. Figure~\ref{relative_performance_arm} provides
insight into the problem: Parabix's performance is hindered by SIMD
instruction latency for low markup density files, and it appears that the
latency of SIMD operations is relatively higher on the \CORTEXA8{}
processor.  This is possibly because the \NEON{} SIMD extensions are
implemented as a coprocessor on \CORTEXA8{}, which imposes higher
overhead for applications that frequently inter-operate between scalar
and SIMD registers. Future enhancements to ARM \NEON{} that
implement \NEON{} within the core microarchitecture could
substantially improve the efficiency of Parabix.
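
The linear trend described above can be summarized by a simple model. This is an illustrative sketch only: the markup density $d$, the coefficients $a$ and $b$, and the crossover density $d^{*}$ are symbols introduced here as assumptions, and no coefficients are fitted to the measured data. Let $t(d)$ denote the execution time per kB of input:
\[
t_{\mathrm{Parabix}}(d) = a_{P} + b_{P}\,d, \qquad
t_{\mathrm{Expat}}(d) = a_{E} + b_{E}\,d.
\]
If Parabix carries the larger density-independent cost ($a_{P} > a_{E}$, for example due to SIMD latency) but the smaller slope ($b_{P} < b_{E}$), the two lines cross at
\[
d^{*} = \frac{a_{P} - a_{E}}{b_{E} - b_{P}},
\]
so that Expat is faster for $d < d^{*}$ and Parabix is faster for $d > d^{*}$. Under these assumptions, the \CORTEXA8{} results would correspond to a $d^{*}$ lying above the densities of dew.xml and jaw.xml, and any reduction in SIMD overhead would lower $a_{P}$ and hence $d^{*}$.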