# source:docs/HPCA2012/final_ieee/06-scalability.tex

Last change on this file was 1783, checked in by ashriram, 8 years ago

Final pass

\section{Parabix on different platforms}
\label{section:scalability}
\subsection{Performance}
\label{section:scalability:intel}
In this section, we study the performance of the XML parsers across
three generations of Intel architectures.  Figure~\ref{Parabix_all_platform}
shows the average execution time of Parabix-XML over all workloads.  We analyze the
execution time in terms of SIMD operations that operate on ``bit streams''
in \textit{bit-space} and scalar operations used to perform ``post
processing'' operations on the source input.

\begin{figure}[htb]
\begin{center}
{
\includegraphics[width=0.5\textwidth]{plots/Parabix2_all_platform.pdf}
}
\end{center}
\caption{Parabix on various hardware platforms}
\label{Parabix_all_platform}
\end{figure}


\begin{figure*}[!htbp]
\begin{center}
{
\subfigure[ARM Neon Performance (cycles per kB)]{
\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
\label{arm_processing_time}
}
\hfill
\subfigure[ARM Neon]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
\label{relative_performance_arm}
}
\hfill
\subfigure[Core i3]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
\label{relative_performance_intel}
}
}
\end{center}
\caption{Comparison of Parabix-XML on ARM vs. Intel.}
\end{figure*}

Our results demonstrate that Parabix-XML's optimizations complement
newer hardware improvements. For bit stream processing,
\CITHREE{} has a 40\% performance increase over \CO{};
similarly, \SB{} has a 20\% improvement compared to
\CITHREE{}. These gains appear independent of the markup density.
Postprocessing operations
demonstrate data-dependent variance. Performance on the \CITHREE{} increases by
27\%--40\% compared to \CO{}, whereas \SB{} increases by 16\%--29\%
compared to \CITHREE{}.
\CITHREE\ improves performance by only 29\% over \CO\ while \SB\
improves performance by less than 6\% over \CITHREE{}. Note that the
gains of \CITHREE\ over \CO\ include improvements in both clock
frequency and microarchitecture, while \SB{}'s gains are mainly attributed to the architecture.
Figure~\ref{Parabix_all_platform} also shows the average power consumption of
Parabix-XML across the workloads, as executed on each of the processors:
\CO{}, \CITHREE\ and \SB{}.  Each generation of processor appears to bring a 25\%--30\% improvement
in power consumption over the previous generation. Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}.
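
As a rough consistency check on these figures, energy can be related to average power and execution time. The symbols $E$, $\bar{P}$ and $t$ and the subscripts (SB for \SB{}, CO for \CO{}) are notation introduced here for illustration. The calculation assumes that the per-generation ``improvement in power consumption'' is a reduction in average power that compounds multiplicatively; it is a back-of-the-envelope sketch, not a measured result.
\[
E = \bar{P}\,t
\quad\Longrightarrow\quad
\frac{E_{\mathrm{SB}}}{E_{\mathrm{CO}}}
= \frac{\bar{P}_{\mathrm{SB}}}{\bar{P}_{\mathrm{CO}}}\cdot\frac{t_{\mathrm{SB}}}{t_{\mathrm{CO}}}.
\]
Under this assumption, two successive reductions of 25\%--30\% give $\bar{P}_{\mathrm{SB}}/\bar{P}_{\mathrm{CO}}$ between $0.70^{2} = 0.49$ and $0.75^{2} \approx 0.56$. An energy reduction of 72\%--75\% corresponds to $E_{\mathrm{SB}}/E_{\mathrm{CO}}$ between $0.25$ and $0.28$, so the implied execution-time ratio $t_{\mathrm{SB}}/t_{\mathrm{CO}}$ would lie roughly between $0.25/0.56 \approx 0.44$ and $0.28/0.49 \approx 0.57$.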

\def\CORTEXA8{Cortex-A8}

\subsection{Parabix on Mobile Processors}
\label{section:scalability:\NEON{}}
Our experience with Intel processors led us to
ask whether mobile processors with SIMD support, such as the ARM \CORTEXA8{},
could benefit from Parabix technology. ARM \NEON{} provides a 128-bit SIMD
instruction set similar in functionality to the Intel SSE3 instruction
set. In this section, we present our performance comparison of a
\NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded
from this portion of our study due to the complexity of the
cross-platform build process for C++ applications.

Our platform is the Samsung Galaxy Android tablet, which houses a
Samsung S5PC110, a 1\,GHz single-core, dual-issue,
superscalar ARM \CORTEXA8{} microprocessor. This device includes a 32\,kB L1 data cache and a
512\,kB shared L2 cache.  Migrating Parabix-XML to the Android platform
only required developing a Parabix runtime library for ARM \NEON{}.
The majority of the runtime functionality was ported
directly. However, a small subset of key SIMD instructions (e.g., bit
packing) did not exist on \NEON{}. In such cases, the
logical equivalents of those instructions were emulated using the available
ISA. The resulting application was cross-compiled for
Android using the Android NDK.

A comparison of Figure~\ref{arm_processing_time} and
Figure~\ref{corei3_TOT} demonstrates that the performance of both Parabix and
Expat degrades substantially on \CORTEXA8{} (5--17$\times$).
This result was expected given the comparatively performance-limited
\CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix
on each of the lower markup density workloads, dew.xml and jaw.xml. On
the remaining higher-density workloads, Parabix performs only
moderately better than Expat.  To identify the cause of this
performance degradation for Parabix, we examined the latency
of \NEON{} SIMD operations.

Figure~\ref{relative_performance_arm} plots the performance of
Expat and Parabix for the various input workloads on the \CORTEXA8{};
Figure~\ref{relative_performance_intel} plots the performance for
\CITHREE{}. The results demonstrate that the execution time of
each parser varies linearly with the markup
density of the file. On both \CORTEXA8{} and \CITHREE{}, both
parsers demonstrate the same trend: files with a lower markup density
exhibit higher levels of parallelism; consequently, the overhead of SIMD
instructions has a greater impact on the overall execution time for
those files.
The contrast between Figures~\ref{relative_performance_arm} and~\ref{relative_performance_intel} provides
insight into the problem: Parabix-XML's performance is hindered by SIMD
instruction latency.  A possible cause is that the \NEON{} SIMD extensions are
implemented as a coprocessor on the \CORTEXA8{}, which imposes a higher
overhead on applications that frequently move data between scalar
and SIMD registers. Future performance enhancements to the \NEON{} ISA on
ARM could substantially improve the efficiency of Parabix.
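
The linear trend and the crossover between the two parsers can be summarized with a simple model. The coefficients $\alpha_p$ and $\beta_p$ and the density variable $d$ are hypothetical notation introduced here for illustration; they are not quantities fitted in this study.
\[
t_p(d) \approx \alpha_p + \beta_p\, d, \qquad p \in \{\mathrm{Parabix}, \mathrm{Expat}\},
\]
where $t_p(d)$ is the cost of parser $p$ (e.g., cycles per kB) on a file of markup density $d$, $\alpha_p$ is a density-independent cost, and $\beta_p$ is the additional cost per unit of markup density. If $\alpha_{\mathrm{Parabix}} > \alpha_{\mathrm{Expat}}$ and $\beta_{\mathrm{Parabix}} < \beta_{\mathrm{Expat}}$, the two lines intersect at
\[
d^{*} = \frac{\alpha_{\mathrm{Parabix}} - \alpha_{\mathrm{Expat}}}{\beta_{\mathrm{Expat}} - \beta_{\mathrm{Parabix}}},
\]
so that Expat is faster for $d < d^{*}$ and Parabix is faster for $d > d^{*}$. Under this sketch, a larger fixed SIMD overhead $\alpha_{\mathrm{Parabix}}$ moves $d^{*}$ toward higher densities, which would be consistent with Expat outperforming Parabix on the low-density workloads on the \CORTEXA8{}.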

\begin{figure*}[!htbp]
\begin{center}
\includegraphics[trim = 2mm 1mm 1mm 2mm, clip, height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}