% Source: docs/HPCA2012/final_ieee/06-scalability.tex, revision 1743 (checked in by ashriram: first pass final version)
\section{Parabix on different platforms}
\label{section:scalability}
\subsection{Performance}
\label{section:scalability:intel}
In this section, we study the performance of the XML parsers across
three generations of Intel architectures.  Figure~\ref{Parabix_all_platform}
shows the average execution time of Parabix-XML (over all workloads).  We analyze the
execution time in terms of SIMD operations that operate on ``bit streams''
(\textit{bit-space}) and scalar operations that perform ``post
processing'' on the original source bytes.  In Parabix-XML, a significant
fraction of the overall execution time is spent on SIMD operations.

Our results demonstrate that Parabix-XML's optimizations complement
newer hardware improvements. For bit stream processing,
\CITHREE{} has a 40\% performance increase over \CO{};
similarly, \SB{} has a 20\% improvement compared to
\CITHREE{}. These gains appear to be independent of the markup
density of the input file.
Postprocessing operations
demonstrate data-dependent variance. Performance on the \CITHREE{} increases by
27\%--40\% compared to \CO{}, whereas performance on \SB{} increases by 16\%--29\%
compared to \CITHREE{}.
\CITHREE\ improves performance only by 29\% over \CO\ while \SB\
improves performance by less than 6\% over \CITHREE{}. Note that the
gains of \CITHREE\ over \CO\ include improvements in both clock
frequency and microarchitecture, while \SB{}'s gains can
be attributed mainly to the architecture.
Figure~\ref{Parabix_all_platform} also shows the average power consumption of
Parabix-XML over each workload as executed on each of the processor
cores: \CO{}, \CITHREE\ and \SB{}.  Each
processor generation appears to bring a 25--30\% improvement
in power consumption over the previous generation. Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}.
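As a rough plausibility check on the energy figure, and not a derivation taken from the measurements, energy can be written as the product of average power $P$ and execution time $T$. The sketch below assumes that (i) the 25--30\% per-generation power improvement compounds over the two generation steps, and (ii) the 40\% and 20\% bit stream gains can be read as reductions in execution time. Both readings are assumptions, and postprocessing time is ignored.
\[
\frac{E_{\mathrm{SB}}}{E_{\mathrm{CO}}}
  = \frac{P_{\mathrm{SB}}}{P_{\mathrm{CO}}} \cdot \frac{T_{\mathrm{SB}}}{T_{\mathrm{CO}}}
  \in \left[0.70^{2},\, 0.75^{2}\right] \times (0.60 \times 0.80)
  = \left[0.49,\, 0.5625\right] \times 0.48
  \approx \left[0.24,\, 0.27\right]
\]
Under these assumptions the energy reduction would be roughly 73--76\%, which is in line with the reported 72\%--75\%.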
\begin{figure}[!htb]
\begin{center}
{
\includegraphics[width=0.5\textwidth]{plots/Parabix2_all_platform.pdf}
}
\end{center}
\caption{Parabix on various hardware platforms}
\label{Parabix_all_platform}
\end{figure}
\begin{figure*}[!htbp]
\begin{center}
{
\subfigure[ARM Neon Performance (cycles per kB)]{
\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
\label{arm_processing_time}
}
\hfill
\subfigure[ARM Neon]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
\label{relative_performance_arm}
}
\hfill
\subfigure[Core i3]{
\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
\label{relative_performance_intel}
}
}
\end{center}
\caption{Comparison of Parabix-XML on ARM vs. Intel.}
\end{figure*}
\def\CORTEXA8{Cortex-A8}

\subsection{Parabix on mobile processors}
\label{section:scalability:\NEON{}}
Our experience with Intel processors led us to
question whether mobile processors with SIMD support, such as the ARM \CORTEXA8{},
could benefit from Parabix technology. ARM \NEON{} provides a 128-bit SIMD
instruction set similar in functionality to the Intel SSE3 instruction
set. In this section, we present our performance comparison of a
\NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded
from this portion of our study due to the complexity of the
cross-platform build process for C++ applications.

The platform we use is the Samsung Galaxy Android Tablet that houses a
Samsung S5PC110 ARM \CORTEXA8{} 1\,GHz single-core, dual-issue,
superscalar microprocessor. It includes a 32\,kB L1 data cache and a
512\,kB L2 shared cache.  Migration of Parabix-XML to the Android platform
only required developing a Parabix runtime library for ARM \NEON{}.
The majority of the runtime functionality was ported
directly. However, a small subset of key SIMD instructions (e.g., bit
packing) did not exist on \NEON{}. In such cases, the
logical equivalent of those instructions was emulated using the available
ISA. The resulting application was cross-compiled for
Android using the Android NDK.

A comparison of Figure~\ref{arm_processing_time} and Figure
\ref{corei3_TOT} demonstrates that the performance of both Parabix and
Expat degrades substantially on \CORTEXA8{} (5--17$\times$).
This result was expected given the comparatively performance-limited
\CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix
on each of the lower markup density workloads, dew.xml and jaw.xml. On
the remaining higher-density workloads, Parabix performs only
moderately better than Expat.  This
performance degradation for Parabix led us to investigate the latency
of \NEON{} SIMD operations.

Figure~\ref{relative_performance_arm} shows the performance of
Expat and Parabix for the various input workloads on the \CORTEXA8{};
Figure~\ref{relative_performance_intel} plots the performance for
\CITHREE{}. The results demonstrate that the execution time of
each parser varies in a linear fashion with respect to the markup
density of the file. On both \CORTEXA8{} and \CITHREE{}, both
parsers demonstrate the same trend: files with a lower markup density
exhibit higher levels of parallelism; consequently, the overhead of SIMD
instructions has a greater impact on the overall execution time for
those files.
The contrast between Figures~\ref{relative_performance_arm} and~\ref{relative_performance_intel} provides
insight into the problem: Parabix-XML's performance is hindered by SIMD
instruction latency.  This is possibly because the \NEON{} SIMD extensions are
implemented as a coprocessor on the \CORTEXA8{}, which imposes a higher
overhead for applications that frequently inter-operate between scalar
and SIMD registers. Future performance enhancements to ARM \NEON{} that
implement \NEON{} within the core microarchitecture could
substantially improve the efficiency of Parabix-XML.
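The linear trend noted above suggests a simple way to reason about where the two parsers break even. As a sketch only, suppose the execution time of parser $p$ on a file of markup density $d$ is modeled as $T_p(d) \approx a_p + b_p\, d$, where the intercept $a_p$ and slope $b_p$ are hypothetical fit parameters introduced here for illustration (no fitted values are implied). Assuming $b_{\mathrm{Expat}} > b_{\mathrm{Parabix}}$, Parabix is faster exactly when
\[
T_{\mathrm{Parabix}}(d) < T_{\mathrm{Expat}}(d)
\iff
d > d^{*} = \frac{a_{\mathrm{Parabix}} - a_{\mathrm{Expat}}}{b_{\mathrm{Expat}} - b_{\mathrm{Parabix}}} .
\]
Under this model, a larger fixed SIMD overhead $a_{\mathrm{Parabix}}$ pushes the break-even density $d^{*}$ upward, which would be consistent with Expat outperforming Parabix on the low-density workloads on the \CORTEXA8{}.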
\begin{figure*}[!htbp]
\begin{center}
\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
\end{center}
\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
\label{insmix}
\end{figure*}