\subfigure[ARM Neon Performance]{
\subfigure[Performance ARM Neon vs Core i3 SSE.]{
\section{Parabix on Mobile Platforms}
The Samsung Galaxy Tab GT-P1000M device houses a Samsung S5PC110 ARM
\CORTEXA8{} 1\,GHz single-core, dual-issue, superscalar
microprocessor. It includes a 32\,kB L1 data cache and a 512\,kB shared L2
cache. In addition to the standard feature set found in such low-power
32-bit microprocessors, the S5PC110 includes the ARM NEON
general-purpose SIMD engine. ARM NEON makes available a 128-bit SIMD
instruction set similar in functionality to the Intel SSE3 instruction
set. In this section, we present a performance comparison of a
NEON-based port of Parabix2 versus the Expat parser, both executed on
the Samsung Galaxy Tab GT-P1000M hardware.  Xerces is excluded from
this portion of our study due to the complexity of the cross-platform
build process in porting native C/C++ applications to the Android
platform.
\subsection{Performance Results}
Migration of Parabix2 to the Android platform began with the
re-targeting of a subset of the Parabix2 IDISA SIMD library for ARM
NEON.  This library code was cross-compiled for Android using the
Android NDK. The Android NDK is a companion tool to the Android SDK
that allows developers to build performance-critical portions of
applications in native code. The majority of the Parabix2 SIMD
functionality ported directly. However, for a small subset of the
Parabix2 SIMD functions, NEON equivalents did not exist. In such cases we
emulated logically equivalent operations using the available
instructions.
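
The emulation idea can be illustrated with a small sketch. The text does not name the functions that lacked NEON equivalents, so the choice below is an assumption made purely for illustration: the SSE2 byte movemask operation (\texttt{PMOVMSKB}), which has no single-instruction counterpart in ARMv7 NEON. The sketch is a portable scalar model, not NEON intrinsics and not the Parabix2 IDISA code; \texttt{Vec16}, \texttt{movemask\_reference}, \texttt{movemask\_emulated}, and \texttt{check\_equivalence} are hypothetical names. It builds the mask from lane-wise shifts, lane-wise multiplies by bit weights, and horizontal adds, all of which are operation types that NEON does provide.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit SIMD register holding 16 byte lanes.
using Vec16 = std::array<std::uint8_t, 16>;

// Reference semantics of a byte movemask: bit i of the result is the
// most significant bit of byte lane i.
std::uint16_t movemask_reference(const Vec16& v) {
    std::uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        mask = static_cast<std::uint16_t>(mask | (((v[i] >> 7) & 1u) << i));
    }
    return mask;
}

// Emulation built only from steps that map onto lane-wise SIMD operations:
//   1. shift every lane right by 7 (isolates the top bit as 0 or 1),
//   2. multiply every lane by a per-lane bit weight (1, 2, 4, ..., 128),
//   3. horizontally add the lanes of each 8-byte half.
// Within a half the weights are distinct powers of two, so the sum
// cannot carry and equals the bitwise OR of the weighted lanes.
std::uint16_t movemask_emulated(const Vec16& v) {
    static const std::uint8_t weight[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    Vec16 t{};
    for (int i = 0; i < 16; ++i) {
        t[i] = static_cast<std::uint8_t>((v[i] >> 7) * weight[i % 8]);
    }
    unsigned lo = 0;
    unsigned hi = 0;
    for (int i = 0; i < 8; ++i) {
        lo += t[i];      // horizontal add over lanes 0..7
        hi += t[i + 8];  // horizontal add over lanes 8..15
    }
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

// Checks that the emulation agrees with the reference for one input.
bool check_equivalence(const Vec16& v) {
    return movemask_reference(v) == movemask_emulated(v);
}
```

An emulation of this kind replaces one instruction with a short sequence of several, which is one plausible way for a port to incur a higher per-operation cost on the target hardware.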

A comparison of Figure \ref{arm_processing_time} and Figure
\ref{corei3_TOT} demonstrates that the performance of both Parabix2
and Expat degrades substantially on \CORTEXA8{}.  This result was
expected given the comparatively performance-limited \CORTEXA8{} hardware
architecture.  Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix2
on each of the lower markup density workloads, dew.xml and jaw.xml. On
the remaining higher-density workloads, Parabix2 performs only
moderately better than Expat.  The higher latency of the NEON
instructions on \CORTEXA8{} is the likely contributor to this loss in
performance. A more interesting aspect of this result is demonstrated
in Figure \ref{relative_performance_arm_vs_i3}. This figure demonstrates
that the performance of each parser degrades by a roughly
constant factor across workloads.  That is, compared to the \CITHREE{}, on the
GT-P1000M, Parabix2 and Expat operate at approximately 17.2\% and
55.7\% efficiency, respectively. Figure
\ref{relative_performance_arm_vs_i3} shows that the baseline cost of
Parabix2 operations implemented using the NEON instruction set, and
thereby the baseline cost of Parabix2, is substantially higher on the
\CORTEXA8{} processor.  Given that Parabix2 was not designed with the
limitations of the \CORTEXA8{} in mind, a careful
analysis of the cost of each instruction provided in the ARMv7 ISA may
in future allow us to better utilize the hardware resources provided. In
particular, future performance enhancements to ARM NEON could result in
a substantial overall improvement in Parabix2 execution time.
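
As a rough reading of the efficiency figures above, assume that efficiency denotes the ratio of \CITHREE{} execution time to \CORTEXA8{} execution time on the same workload. The text does not spell out this definition, so it is an assumption. Under it, the implied slowdown factors are
\[
  \mathrm{slowdown}_{\mathrm{Parabix2}} \approx \frac{1}{0.172} \approx 5.8,
  \qquad
  \mathrm{slowdown}_{\mathrm{Expat}} \approx \frac{1}{0.557} \approx 1.8,
\]
and their ratio is
\[
  \frac{\mathrm{slowdown}_{\mathrm{Parabix2}}}{\mathrm{slowdown}_{\mathrm{Expat}}}
  \approx \frac{0.557}{0.172} \approx 3.2 .
\]
On this reading, moving from the \CITHREE{} to the \CORTEXA8{} narrows the advantage of Parabix2 over Expat by roughly a factor of three, which is consistent with Expat overtaking Parabix2 on the lower-density workloads.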