source: docs/HPCA2012/05-corei3.tex @ 1371

Last change on this file since 1371 was 1371, checked in by lindanl, 8 years ago

modifications

File size: 8.9 KB
Line 
1\section{Efficiency of the Parabix}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on a fixed platform, the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behaviour}
11The approximate miss penalty in \CITHREE\ for L1, L2 and L3 caches is
124, 11, and 36 cycles respectively.  The L1 (32KB) and L2 cache (256KB)
13are private per core, while the 4MB L3 is shared by all the
14cores. Figure \ref{cache_misses} shows the cache misses per kilobyte
15of input data. Analytically, the cache misses for the Expat and Xerces
16parsers represent a 0.5 cycle per XML byte processed. This overhead
17does not necessarily reflect in the overall performance of these
18parsers as they experience other overheads related to branch
19mispredictions. Parabix's data reorganization significantly improves
20the overall cache miss rate. We experience 7$\times$ less misses than
21Expat and 25$\times$ less misses than Xerces at the L1 and 104$\times$ less misses than
22Expat and 15$\times$ less misses than Xerces at the L2 level. The improved cache
23utilization keeps the SIMD units busy and prevent memory related
24stalls. Note that cache misses also cause increased application energy
25consumption due to increased energy required to access higher levels
26in the cache hierarchy. We estimated with microbenchmarks that the L1,
27L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ
28respectively. For a 1GB XML file Expat and Xerces would consume over
290.6J and 0.9J respectively due to cache misses alone.
30%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
31%would consume 0.9J on cache misses alone.
32
33
34\begin{figure}
35\subfigure[L1 Misses]{
36\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
37\label{corei3_L1DM}
38}
39\subfigure[L2 Misses]{
40\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
41\label{corei3_L2DM}
42}
43\subfigure[L3 Misses]{
44\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
45\label{corei3_L3DM}
46}
47\caption{Cache Misses per kB of input data.}
48\label{cache_misses}
49\end{figure}
50
51\subsection{Branch Mispredictions}
52In general, reducing the branch misprediction rate is difficult in
53text-based XML parsing applications. This is due to (1) variable
54length nature of the syntactic elements contained within XML
55documents, (2) a data dependent characteristic, and (3) the extensive
56set of syntax constraints imposed by the XML. Traditional
57byte-at-a-time XML parser's performance is limited by the number of
58branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces
59averages up to 13 branches per XML byte processed on high density
60markup. On modern commodity processors the cost of a single branch
61misprediction is incur over 10s of CPU cycles to restart the processor
62pipeline. The high miss prediction rate in conventional parsers add
63significant overhead. In Parabix the transformation to SIMD operation
64eliminates many branches. Further optimizations take advantage of
65Parabix's data organization and replace condition branches with {\em
66  bit scan} operations that can process up to 128 characters worth of
67branches with one operation. In many cases, we also replace the
68branches with logical predicate operations. Our predicate are cheaper
69to compute since they involve only bit parallel SIMD operations.
70
71 As shown in Figure \ref{corei3_BR},
72Parabix processing is almost branch free. Parabix exhibits minimal
73dependence on source XML markup density; it experiences between 19.5 and
7430.7 branch mispredictions per thousand of XML byte. The cost of
75branch mispredictions for the Expat parser can be over 7 cycles per
76XML byte (see Figure \ref{corei3_BM}) ---this cost alone is higher
77than the average latency of a byte processed by Parabix.
78
79
80
81
82\begin{figure}
83\subfigure[Branch Instructions]{
84\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
85\label{corei3_BR}
86}
87\hfill
88\subfigure[Branch Misses]{
89\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
90\label{corei3_BM}
91}
92\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
93\end{figure}
94
95\subsection{SIMD Instructions vs. Total Instructions}
96
97In Parabix, bit streams are both computed and
98predominately operated upon using the SIMD instructions of commodity
99processors.  The ratio of retired SIMD instructions to total
100instructions provides insight into the relative degree to which
101Parabix achieves parallelism over the byte-at-a-time approach.
102
103
104Using the Intel Pin tool, we gather the dynamic instruction mix for
105each XML workload, and classify instructions as either vector (SIMD)
106or non-vector instructions.  Figures~\ref{corei3_INS_p2} show the
107percentage of SIMD instructions for Parabix. The ratio of executed
108SIMD instructions over total instructions indicates the amount of
109parallel processing we were able to extract.
110%(Expat and Xerce do not use any SIMD instructions)
111The Parabix instruction mix is made up of 60\% to 80\% SIMD
112instructions.  The markup density of the files influence the number of
113scalar instructions needed to handle the tag processing which affects
114the overall parallelism that can be extracted by Parabix.  We find
115that degradation rate is low and thus the performance
116penalty incurred by increasing the markup density is minimal.
117%Expat and Xerce do not use any SIMD instructions and were not
118%included in this portion of the study.
119
120% Parabix gains its performance by using parallel bitstreams, which
121% are mostly generated and calculated by SIMD instructions.  We use Intel
122% pin, a dynamic binary instrumentation tool, to gather instruction
123% mix.  Then we adds up all the vector instructions that have been
124% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
125% show the percentage of SIMD instructions of Parabix1 and Parabix
126% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
127% 18\% to 40\% of the executed instructions consists of SIMD
128% instructions.  By using bistream addition for parallel scanning,
129% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
130% decrease as the markup density increase for both Parabix1 and
131% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
132% performance degradation caused by increasing markup density is
133% smaller.
134
135\subsection{CPU Cycles}
136
137Figure \ref{corei3_TOT} shows overall parser performance evaluated in
138terms of CPU cycles per kilobyte.  Parabix is 2.5$\times$
139to 4$\times$ faster on document-oriented input and 4.5 to 7 times faster
140on data-oriented input.  Traditional parsers can be dramatically
141slowed by dense markup, while Parabix2 is generally unaffected.  The
142results presented are not entirely fair to the Xerces parser since it
143first transcodes input from UTF-8 to UTF-16 before processing. In
144Xerces, this transcoding requires several cycles per byte.  However,
145transcoding using parallel bit streams is significantly faster and
146requires less than a single cycle per byte.
147
148
149\begin{figure}
150\subfigure[Performance : \# Cycles/kb]{
151\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
152\label{corei3_TOT}
153}
154\hfill
155\subfigure[SIMD Instruction Breakdown. Y Axis :  \% SIMD Instruction/kb]{
156\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
157\label{corei3_INS_p2}
158}
159\end{figure}
160
161
162\subsection{Power and Energy}
163In this section, we study the power and energy consumption of Parabix in
164comparison with Expat and Xerces on \CITHREE{}. The average power of
165\CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the
166average power consumed by each parser.  Parabix, dominated by SIMD
167instructions which uses approximately 5\% additional power. While the
168SIMD functional units are significantly wider than the scalar
169counterparts; register width and functional unit power account only
170for a small fraction of the overall power consumption in a processor
171pipeline. More importantly by using data parallel operations Parabix
172amortizes the fetch and data access overheads. This results in minimal
173power increase compared to the conventional parsers. 
174Perhaps the energy trends shown in Figure
175\ref{corei3_energy} reveal an interesting trend. Parabix consumes
176substantially less energy than the other parsers. Parabix consumes 50
177to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
178140nJ to 370nJ per byte respectively.  Although Parabix
179requires slightly more power (per instruction), the processing time of
180Parabix is significantly lower.
181
182
183
184
185
186
187\begin{figure}
188\subfigure[Avg. Power (Watts)]{
189\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
190\label{corei3_power}
191}
192\hfill
193\subfigure[Energy Consumption ($\mu$J per kB)]{
194\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
195\label{corei3_energy}
196}
197\end{figure}
198
199
Note: See TracBrowser for help on using the repository browser.