\section{Evaluation on Core i3}

%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache behavior}
The Core i3 has a three-level cache hierarchy.
The miss penalties for the L1, L2 and L3 caches are about 4, 11 and 36 cycles, respectively.
Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure \ref{corei3_L3TM} show the L1, L2 and L3 cache misses for all four parsers.
Although XML parsing is not a memory-intensive application,
cache misses can cost Expat and Xerces about half a cycle per byte, while the performance of Parabix is hardly affected by cache misses.
Cache misses affect not only performance but also energy consumption.
An L1 cache miss costs about 8.3nJ, an L2 cache miss about 19nJ, and an L3 cache miss about 40nJ.
With a 1GB input file, Expat would consume more than 0.6J and Xerces about 0.9J on cache misses.
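
As a rough consistency check on these figures, the total cache-miss energy can be modeled as a weighted sum of miss counts.
The symbols $N_{1}$, $N_{2}$ and $N_{3}$ (the numbers of L1, L2 and L3 misses) are introduced here for illustration only; the sketch assumes that the per-level costs are additive and that the 1GB file contains $10^{9}$ bytes:
\[
E_{\mathrm{miss}} \approx N_{1} \cdot 8.3\,\mathrm{nJ} + N_{2} \cdot 19\,\mathrm{nJ} + N_{3} \cdot 40\,\mathrm{nJ}
\]
\[
\frac{0.6\,\mathrm{J}}{10^{9}\,\mathrm{B}} = 0.6\,\mathrm{nJ/B}, \qquad \frac{0.9\,\mathrm{J}}{10^{9}\,\mathrm{B}} = 0.9\,\mathrm{nJ/B}
\]
Under these assumptions, cache misses cost Expat more than 0.6nJ and Xerces about 0.9nJ per input byte.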


\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_L1DM.pdf}
\end{center}
\caption{L1 Data Cache Misses / KB on Core i3}
\label{corei3_L1DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_L2DM.pdf}
\end{center}
\caption{L2 Data Cache Misses / KB on Core i3}
\label{corei3_L2DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_L3CM.pdf}
\end{center}
\caption{L3 Cache Misses / KB on Core i3}
\label{corei3_L3TM}
\end{figure}

\subsection{Branch Mispredictions}
Despite years of improvement in branch prediction, branch mispredictions remain a significant performance bottleneck.
The penalty of a branch misprediction is generally more than 10 CPU cycles.
As shown in Figure \ref{corei3_BM}, the cost of branch mispredictions for Expat can be more than 7 cycles per byte,
which is as much as the processing time of Parabix2 on the same workload.
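
The relationship between misprediction counts and cycle cost can be sketched as follows.
The symbols $m$ (mispredictions per byte) and $p$ (penalty in cycles per misprediction) are introduced here for illustration; the values used are the round figures quoted above, not additional measurements:
\[
C_{\mathrm{BM}} = m \cdot p, \qquad m = \frac{C_{\mathrm{BM}}}{p} \approx \frac{7\ \mathrm{cycles/B}}{10\ \mathrm{cycles}} = 0.7\ \mathrm{mispredictions/B}
\]
Under these assumptions, a cost of 7 cycles per byte corresponds to roughly 700 mispredictions per thousand input bytes; a larger penalty would imply proportionally fewer mispredictions for the same cost.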

Reducing the branch misprediction rate is difficult for text-based applications due to the variable-length nature of syntactic elements.
Therefore, the alternative of reducing the number of branches becomes more attractive.
However, the traditional byte-at-a-time method of XML parsing usually involves a large number of unavoidable branches.
As shown in Figure \ref{corei3_BR}, Xerces can execute an average of 13 branches for each byte it processes on the high markup density file.
Parabix substantially eliminates branches by using parallel bit streams.
Parabix1 still has a few branches for each block of 128 bytes (SSE) due to its sequential scanning.
With the new parallel scanning technique, however, Parabix2 is essentially branch-free, as shown in Figure \ref{corei3_BR}.
As a result, Parabix2 is much less dependent on the markup density of the workload.

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_BR.pdf}
\end{center}
\caption{Branches / KB on Core i3}
\label{corei3_BR}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_BM.pdf}
\end{center}
\caption{Branch Mispredictions / KB on Core i3}
\label{corei3_BM}
\end{figure}

\subsection{SIMD/Total Instructions}

Parabix gains its performance by using parallel bitstreams, which are mostly generated and processed by SIMD instructions.
The ratio of executed SIMD instructions to total instructions indicates the amount of parallel processing we were able to achieve.
We use Intel Pin, a dynamic binary instrumentation tool, to gather the instruction mix
and then add up all the vector instructions that were executed.
Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the percentage of SIMD instructions
for Parabix1 and Parabix2 (Expat and Xerces do not use any SIMD instructions).
For Parabix1, 18\% to 40\% of the executed instructions are SIMD instructions.
By using bitstream addition for parallel scanning, Parabix2 raises this share to 60\% to 80\%.
Although the ratio decreases as the markup density increases for both Parabix1 and Parabix2,
the rate of decrease for Parabix2 is much lower, and thus
the performance degradation caused by increasing markup density is smaller.

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_INS_p1.pdf}
\end{center}
\caption{Vector instructions vs. non-vector instructions for Parabix1 on Core i3}
\label{corei3_INS_p1}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_INS_p2.pdf}
\end{center}
\caption{Vector instructions vs. non-vector instructions for Parabix2 on Core i3}
\label{corei3_INS_p2}
\end{figure}

\subsection{CPU Cycles}

Figure \ref{corei3_TOT} shows overall performance, measured in CPU cycles per thousand input bytes.
Compared with Expat and Xerces, Parabix1 is 1.5 to 2.5 times faster on document-oriented input and 2 to 3 times faster on data-oriented input.
Parabix2 is 2.5 to 4 times faster on document-oriented input and 4.5 to 7 times faster on data-oriented input.
Traditional parsers can be slowed down dramatically by higher markup density, while Parabix, with its parallel processing, is less affected.
The comparison is not entirely fair to Xerces, which transcodes its input into UTF-16, a step that typically takes several cycles per byte.
However, transcoding using parallel bitstreams can be much faster:
it takes less than a cycle per byte to transcode ASCII files such as road.gml, po.xml and soap.xml \cite{Cameron2008}.

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_TOT.pdf}
\end{center}
\caption{Total CPU Cycles / KB on Core i3}
\label{corei3_TOT}
\end{figure}

\subsection{Power and Energy}
There is growing concern about power consumption and energy efficiency.
Chip manufacturers work not only on improving performance but also on developing power-efficient chips.
We studied the power and energy consumption of Parabix in comparison with Expat and Xerces on the Core i3.
We use a clamp to measure the actual current on the CPU power supply line and a meter to sample and record the results every 10ms.

Figure \ref{corei3_power} shows the average power consumed by the four parsers.
The average power of the Core i3-530 is about 21 watts.
This model, released by Intel last year, has a good reputation for power efficiency.
Parabix2, which is dominated by SIMD instructions, uses only about 5\% more power than the other parsers.
The power range of SIMD instructions .....

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_power.pdf}
\end{center}
\caption{Average Power on Core i3 (watts)}
\label{corei3_power}
\end{figure}

Figure \ref{corei3_energy} shows the energy consumption of the four parsers.
Although Parabix2 draws slightly more power, its processing time is much shorter and it therefore consumes much less energy.
Parabix2 consumes 50 to 75 nJ per byte, while Expat and Xerces consume 80 to 320 nJ and 140 to 370 nJ per byte, respectively.
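
These per-byte energies follow from average power multiplied by processing time.
As a sketch, let $P$ be the average power, $c$ the number of cycles per byte and $f$ the clock frequency.
These symbols are introduced here for illustration; $P \approx 21$\,W is used as a round value, and $f \approx 2.93$\,GHz is an assumed clock rate for the Core i3-530 that is not stated in this section:
\[
E_{\mathrm{byte}} = P \cdot t_{\mathrm{byte}} = \frac{P \cdot c}{f}
\]
\[
t_{\mathrm{byte}} = \frac{E_{\mathrm{byte}}}{P} \approx \frac{50\,\mathrm{nJ}}{21\,\mathrm{W}} \approx 2.4\,\mathrm{ns}, \qquad c = t_{\mathrm{byte}} \cdot f \approx 2.4\,\mathrm{ns} \times 2.93\,\mathrm{GHz} \approx 7\ \mathrm{cycles/B}
\]
Under these assumptions, the low end of the Parabix2 energy range corresponds to about 7 cycles per byte.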

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/corei3_energy.pdf}
\end{center}
\caption{Energy consumption on Core i3 (nJ/B)}
\label{corei3_energy}
\end{figure}
