% Source: docs/PACT2011/05-corei3.tex, revision 1047 (last changed in revision 1042 by lindanl)
\section{Baseline Evaluation on \CI{}}

%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache Behavior}
\CI\ has a three-level cache hierarchy.  The miss penalty for each
level is about 4 cycles, 11 cycles, and 36 cycles, respectively.  Figure
\ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
\ref{corei3_L3TM} show the L1, L2 and L3 data cache misses of all
four parsers.  Although XML parsing is not a memory-intensive
application, the cost of cache misses for Expat and Xerces can be about
half a cycle per byte, while the performance of Parabix is hardly
affected by cache misses.  Cache misses are a problem not only for
performance but also for energy consumption.  An L1 cache miss costs about
8.3nJ, an L2 cache miss about 19nJ, and an L3 cache miss about 40nJ.
With a 1GB input file, Expat would consume more than 0.6J and Xerces
would consume 0.9J on cache misses alone.

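One way to read these energy figures is as the miss counts at each
level weighted by the per-miss energy.  As a rough sketch, with
$N_{L1}$, $N_{L2}$ and $N_{L3}$ denoting the number of misses at each
level (symbols introduced here for illustration only):
\[
E_{miss} \approx N_{L1} \cdot 8.3\,\mathrm{nJ} + N_{L2} \cdot 19\,\mathrm{nJ} + N_{L3} \cdot 40\,\mathrm{nJ}.
\]
Assuming that 1GB means $10^{9}$ bytes, the totals above correspond to
more than $0.6\,\mathrm{J} / 10^{9} = 0.6$nJ per byte for Expat and
$0.9\,\mathrm{J} / 10^{9} = 0.9$nJ per byte for Xerces spent on cache
misses.
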
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
\end{center}
\caption{L1 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
\label{corei3_L1DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
\end{center}
\caption{L2 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
\label{corei3_L2DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
\end{center}
\caption{L3 Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
\label{corei3_L3TM}
\end{figure}

\subsection{Branch Mispredictions}
Despite years of improvement in branch prediction, branch
misprediction is still a significant performance bottleneck.  The
penalty of a branch misprediction is generally more than 10 CPU
cycles.  As shown in Figure \ref{corei3_BM}, the cost of branch
mispredictions for Expat can be more than 7 cycles per byte, which is
as much as the total processing time of Parabix2 on the same workload.

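The per-byte cost quoted above can be related to the misprediction
counts plotted in Figure \ref{corei3_BM}.  As a rough sketch, let $m$
be the number of mispredictions per KByte and $p$ the penalty in
cycles per misprediction (both symbols are introduced here for
illustration), and assume that a KByte is 1000 bytes:
\[
c_{BM} \approx \frac{m \cdot p}{1000} \quad \mbox{cycles per byte}.
\]
Taking $p = 10$ cycles, roughly the lower bound given above, a cost of
7 cycles per byte would correspond to $m = 700$ mispredictions per
KByte; a larger penalty implies proportionally fewer mispredictions
for the same cost.
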
Reducing the branch misprediction rate is difficult for text-based
applications due to the variable-length nature of syntactic elements.
Therefore, the alternative solution of reducing the number of branches
becomes more attractive.  However, the traditional byte-at-a-time
method of XML parsing usually involves a large number of unavoidable
branches.  As shown in Figure \ref{corei3_BR}, Xerces can execute an
average of 13 branches for each byte it processes on the high markup
density file.  Parabix substantially eliminates branches by using
parallel bit streams.  Parabix1 still has a few branches for each
block of 128 bytes (SSE) due to sequential scanning, but with the new
parallel scanning technique, Parabix2 is essentially branch-free, as
shown in Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
dependency on the markup density of the workload.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
\end{center}
\caption{Branches on \CI\ (y-axis: Branches per KByte)}
\label{corei3_BR}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
\end{center}
\caption{Branch Mispredictions on \CI\ (y-axis: Branch Mispredictions per KByte)}
\label{corei3_BM}
\end{figure}

\subsection{SIMD/Total Instructions}

Parabix gains its performance by using parallel bitstreams, which are
mostly generated and processed by SIMD instructions.  The ratio of
executed SIMD instructions to total instructions indicates the
amount of parallel processing we were able to achieve.  We use Intel
Pin, a dynamic binary instrumentation tool, to gather the instruction
mix and then add up all the vector instructions that have been executed.
Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
percentage of SIMD instructions for Parabix1 and Parabix2 (Expat and
Xerces do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
of the executed instructions are SIMD instructions.  By using
bitstream addition for parallel scanning, Parabix2 uses 60\% to 80\%
SIMD instructions.  Although the ratio decreases as the markup density
increases for both Parabix1 and Parabix2, the rate of decrease for
Parabix2 is much lower and thus the performance degradation caused by
increasing markup density is smaller.

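As an illustration of scanning by bitstream addition, the following is
a sketch based on the ScanThru operation described in the Parabix
literature; it is not a description of the measured implementation.
It assumes that bitstreams are written with text positions increasing
from left to right and that carries propagate in the same direction.
Given a marker stream $M$ and a character-class stream $C$,
\[
\mathrm{ScanThru}(M, C) = (C + M) \wedge \neg C .
\]
On a hypothetical 8-byte input, with $C$ marking decimal digits and
$M$ marking the start of each digit run, a single addition advances
both markers to the first position after their runs without any
branches:
\begin{center}
\begin{tabular}{ll}
input                   & \texttt{x123y45z} \\
$C$                     & \texttt{.111.11.} \\
$M$                     & \texttt{.1...1..} \\
$C + M$                 & \texttt{....1..1} \\
$(C + M) \wedge \neg C$ & \texttt{....1..1} \\
\end{tabular}
\end{center}
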
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p1.pdf}
\end{center}
\caption{Parabix1 SIMD Instruction Ratio (y-axis: percent)}
\label{corei3_INS_p1}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
\end{center}
\caption{Parabix2 SIMD Instruction Ratio (y-axis: percent)}
\label{corei3_INS_p2}
\end{figure}

\subsection{CPU Cycles}

Figure \ref{corei3_TOT} shows overall performance measured in CPU
cycles per thousand input bytes.  Parabix1 is 1.5 to
2.5 times faster on document-oriented input and 2 to 3 times faster on
data-oriented input compared with Expat and Xerces.  Parabix2 is 2.5
to 4 times faster on document-oriented input and 4.5 to 7 times faster
on data-oriented input.  Traditional parsers can be dramatically
slowed down by higher markup density, while Parabix with parallel
processing is less affected.  The comparison is not entirely fair to
Xerces, which transcodes its input into UTF-16, a step that typically
takes several cycles per byte.  However, transcoding using parallel
bitstreams can be much faster: it takes less than a cycle per byte
to transcode ASCII files such as road.gml, po.xml and soap.xml
\cite{Cameron2008}.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
\end{center}
\caption{Processing Time on \CI\ (y-axis: Total CPU Cycles per KByte)}
\label{corei3_TOT}
\end{figure}

\subsection{Power and Energy}
There is growing concern about power consumption and energy efficiency.
Chip producers work not only on improving performance but also
on developing power-efficient chips.  We studied the
power and energy consumption of Parabix in comparison with Expat and
Xerces on \CI{}.

Figure \ref{corei3_power} shows the average power consumed by the four
parsers.  The average power of \CI{} 530 is about 21 watts.
This model, released by Intel last year, has a good reputation for power
efficiency.  Parabix2, which is dominated by SIMD instructions, uses
only about 5\% more power than the other parsers.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
\end{center}
\caption{Average Power on \CI\ (watts)}
\label{corei3_power}
\end{figure}

The more interesting trend is energy.  Figure \ref{corei3_energy} shows
the energy consumption of the four parsers.  Although
Parabix2 draws slightly higher power, its processing time is much shorter
and it therefore consumes much less energy.  Parabix2 consumes 50 to 75
nJ per byte, while Expat and Xerces consume 80nJ to 320nJ and 140nJ to
370nJ per byte, respectively.

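The relationship between power, processing time and energy used above
can be made explicit.  As a sketch, let $P$ be the average power, $c$
the processing cost in cycles per byte and $f$ the clock frequency
(symbols introduced here for illustration):
\[
E_{byte} = P \cdot t_{byte} = \frac{P \cdot c}{f}.
\]
Taking $P \approx 22$ watts for Parabix2 (about 5\% above 21 watts)
and a nominal clock of $f = 2.93$GHz, which is an assumption not stated
in the measurements above, 50 to 75nJ per byte corresponds to roughly
6.7 to 10 cycles per byte.  Note also that 1nJ per byte equals
1$\mu$J per KByte when a KByte is taken as 1000 bytes, which relates
these figures to the units of Figure \ref{corei3_energy}.
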
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
\end{center}
\caption{Energy Consumption on \CI\ ($\mu$J per KByte)}
\label{corei3_energy}
\end{figure}