\section{Baseline Evaluation on \CITHREE{}}

%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache behavior}
\CITHREE\ has a three-level cache hierarchy.  The approximate miss penalty for each cache
level is 4, 11, and 36 cycles respectively.  Figure
\ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
\ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of the parsers.  Although XML parsing is not a
memory-intensive application, cache misses for the Expat and Xerces parsers represent a cost of 0.5 cycles per XML byte, whereas the performance of the Parabix parsers remains essentially
unaffected by data cache misses.  Cache misses not only consume additional CPU cycles but also increase application energy consumption.  L1, L2, and L3 cache misses consume
approximately 8.3~nJ, 19~nJ, and 40~nJ respectively. As such, given a 1~GB XML file as input, Expat and Xerces would consume over 0.6~J and 0.9~J respectively due to cache misses alone.
%With a 1GB input file, Expat would consume more than 0.6J and Xerces
%would consume 0.9J on cache misses alone.
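
As a rough check on these figures, the energy attributable to cache misses can be modelled as a sum over the three cache levels.  This is a sketch under stated assumptions, not the measurement procedure: $N$ denotes the input size in bytes, $m_i$ the level-$i$ misses per input byte, and $e_i$ the per-miss energies quoted above; these symbols are introduced here purely for illustration.
\[
E_{\mathrm{miss}} \approx N \sum_{i=1}^{3} m_i \, e_i ,
\qquad e_1 \approx 8.3\,\mathrm{nJ},\quad e_2 \approx 19\,\mathrm{nJ},\quad e_3 \approx 40\,\mathrm{nJ}.
\]
Taking 1~GB as $10^{9}$ bytes, totals of 0.6~J and 0.9~J correspond to roughly $0.6\,\mathrm{J} / 10^{9} = 0.6$~nJ and $0.9$~nJ of cache-miss energy per input byte for Expat and Xerces respectively.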


\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
\end{center}
\caption{L1 Data Cache Misses on \CITHREE\ (y-axis: Cache Misses per kB)}
\label{corei3_L1DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
\end{center}
\caption{L2 Data Cache Misses on \CITHREE\ (y-axis: Cache Misses per kB)}
\label{corei3_L2DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
\end{center}
\caption{L3 Cache Misses on \CITHREE\ (y-axis: Cache Misses per kB)}
\label{corei3_L3TM}
\end{figure}

\subsection{Branch Mispredictions}
Despite improvements in branch prediction, branch misprediction penalties contribute
significantly to the cost of XML parsing. On modern commodity processors the cost of a single branch
misprediction is generally cited as over 10 CPU cycles.  As shown in
Figure \ref{corei3_BM}, the cost of branch mispredictions per XML byte for Expat
can be over 7 cycles.  This cost alone is equal to the total cost for Parabix2 to process each byte of XML when given the same input.
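
The relationship between misprediction counts and per-byte cost can be made explicit.  As an illustrative model (the symbols $m$ and $p$ are introduced here and are not quantities reported by this study), let $m$ be the number of mispredictions per input byte and $p$ the penalty per misprediction in cycles:
\[
c_{\mathrm{BM}} = m \cdot p .
\]
Assuming $p = 10$ cycles and taking 1~kB as 1000 bytes, a cost of $c_{\mathrm{BM}} = 7$ cycles per byte corresponds to $m = 0.7$ mispredictions per byte, or about 700 mispredictions per kB on the scale used in Figure \ref{corei3_BM}.  A larger penalty implies proportionally fewer mispredictions for the same cost.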

Reducing the branch misprediction rate is difficult for text-based
applications due to the variable-length nature of syntactic elements.
The goal is therefore to reduce the total number of branches.  However, traditional byte-at-a-time XML
parsing requires a large number of unavoidable branches.  As
shown in Figure \ref{corei3_BR}, Xerces can execute an average of 13
branches for each byte it processes on the high markup density file.
Parabix1 minimizes branches by using parallel bit streams for each 128-bit block but still requires a few
branches for sequential scanning. Using the new parallel scanning technique, Parabix2 is relatively branch-free, as shown in Figure \ref{corei3_BR}. As a result, Parabix2 has minimal
dependency on the markup density of the workloads.
% Parabix1 minimize the branches by using parallel bit
% streams.  Parabix1 still have a few branches for each block of 128
% bytes (SSE) due to the sequential scanning.  But with the new parallel
% scanning technique, Parabix2 is essentially branch-free as shown in
% the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
% dependency on the markup density of the workloads.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
\end{center}
\caption{Branches on \CITHREE\ (y-axis: Branches per kB)}
\label{corei3_BR}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
\end{center}
\caption{Branch Mispredictions on \CITHREE\ (y-axis: Branch Mispredictions per kB)}
\label{corei3_BM}
\end{figure}

\subsection{SIMD Instructions vs. Total Instructions}

Parabix gains its performance by using parallel bitstreams, which are
mostly generated and processed by SIMD instructions.  The ratio of
executed SIMD instructions to total instructions indicates the
amount of parallel processing we were able to achieve.
Using Intel PIN, a dynamic binary instrumentation tool, we gathered the dynamic instruction mix of each XML workload and classified the instructions as either vector (SIMD) instructions or non-vector (non-SIMD) instructions.
Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
percentage of SIMD instructions for Parabix1 and Parabix2.
%(Expat and Xerces do not use any SIMD instructions)
For Parabix1, 18\% to 40\%
of the executed instructions are SIMD instructions.  By using
bitstream addition for parallel scanning, Parabix2 uses 60\% to 80\%
SIMD instructions.  Although the SIMD ratio decreases as the markup density
increases for both Parabix1 and Parabix2, the rate of decrease for
Parabix2 is much lower, and thus the performance penalty incurred by
increasing the markup density is reduced.
%Expat and Xerces do not use any SIMD instructions and were not included in this portion of the study.

% Parabix gains its performance by using parallel bitstreams, which are
% mostly generated and calculated by SIMD instructions.  The ratio of
% executed SIMD instructions over total instructions indicates the
% amount of parallel processing we were able to achieve.  We use Intel
% pin, a dynamic binary instrumentation tool, to gather instruction mix.
% Then we adds up all the vector instructions that have been executed.
% Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
% percentage of SIMD instructions of Parabix1 and Parabix2 (Expat and
% Xerce do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
% of the executed instructions consists of SIMD instructions.  By using
% bistream addition for parallel scanning, Parabix2 uses 60\% to 80\%
% SIMD instructions.  Although the ratio decrease as the markup density
% increase for both Parabix1 and Parabix2, the decreasing rate of
% Parabix2 is much lower and thus the performance degradation caused by
% increasing markup density is smaller.


\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p1.pdf}
\end{center}
\caption{Parabix1 SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)}
\label{corei3_INS_p1}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
\end{center}
\caption{Parabix2 SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)}
\label{corei3_INS_p2}
\end{figure}

\subsection{CPU Cycles}

Figure \ref{corei3_TOT} shows overall performance
measured in CPU cycles per thousand input bytes.  Parabix1 is 1.5 to
2.5 times faster on document-oriented input and 2 to 3 times faster on
data-oriented input compared with Expat and Xerces.  Parabix2 is 2.5
to 4 times faster on document-oriented input and 4.5 to 7 times faster
on data-oriented input.  Traditional parsers can be dramatically
slowed down by higher markup density, while Parabix with parallel
processing is less affected.  The comparison is not entirely fair to
Xerces, which transcodes its input into UTF-16, a step that typically takes
several cycles per byte.  However, transcoding using parallel
bitstreams can be much faster, taking less than a cycle per byte
to transcode ASCII files such as road.gml, po.xml and soap.xml
\cite{Cameron2008}.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
\end{center}
\caption{Processing Time on \CITHREE\ (y-axis: Total CPU Cycles per kB)}
\label{corei3_TOT}
\end{figure}

\subsection{Power and Energy}
There is growing concern about power consumption and energy efficiency.
Chip producers work not only on improving performance but also
on developing power-efficient chips. We studied the
power and energy consumption of Parabix in comparison with Expat and
Xerces on \CITHREE{}.

Figure \ref{corei3_power} shows the average power consumed by the four
different parsers.  The average power of \CITHREE\ 530 is about 21 watts.
This model, released by Intel in 2010, has a good reputation for power
efficiency.  Parabix2, dominated by SIMD instructions, uses only about
5\% more power than the other parsers.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
\end{center}
\caption{Average Power on \CITHREE\ (watts)}
\label{corei3_power}
\end{figure}

The more interesting trend is energy.  Figure \ref{corei3_energy} shows
the energy consumption of the four different parsers.  Although
Parabix2 requires slightly more power (per instruction), its processing time is significantly lower,
and it therefore consumes substantially less energy than the other parsers. Parabix2 consumes 50~nJ to 75~nJ
per byte, while Expat and Xerces consume 80~nJ to 320~nJ and 140~nJ to
370~nJ per byte respectively.
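
These per-byte figures can be related both to the plotted units and to processing time.  As a sketch, assuming the average power $P$ is roughly constant over a run and taking 1~kB as 1000 bytes (both assumptions made here for illustration), the energy per byte $E$ and the time per byte $t$ satisfy
\[
E = P \cdot t , \qquad 1\,\mathrm{nJ/byte} = 1\,\mu\mathrm{J/kB}.
\]
The 50 to 75~nJ per byte consumed by Parabix2 thus corresponds to 50 to 75~$\mu$J per kB in Figure \ref{corei3_energy}, and with $P \approx 21$~W to roughly $t = E/P \approx 2.4$ to $3.6$~ns per byte.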

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
\end{center}
\caption{Energy Consumption on \CITHREE\ ($\mu$J per kB)}
\label{corei3_energy}
\end{figure}