\section{Baseline Evaluation on \CI{}}

%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache behavior}
\CI\ has a three-level cache hierarchy.  The miss penalty for each
level is approximately 4, 11, and 36 cycles respectively.  Figures
\ref{corei3_L1DM}, \ref{corei3_L2DM}, and \ref{corei3_L3TM} show the
L1, L2, and L3 data cache misses of all four parsers.  Although XML
parsing is not a memory-intensive application, the cost of cache
misses for Expat and Xerces can reach about half a cycle per byte,
while the performance of Parabix is essentially unaffected by cache
misses.  Cache misses hurt not only performance but also energy
consumption: an L1 miss costs about 8.3nJ, an L2 miss about 19nJ, and
an L3 miss about 40nJ.  With a 1GB input file, Expat and Xerces would
consume over 0.6J and 0.9J respectively due to cache misses alone.
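These per-file estimates follow from a simple per-level sum (a rough
sketch in our own notation: $n_1$, $n_2$ and $n_3$ denote the measured
L1, L2 and L3 misses per input byte, plotted per KByte in Figures
\ref{corei3_L1DM}--\ref{corei3_L3TM}):
\[
E_{\text{miss}} \approx
\bigl(8.3\,\text{nJ}\cdot n_1 + 19\,\text{nJ}\cdot n_2
      + 40\,\text{nJ}\cdot n_3\bigr)\times 10^{9}\;\text{bytes}.
\]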

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
\end{center}
\caption{L1 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
\label{corei3_L1DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
\end{center}
\caption{L2 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
\label{corei3_L2DM}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
\end{center}
\caption{L3 Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
\label{corei3_L3TM}
\end{figure}

\subsection{Branch Mispredictions}
Despite years of architectural improvement, branch misprediction
remains a significant performance bottleneck.  The cost of a branch
misprediction is generally over 10 CPU cycles.  As shown in Figure
\ref{corei3_BM}, the cost of branch mispredictions per byte of XML for
Expat can be over 7 cycles---approximately the number of cycles
Parabix2 requires to process an entire byte of XML data on the same
workload.

Reducing the branch misprediction rate is difficult for text-based
applications due to the variable-length nature of syntactic elements,
so the goal is instead to reduce the total number of branches.
However, traditional byte-at-a-time XML parsing requires a large
number of unavoidable branches.  As shown in Figure \ref{corei3_BR},
Xerces can average 13 branches for each byte it processes on the
high-markup-density file.  Parabix1 minimizes branches by computing
parallel bit streams for each 128-byte block (one bit per input byte
in a 128-bit SSE register) but still requires a few branches for
sequential scanning.  Using the new parallel scanning technique,
Parabix2 is essentially branch-free, as shown in Figure
\ref{corei3_BR}.  As a result, Parabix2 has minimal dependency on the
markup density of the workloads.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
\end{center}
\caption{Branches on \CI\ (y-axis: Branches per KByte)}
\label{corei3_BR}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
\end{center}
\caption{Branch Mispredictions on \CI\ (y-axis: Branch Mispredictions per KByte)}
\label{corei3_BM}
\end{figure}

\subsection{SIMD Instructions vs. Total Instructions}

Parabix gains its performance by using parallel bitstreams, which are
mostly generated and manipulated by SIMD instructions.  The ratio of
executed SIMD instructions to total instructions indicates the amount
of parallel processing we were able to achieve.  Using Intel Pin, a
dynamic binary instrumentation tool, we gathered the instruction mix
executed on each XML workload and classified each instruction as
either a vector (SIMD) or non-vector (non-SIMD) instruction.  Figures
\ref{corei3_INS_p1} and \ref{corei3_INS_p2} show the percentage of
SIMD instructions in Parabix1 and Parabix2; Expat and Xerces do not
use any SIMD instructions and were not included in this portion of the
study.  For Parabix1, 18\% to 40\% of the executed instructions are
SIMD instructions.  By using bitstream addition for parallel scanning,
Parabix2 raises this to 60\% to 80\%.  Although the SIMD ratio
decreases as markup density increases for both Parabix1 and Parabix2,
the rate of decrease for Parabix2 is much lower, and thus the
performance penalty incurred by increasing markup density is smaller.
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p1.pdf}
\end{center}
\caption{Parabix1 SIMD Instruction Ratio (y-axis: percent)}
\label{corei3_INS_p1}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
\end{center}
\caption{Parabix2 SIMD Instruction Ratio (y-axis: percent)}
\label{corei3_INS_p2}
\end{figure}

\subsection{CPU Cycles}

Figure \ref{corei3_TOT} shows the overall performance, measured in
CPU cycles per thousand input bytes.  Compared with Expat and Xerces,
Parabix1 is 1.5 to 2.5 times faster on document-oriented input and 2
to 3 times faster on data-oriented input.  Parabix2 is 2.5 to 4 times
faster on document-oriented input and 4.5 to 7 times faster on
data-oriented input.  Traditional parsers can be dramatically slowed
by higher markup density, while Parabix, with its parallel processing,
is much less affected.  The comparison is not entirely fair to Xerces,
which transcodes its input to UTF-16, an operation that typically
takes several cycles per byte.  However, transcoding using parallel
bitstreams can be much faster, taking less than a cycle per byte for
ASCII files such as road.gml, po.xml and soap.xml \cite{Cameron2008}.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
\end{center}
\caption{Processing Time on \CI\ (y-axis: Total CPU Cycles per KByte)}
\label{corei3_TOT}
\end{figure}

\subsection{Power and Energy}
Power consumption and energy efficiency are growing concerns.  Chip
producers work not only on improving performance but also on
developing power-efficient chips.  We studied the power and energy
consumption of Parabix in comparison with Expat and Xerces on \CI{}.

Figure \ref{corei3_power} shows the average power consumed by the four
parsers.  The average power of the \CI{} 530 is about 21 watts.  This
model, released by Intel last year, has a good reputation for power
efficiency.  Parabix2, dominated by SIMD instructions, draws only
about 5\% more power than the other parsers.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
\end{center}
\caption{Average Power on \CI\ (watts)}
\label{corei3_power}
\end{figure}

The more interesting trend is energy.  Figure \ref{corei3_energy}
shows the energy consumption of the four parsers.  Although Parabix2
draws slightly more power, its processing time is significantly lower,
and it therefore consumes substantially less energy than the other
parsers.  Parabix2 consumes 50 to 75 nJ per byte, while Expat and
Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
\end{center}
\caption{Energy Consumption on \CI\ ($\mu$J per KByte)}
\label{corei3_energy}
\end{figure}
