source: docs/PACT2011/05-corei3.tex @ 1108

Last change on this file since 1108 was 1108, checked in by ksherdy, 8 years ago

Significant rewrite.

File size: 9.7 KB
RevLine 
[1079]1\section{Baseline Evaluation on \CITHREE{}}
[969]2
3%some of the numbers are roughly calculated, needs to be recalculated for final version
4\subsection{Cache behavior}
[1107]5\CITHREE\ has a three level cache hierarchy.  The approximate miss penalty for each cache
6level is 4, 11, and 36 cycles respectively.  Figure
[1001]7\ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
[1107]8\ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of the parsers.  Although XML parsing is non memory intensive
9application, cache misses for the Expat and Xerces parsers represent a 0.5 cycle per XML byte cost whereas the performance of the Parabix parsers remains essentially
10unaffected by data cache misses.  Cache misses not only consume additional CPU cycles but increase application energy consumption.  L1, L2, and L3 cache misses consume
11approximately 8.3nJ, 19nJ, and 40nJ respectively. As such, given a 1GB XML file as input, Expat and Xerces would consume over 0.6J and 0.9J respectively due to cache misses alone.
12%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
[1048]13%would consume 0.9J on cache misses alone.
[969]14
15
16\begin{figure}
17\begin{center}
[1004]18\includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
[969]19\end{center}
[1107]20\caption{L1 Data Cache Misses on \CITHREE\ (y-axis: Cache Misses per kB)}
[969]21\label{corei3_L1DM}
22\end{figure}
23
24\begin{figure}
25\begin{center}
[1004]26\includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
[969]27\end{center}
[1107]28\caption{L2 Data Cache Misses on \CITHREE\ (y-axis: Cache Misses per kB)}
[969]29\label{corei3_L2DM}
30\end{figure}
31
32\begin{figure}
33\begin{center}
[1004]34\includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
[969]35\end{center}
[1107]36\caption{L3 Cache Misses on \CITHREE\ (y-axis: Cache Misses per kB)}
[969]37\label{corei3_L3TM}
38\end{figure}
39
40\subsection{Branch Mispredictions}
[1107]41Despite improvements in branch prediction, branch misprediction penalties contribute
42significantly to XML parsing performance. On modern commodity processors the cost of a single branch
[1108]43misprediction is commonly cited as over 10 CPU cycles.  As shown in
44Figure \ref{corei3_BM}, the cost of branch mispredictions for the Expat parser
45can be over 7 cycles per XML byte---this cost alone is equal to the average total cost for Parabix2 to process each byte of XML.
[969]46
[1108]47In general, reducing the branch misprediction rate is difficult in text-based XML parsing
48applications. This is due in part to the variable length nature of the syntactic elements contained within XML documents, a data dependent characterstic,
49as well as the extensive set of syntax constraints imposed by the XML 1.0 specification. As such, traditional byte-at-a-time XML parsers generate a performance limiting
50number of branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces averages up to 13
51branches per XML byte processed on high density markup.
52
53The performance improvement of Parabix1 in terms of branch mispredictions results from the veritable elimination of conditional branch instructions in scanning. Leveraging the processor built-in {\em bit scan}
54operation together with parallel bit stream technology Parabix1 can scan up to 64 bytes of source XML with a single {\em bit scan} instruction. In comparison, a byte-at-a-time parser must
55process a conditional branch instruction per XML byte scanned.
56
57As shown in Figure \ref{corei3_BR} Parabix2 processing is almost branch free. Utilizing a new parallel scanning technique based on bit stream addition, Parabix2 exhibits minimal
58conditional branch dependence on source XML markup density. \ref{corei3_BR} shows displays this lack of data dependence via the constant number of branch
59mispredictions produced for each of the source XML files.
[1048]60% Parabix1 minimize the branches by using parallel bit
61% streams.  Parabix1 still have a few branches for each block of 128
62% bytes (SSE) due to the sequential scanning.  But with the new parallel
63% scanning technique, Parabix2 is essentially branch-free as shown in
64% the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
65% dependency on the markup density of the workloads.
[969]66
67\begin{figure}
68\begin{center}
[1004]69\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
[969]70\end{center}
[1107]71\caption{Branches on \CITHREE\ (y-axis: Branches per kB)}
[969]72\label{corei3_BR}
73\end{figure}
74
75\begin{figure}
76\begin{center}
[1004]77\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
[969]78\end{center}
[1107]79\caption{Branch Mispredictions on \CITHREE\ (y-axis: Branch Mispredictions per kB)}
[969]80\label{corei3_BM}
81\end{figure}
82
[1048]83\subsection{SIMD Instructions vs. Total Instructions}
[969]84
[1001]85Parabix gains its performance by using parallel bitstreams, which are
86mostly generated and calculated by SIMD instructions.  The ratio of
87executed SIMD instructions over total instructions indicates the
[1048]88amount of parallel processing we were able to achieve. 
89Using Intel PIN, a dynamic binary instrumentation tool, we gathered the running instruction mix of each XML workload and classified the instructions as either vector (SIMD-based) instructions or non-vector (Non-SIMD-based) instructions.
90Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} shows the
91percentage of SIMD instructions of Parabix1 and Parabix2
92%(Expat and Xerce do not use any SIMD instructions)
93.  For Parabix1, 18\% to 40\%
[1001]94of the executed instructions consists of SIMD instructions.  By using
95bistream addition for parallel scanning, Parabix2 uses 60\% to 80\%
[1048]96SIMD instructions.  Although the resulting ratios are (negatively) proportional to the markup density
97for both Parabix1 and Parabix2, the degradation rate of
98Parabix2 is much lower and thus the performance penalty incurred by
99increasing the markup density is reduced.
100%Expat and Xerce do not use any SIMD instructions and were not included in this portion of the study.
[969]101
[1048]102% Parabix gains its performance by using parallel bitstreams, which are
103% mostly generated and calculated by SIMD instructions.  The ratio of
104% executed SIMD instructions over total instructions indicates the
105% amount of parallel processing we were able to achieve.  We use Intel
106% pin, a dynamic binary instrumentation tool, to gather instruction mix.
107% Then we adds up all the vector instructions that have been executed.
108% Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
109% percentage of SIMD instructions of Parabix1 and Parabix2 (Expat and
110% Xerce do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
111% of the executed instructions consists of SIMD instructions.  By using
112% bistream addition for parallel scanning, Parabix2 uses 60\% to 80\%
113% SIMD instructions.  Although the ratio decrease as the markup density
114% increase for both Parabix1 and Parabix2, the decreasing rate of
115% Parabix2 is much lower and thus the performance degradation caused by
116% increasing markup density is smaller.
117
118
[969]119\begin{figure}
120\begin{center}
[1004]121\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p1.pdf}
[969]122\end{center}
[1107]123\caption{Parabix1 SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions}
[969]124\label{corei3_INS_p1}
125\end{figure}
126
127\begin{figure}
128\begin{center}
[1004]129\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
[969]130\end{center}
[1107]131\caption{Parabix2 SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)}
[969]132\label{corei3_INS_p2}
133\end{figure}
134
135\subsection{CPU Cycles}
136
[1001]137Figure \ref{corei3_TOT} shows the result of the overall performance
[1048]138evaluated as CPU cycles per thousand input bytes.  Parabix1 is 1.5 to
[1001]1392.5 times faster on document-oriented input and 2 to 3 times faster on
140data-oriented input compared with Expat and Xerces.  Parabix2 is 2.5
141to 4 times faster on document-oriented input and 4.5 to 7 times faster
142on data-oriented input.  Traditional parsers can be dramatically
143slowed down by higher markup density while Parabix with parallel
144processing is less affected.  The comparison is not entirely fair for
145Xerces that transcodes input into UTF-16, which typically takes
146several cycles per byte.  However, transcoding using parallel
147bitstreams can be much faster and it takes less than a cycle per byte
[1078]148to transcode ASCI3I files such as road.gml, po.xml and soap.xml
[1001]149\cite{Cameron2008}.
[969]150
151\begin{figure}
152\begin{center}
[1004]153\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
[969]154\end{center}
[1107]155\caption{Processing Time on \CITHREE\ (y-axis: Total CPU Cycles per kB)}
[969]156\label{corei3_TOT}
157\end{figure}
158
159\subsection{Power and Energy}
160There is a growing concern of power consumption and energy efficiency.
[1001]161Chip producers not only work on improving the performance but also
[1048]162have worked hard to develop power efficient chips. We studied the
[1001]163power and energy consumption of Parabix in comparison with Expat and
[1079]164Xerces on \CITHREE{}
[969]165 
[1001]166Figure \ref{corei3_power} shows the average power consumed by the four
[1079]167different parsers.  The average power of \CITHREE\ 530 is about 21 watts.
[1001]168This model released by Intel last year has a good reputation for power
169efficiency.  Parabix2 dominated by SIMD instructions uses only about
1705\% higher power than the other parsers.
[969]171
172\begin{figure}
173\begin{center}
[1004]174\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
[969]175\end{center}
[1079]176\caption{Average Power on \CITHREE\ (watts)}
[969]177\label{corei3_power}
178\end{figure}
179
[1001]180The more interesting trend is energy, Figure \ref{corei3_energy} shows
181the energy consumption of the four different parsers.  Although
[1048]182Parabix2 requires slightly more power (per instruction), its processing time is significantly lower
183and therefore consumes substantially less energy than the other parsers. Parabix2 consumes 50 to 75
[1001]184nJ per byte while Expat and Xerces consumes 80nJ to 320nJ and 140nJ to
185370nJ per byte seperately.
[969]186
187\begin{figure}
188\begin{center}
[1004]189\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
[969]190\end{center}
[1107]191\caption{Energy Consumption on \CITHREE\ ($\mu$J per kB)}
[969]192\label{corei3_energy}
193\end{figure}
194
Note: See TracBrowser for help on using the repository browser.