% Source: docs/HPCA2012/final_ieee/05-corei3.tex @ 1738
% Last change on this file since 1738 was 1738, checked in by lindanl, 8 years ago:
% Figure adjustment and some minor changes

\section{Efficiency of the Parabix-XML Parser}
\label{section:baseline}
In this section we analyze the energy and performance characteristics
of the Parabix-based XML parser against the software XML parsers
Xerces and Expat. For our baseline evaluation, we compare all the XML
parsers on the \CITHREE{}.


%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache Behavior}
The approximate miss penalties on the \CITHREE\ for the L1, L2, and L3 caches are
4, 11, and 36 cycles, respectively. The L1 (32KB) and L2 (256KB) caches
are private per core; the L3 cache (4MB) is shared by all the cores.
Table~\ref{cache_misses} shows the cache misses per kilobyte
of input data. Analytically, the cache misses for the Expat and Xerces
parsers represent a cost of roughly 0.5 cycles per XML byte. This overhead
does not necessarily affect the overall performance of these
parsers, as they also incur overheads related to branch mispredictions.
Compared to Xerces and Expat, the data organization of Parabix-XML significantly
reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$
fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache
utilization helps keep the SIMD units busy by minimizing memory-related stalls
and lowers the overall energy consumption
by reducing the need to access the higher levels of the cache hierarchy.
Using microbenchmarks, we estimated that L1,
L2, and L3 cache misses consume $\sim$8.3~nJ, $\sim$19~nJ, and $\sim$40~nJ,
respectively. On average, with a 1~GB XML file, Expat and Xerces would consume over
0.6~J and 0.9~J, respectively, due to cache misses alone.
%With a 1GB input file, Expat would consume more than 0.6J and Xerces
%would consume 0.9J on cache misses alone.
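
As a rough check on these figures, the per-kB miss counts in
Table~\ref{cache_misses} can be combined with the miss penalties quoted
above. This is only a back-of-the-envelope sketch: it assumes that the
penalties at each level simply add, that misses do not overlap, and
that 1~kB is 1024 bytes. The symbols $C$ and $E$ are introduced here
purely for illustration.
\begin{eqnarray*}
C_{\mathrm{Expat}} & = & 31.7 \times 4 + 12.0 \times 11 + 3.9 \times 36 \approx 399 \mbox{ cycles/kB}, \\
C_{\mathrm{Xerces}} & = & 104.2 \times 4 + 1.7 \times 11 + 0.3 \times 36 \approx 446 \mbox{ cycles/kB}.
\end{eqnarray*}
Dividing by 1024 gives about 0.39 and 0.44 cycles per byte, which is of
the same order as the 0.5 cycle per byte estimate. Applying the
per-miss energy estimates in the same way gives
\begin{eqnarray*}
E_{\mathrm{Expat}} & = & 31.7 \times 8.3 + 12.0 \times 19 + 3.9 \times 40 \approx 647 \mbox{ nJ/kB}, \\
E_{\mathrm{Xerces}} & = & 104.2 \times 8.3 + 1.7 \times 19 + 0.3 \times 40 \approx 909 \mbox{ nJ/kB}.
\end{eqnarray*}
Taking a 1~GB input as roughly $10^{6}$~kB, these correspond to about
0.65~J and 0.91~J.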


\begin{table}[htbp]
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
        & Parabix       & Expat         & Xerces  \\ \hline
L1      & 4.1           & 31.7          & 104.2   \\ \hline
L2      & 0.1           & 12.0          & 1.7     \\ \hline
L3      & 0.03          & 3.9           & 0.3     \\ \hline
\end{tabular}
\end{center}
\caption{Cache Misses per kB of input data}
\label{cache_misses}
\end{table}

\subsection{Branch Mispredictions}
\label{section:XML-branches}
In general, XML parsing performance is limited by branch mispredictions.
Unfortunately, it is difficult to reduce the branch misprediction rate of
traditional XML parsers due to:
(1) the variable-length nature of the syntactic elements contained within XML documents;
(2) control flow that depends on the input data; and
(3) the extensive set of syntax constraints imposed by the XML 1.0/1.1 specifications.
% Branch mispredictions are known
% to significantly degrade XML parsing performance in proportion to the markup density of the source document
% \cite{CameronHerdyLin2008}.
As shown in Figure~\ref{corei3_BR},
Xerces averages up to 13 branches per XML byte processed on high-density
markup. On modern commodity processors, the cost of a single branch
misprediction is on the order of tens of CPU cycles spent restarting the processor
pipeline.

The high misprediction rate in conventional parsers is a significant overhead.
In Parabix-XML, the use of SIMD operations eliminates many branches.
Most conditional branches can be replaced either with
bitwise operations, which can process up to 128 characters' worth of
branches in one operation,
or with a series of logical predicate operations, which are cheaper
to compute since they require only SIMD operations.

As shown in Figure~\ref{corei3_BR},
Parabix-XML is nearly branch free and exhibits minimal dependence on the
source markup density. Specifically, it experiences between 19.5 and
30.7 branch mispredictions per kB of XML data. Conversely, the cost of
branch mispredictions for the Expat parser can be over 7 cycles per
XML byte (see Figure~\ref{corei3_BM}), which exceeds
the average latency of a byte processed by Parabix-XML.
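
To put these counts in perspective, the misprediction cost per byte can
be sketched as $c = m\,p/1024$, where $m$ is the number of
mispredictions per kB and $p$ is the penalty in cycles. Both symbols are
introduced here for illustration, and $p = 20$ is an assumed value
chosen only because the penalty is on the order of tens of cycles; it is
not a measured figure. For Parabix-XML this gives
\[
c_{\mathrm{Parabix}} \approx \frac{19.5 \times 20}{1024} \mbox{ to } \frac{30.7 \times 20}{1024}
\approx 0.38 \mbox{ to } 0.60 \mbox{ cycles/byte}.
\]
Under the same assumption, a cost of 7 cycles per byte for Expat would
correspond to roughly $7 \times 1024 / 20 \approx 358$ mispredictions
per kB. The actual penalty may differ, so these values are indicative
only.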


\begin{figure}
\begin{center}
{
\subfigure[Branch Instructions / kB]{
\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
\label{corei3_BR}
}
\hfill
\subfigure[Branch Misses / kB]{
\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
\label{corei3_BM}
}
}
\end{center}
\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
\end{figure}

\subsection{SIMD Instructions vs. Total Instructions}

In Parabix-XML, the ratio of retired SIMD instructions to total
instructions provides insight into the relative degree to which
Parabix-XML achieves parallelism over the byte-at-a-time approach.
Using the Intel Pin tool, we gathered the dynamic instruction mix for
each XML workload and classified the instructions as either SIMD
or non-SIMD.  Table~\ref{corei3_INS_p2} shows the
percentage of SIMD instructions in the Parabix-XML parser.
The resulting instruction mix consists of roughly 60\% to 80\% SIMD
instructions. The markup density of the files influences the number of
scalar instructions needed to handle the tag processing, which affects
the overall parallelism that can be extracted by Parabix.  We find
that the degradation rate is low and thus the performance
penalty incurred by increasing the markup density is minimal.
%Expat and Xerces do not use any SIMD instructions and were not
%included in this portion of the study.

% Parabix gains its performance by using parallel bit streams, which
% are mostly generated and calculated by SIMD instructions.  We use Intel
% pin, a dynamic binary instrumentation tool, to gather instruction
% mix.  Then we add up all the vector instructions that have been
% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
% show the percentage of SIMD instructions of Parabix1 and Parabix
% (Expat and Xerces do not use any SIMD instructions).  For Parabix1,
% 18\% to 40\% of the executed instructions consists of SIMD
% instructions.  By using bitstream addition for parallel scanning,
% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
% decreases as the markup density increases for both Parabix1 and
% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
% performance degradation caused by increasing markup density is
% smaller.

\subsection{CPU Cycles}

Figure~\ref{corei3_TOT} shows overall parser performance in
terms of CPU cycles per kB. Parabix-XML is 2.5$\times$
to 4$\times$ faster on document-oriented input and 4.5$\times$ to 7$\times$ faster
on data-oriented input.  Traditional parsers can be dramatically
slowed by dense markup, but Parabix-XML is relatively unaffected.
Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
processing it; this requires several cycles per byte. In contrast,
transcoding using parallel bit streams is significantly faster and
requires less than a single cycle per byte.

\begin{table}[htbp]
\begin{center}
{
\begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
\hline
File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline
SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline
Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%   \\ \hline
\end{tabular}
}
\end{center}
\caption{SIMD Instruction Percentage}
\label{corei3_INS_p2}
\end{table}


\begin{figure}[htbp]
\begin{center}
{
\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
}
\end{center}
\caption{Performance (CPU Cycles per kB)}
\label{corei3_TOT}
\end{figure}

\subsection{Power and Energy}
In this section, we study the power and energy consumption of Parabix-XML
in comparison with Expat and Xerces on the \CITHREE{}.
Figure~\ref{corei3_power} shows the
average power consumed by each parser. Parabix-XML, dominated by SIMD
instructions, uses $\sim$5\% additional power. While the
SIMD functional units are significantly wider than their scalar
counterparts, register width and functional unit power account for only
a small fraction of the overall power consumption in a processor
pipeline. More importantly, by using data-parallel operations, Parabix
amortizes the fetch and data access overheads. This results in a minimal
power increase compared to the conventional parsers.  The
energy results shown in Figure~\ref{corei3_energy} reveal a more
interesting trend: Parabix consumes substantially less energy than the
other parsers. Parabix consumes 50 to 75~nJ per byte, while Expat and
Xerces consume 80 to 320~nJ and 140 to 370~nJ per byte, respectively.
Although Parabix requires slightly more power (per instruction), its
processing time is significantly lower, so its total energy consumption is lower.
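
Since energy is the product of average power and execution time, this
relation can be sketched as follows. The symbols $P$, $t$, and $s$ are
introduced here for illustration, and the estimate assumes that the
$\sim$5\% power increase and the 2.5$\times$ to 7$\times$ speedups
reported above hold uniformly across inputs, which the measurements need
not satisfy.
\[
\frac{E_{\mathrm{Parabix}}}{E_{\mathrm{conv}}}
= \frac{P_{\mathrm{Parabix}}}{P_{\mathrm{conv}}} \cdot \frac{t_{\mathrm{Parabix}}}{t_{\mathrm{conv}}}
\approx \frac{1.05}{s}, \qquad 2.5 \le s \le 7 .
\]
This gives an energy ratio between roughly $1.05/2.5 = 0.42$ and
$1.05/7 = 0.15$; it is an illustrative estimate, not a measurement.
For comparison with Figure~\ref{corei3_energy}, 1~nJ per byte equals
1.024~$\mu$J per kB when 1~kB is taken as 1024 bytes, so 50 to 75~nJ per
byte corresponds to about 51 to 77~$\mu$J per kB.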


\begin{figure}
\begin{center}
{
\subfigure[Avg. Power (Watts)]{
\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
\label{corei3_power}
}
\hfill
\subfigure[Energy Consumption ($\mu$J per kB)]{
\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
\label{corei3_energy}
}
}
\end{center}
\caption{Power profile of Parabix on \CITHREE{}}
\end{figure}