source: docs/HPCA2012/final_ieee/05-corei3.tex @ 1744

Last change on this file since 1744 was 1744, checked in by lindanl, 8 years ago

Minor Changes

File size: 8.5 KB
Line 
1\section{Efficiency of the Parabix-XML Parser}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behavior}
11
12
13\begin{table}[!htbp]
14\begin{center}
15\begin{tabular}{|c|c|c|c|}
16\hline
17        & Parabix       & Expat         & Xerces  \\ \hline
18L1      & 4.1           & 31.7          & 104.2   \\ \hline
19L2      & 0.1           & 12.0          & 1.7     \\ \hline
20L3      & 0.03          & 3.9           & 0.3     \\ \hline
21\end{tabular}
22\end{center}
23\caption{Cache Misses per kB of input data} 
24\label{cache_misses}
25\end{table}
26
27
28Table \ref{cache_misses} shows the cache misses per kilobyte of input
29data. Analytically, the cache misses for the Expat and Xerces parsers
30represent a 0.5 cycle per XML byte cost.\footnote{The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
314, 11, and 36 cycles respectively.} 
32
33
34
35
36This overhead does not
37necessarily impact the overall performance of these parsers as they
38experience additional overheads related to branch mispredictions.
39Compared to Xerces and Expat, the data organization of Parabix-XML
40significantly reduces the overall cache miss rate; specifically, there
41were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to
42the next best parser tested. The improved cache utilization helps keep
43the SIMD units busy by minimizing memory-related stalls and lowers the
44overall energy consumption by reducing the need to access the higher
45levels of the cache hierarchy.  Using microbenchmarks, we estimated
46that the L1, L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ,
47and $\sim$40nJ respectively. On average, with a 1GB XML file, Expat
48and Xerces would consume over 0.6J and 0.9J respectively due to cache
49misses alone.
50%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
51%would consume 0.9J on cache misses alone.
52
53
54
55\subsection{Branch Mispredictions}
56\label{section:XML-branches}
57% Branch mispredictions are known
58% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
59% \cite{CameronHerdyLin2008}.
60
61The performance of traditional parsers is limited by their branch
62behavior.  Xerces experiences up to 13 branches per input XML
63character on the high markup files; Expat experiences up to 8 branches
64per XML character.  In Parabix-XML, the use of SIMD operations
65eliminates many branches.  Most conditional branches can be replaced
66with bitwise operations, which can process up to 128 characters worth
67of branches with one operation or with a series of logical predicate
68operations, which are cheaper to compute since they require only SIMD
69operations.
70
71
72The high miss prediction rate in conventional parsers is a significant
73overhead. The cost of a single branch misprediction is on the order of
7410s of CPU cycles spent to restart the processor pipeline on a
75misprediction. Parabix-XML is nearly branch free and exhibits minimal
76dependence on the source markup density. Specifically, it experiences
77between 19.5 and 30.7 branch mispredictions per kB of XML
78data. Conversely, the cost of branch mispredictions for the Expat
79parser can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
80--- which exceeds the average latency of a byte processed by
81Parabix-XML.
82
83Unfortunately, it is difficult to reduce the branch misprediction rate
84of traditional XML parsers due to: (1) the variable length nature of
85the syntactic elements contained within XML documents; (2) input data
86dependent characteristic, and (3) the extensive set of syntax
87constraints imposed by the XML specifications.
88
89
90
91\begin{figure}[!h]
92\begin{center}
93{
94%\subfigure[Branch Instructions / kB]{
95%\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
96%\label{corei3_BR}
97%}
98%\hfill
99%\subfigure[Branch Misses / kB]{
100\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
101}
102\end{center}
103\caption{Branch Mispredictions on the \CITHREE{}. (/ 1kB input)}
104\label{corei3_BM}
105
106%}
107\end{figure}
108
109\subsection{SIMD Instructions vs. Total Instructions}
110
111In Parabix-XML, the ratio of retired SIMD instructions to total
112instructions provides insight into the relative degree to which
113Parabix-XML achieves parallelism over the byte-at-a-time approach.
114Using the Intel Pin tool, we gathered the dynamic instruction mix for
115each XML workload and classified the instructions as either SIMD
116or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
117percentage of SIMD instructions in the Parabix-XML parser.
118The ratio of executed SIMD instructions over total instructions indicates
119the amount of available parallelism.
120The resulting instruction mix consists of 60\% to 80\% SIMD
121instructions. The markup density of the files influence the number of
122scalar instructions needed to handle the tag processing which affects
123the overall parallelism that can be extracted by Parabix.  We find
124that degradation rate is low and thus the performance
125penalty incurred by increasing the markup density is minimal.
126%Expat and Xerce do not use any SIMD instructions and were not
127%included in this portion of the study.
128
129% Parabix gains its performance by using parallel bit streams, which
130% are mostly generated and calculated by SIMD instructions.  We use Intel
131% pin, a dynamic binary instrumentation tool, to gather instruction
132% mix.  Then we adds up all the vector instructions that have been
133% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
134% show the percentage of SIMD instructions of Parabix1 and Parabix
135% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
136% 18\% to 40\% of the executed instructions consists of SIMD
137% instructions.  By using bistream addition for parallel scanning,
138% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
139% decrease as the markup density increase for both Parabix1 and
140% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
141% performance degradation caused by increasing markup density is
142% smaller.
143
144\begin{table}[htbp]
145\begin{center}
146{
147\begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
148\hline
149File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline   
150SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline   
151Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%
152 \\ \hline
153\end{tabular}
154}
155\end{center}
156\caption{SIMD Instruction Percentage} 
157\label{corei3_INS_p2} 
158\end{table}
159
160
161
162\subsection{Performance and Energy Characteristics}
163
164Figure \ref{corei3_TOT} shows overall parser performance in
165terms of CPU cycles per kB. Parabix-XML  is 2.5
166to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
167on data-oriented input.  Traditional parsers can be dramatically
168slowed by dense markup but Parabix-XML is relatively unaffected.
169Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
170processing it; this requires several cycles per byte. However,
171transcoding using parallel bit streams is significantly faster and
172requires less than a single cycle per byte.
173
174
175 The energy trends shown in Figure \ref{corei3_energy} reveal an
176 interesting trend. Parabix consumes substantially less energy than
177 the other parsers. Parabix consumes 50 to 75 nJ per byte while Expat
178 and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte
179 respectively. Parabix-XML experiences minimal increase in power
180 ($\sim5\%$) compared to the conventional parsers. While the SIMD
181 functional units are significantly wider than the scalar
182 counterparts, register width and functional unit power account only
183 for a small fraction of the overall power consumption in a processor
184 pipeline. Parabix amortizes the fetch and data access overheads over
185 multiple data parallel operations. Although Parabix requires
186 slightly more power (per instruction), the processing time of Parabix
187 is significantly lower resulting in an overall improvement in energy.
188
189\begin{figure*}[!htbp]
190\begin{center}
191\subfigure[Performance (CPU Cycles per kB)]{
192\includegraphics[width=0.45\textwidth]{plots/corei3_TOT.pdf}
193\label{corei3_TOT}
194}
195\subfigure[Energy Consumption ($\mu$J per kB)]{
196\includegraphics[width=0.45\textwidth]{plots/corei3_energy.pdf}
197\label{corei3_energy}
198}
199\caption{Performance and Energy profile of Parabix on Core i3} 
200\end{center}
201\end{figure*}
202
203
Note: See TracBrowser for help on using the repository browser.