source: docs/HPCA2012/final_ieee/05-corei3.tex @ 1768

Last change on this file since 1768 was 1768, checked in by ashriram, 8 years ago

Version sent to Martha

File size: 8.5 KB
Line 
1\section{Efficiency of the Parabix-XML Parser}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behavior}
11
12
13
14Table \ref{cache_misses} shows the cache misses per kilobyte of input
15data. Analytically, the cache misses for the Expat and Xerces parsers
16represent a 0.5 cycle per XML byte cost.\footnote{The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
174, 11, and 36 cycles respectively.} 
18
19
20
21\begin{table}[!htbp]
22\begin{center}
23\begin{tabular}{|c|c|c|c|}
24\hline
25        & Parabix       & Expat         & Xerces  \\ \hline
26L1      & 4.1           & 31.7          & 104.2   \\ \hline
27L2      & 0.1           & 12.0          & 1.7     \\ \hline
28L3      & 0.03          & 3.9           & 0.3     \\ \hline
29\end{tabular}
30\end{center}
31\caption{Cache Misses per kB of input data} 
32\label{cache_misses}
33\end{table}
34
35
36
37This overhead does not
38necessarily impact the overall performance of these parsers as they
39experience additional overheads related to branch mispredictions.
40Compared to Xerces and Expat, the data organization of Parabix-XML
41significantly reduces the overall cache miss rate; specifically, there
42were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to
43the next best parser tested. The improved cache utilization helps keep
44the SIMD units busy by minimizing memory-related stalls and lowers the
45overall energy consumption by reducing the need to access the higher
46levels of the cache hierarchy.  Using microbenchmarks, we estimated
47that the L1, L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ,
48and $\sim$40nJ respectively. On average, with a 1GB XML file, Expat
49and Xerces would consume over 0.6J and 0.9J respectively due to cache
50misses alone.
51%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
52%would consume 0.9J on cache misses alone.
53
54
55
56\subsection{Branch Mispredictions}
57\label{section:XML-branches}
58% Branch mispredictions are known
59% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
60% \cite{CameronHerdyLin2008}.
61
62The performance of traditional parsers is limited by their branch
63behavior.  Xerces experiences up to 13 branches per input XML
64character on the high markup files; Expat experiences up to 8 branches
65per XML character.  In Parabix-XML, the use of SIMD operations
66eliminates many branches.  Most conditional branches can be replaced
67with bitwise operations, which can process up to 128 characters worth
68of branches with one operation or with a series of logical predicate
69operations, which are cheaper to compute since they require only SIMD
70operations.
71
72
73The high miss prediction rate in conventional parsers is a significant
74overhead. The cost of a single branch misprediction is on the order of
7510s of CPU cycles spent to restart the processor pipeline on a
76misprediction. Parabix-XML is nearly branch free and exhibits minimal
77dependence on the source markup density. Specifically, it experiences
78between 19.5 and 30.7 branch mispredictions per kB of XML
79data. Conversely, the cost of branch mispredictions for the Expat
80parser can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
81--- which exceeds the average latency of a byte processed by
82Parabix-XML.
83
84Unfortunately, it is difficult to reduce the branch misprediction rate
85of traditional XML parsers due to: (1) the variable length nature of
86the syntactic elements contained within XML documents; (2) input data
87dependent characteristic, and (3) the extensive set of syntax
88constraints imposed by the XML specifications.
89
90
91
92\begin{figure}[!h]
93\begin{center}
94{
95%\subfigure[Branch Instructions / kB]{
96%\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
97%\label{corei3_BR}
98%}
99%\hfill
100%\subfigure[Branch Misses / kB]{
101\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
102}
103\end{center}
104\caption{Branch Mispredictions on the \CITHREE{}. (/ 1kB input)}
105\label{corei3_BM}
106
107%}
108\end{figure}
109
110\subsection{SIMD Instructions vs. Total Instructions}
111
112In Parabix-XML, the ratio of retired SIMD instructions to total
113instructions provides insight into the relative degree to which
114Parabix-XML achieves parallelism over the byte-at-a-time approach.
115Using the Intel Pin tool, we gathered the dynamic instruction mix for
116each XML workload and classified the instructions as either SIMD
117or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
118percentage of SIMD instructions in the Parabix-XML parser.
119The ratio of executed SIMD instructions over total instructions indicates
120the amount of available parallelism.
121The resulting instruction mix consists of 60\% to 80\% SIMD
122instructions. The markup density of the files influence the number of
123scalar instructions needed to handle the tag processing which affects
124the overall parallelism that can be extracted by Parabix.  We find
125that degradation rate is low and thus the performance
126penalty incurred by increasing the markup density is minimal.
127%Expat and Xerce do not use any SIMD instructions and were not
128%included in this portion of the study.
129
130% Parabix gains its performance by using parallel bit streams, which
131% are mostly generated and calculated by SIMD instructions.  We use Intel
132% pin, a dynamic binary instrumentation tool, to gather instruction
133% mix.  Then we adds up all the vector instructions that have been
134% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
135% show the percentage of SIMD instructions of Parabix1 and Parabix
136% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
137% 18\% to 40\% of the executed instructions consists of SIMD
138% instructions.  By using bistream addition for parallel scanning,
139% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
140% decrease as the markup density increase for both Parabix1 and
141% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
142% performance degradation caused by increasing markup density is
143% smaller.
144
145\begin{table}[htbp]
146\begin{center}
147{
148\begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
149\hline
150File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline   
151SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline   
152Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%
153 \\ \hline
154\end{tabular}
155}
156\end{center}
157\caption{SIMD Instruction Percentage} 
158\label{corei3_INS_p2} 
159\end{table}
160
161
162
163\subsection{Performance and Energy Characteristics}
164
165Figure \ref{corei3_TOT} shows overall parser performance in
166terms of CPU cycles per kB. Parabix-XML  is 2.5
167to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
168on data-oriented input.  Traditional parsers can be dramatically
169slowed by dense markup but Parabix-XML is relatively unaffected.
170Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
171processing it; this requires several cycles per byte. However,
172transcoding using parallel bit streams is significantly faster and
173requires less than a single cycle per byte.
174
175
176 The energy trends shown in Figure \ref{corei3_energy} reveal an
177 interesting trend. Parabix consumes substantially less energy than
178 the other parsers. Parabix consumes 50 to 75 nJ per byte while Expat
179 and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte
180 respectively. Parabix-XML experiences minimal increase in power
181 ($\sim5\%$) compared to the conventional parsers. While the SIMD
182 functional units are significantly wider than the scalar
183 counterparts, register width and functional unit power account only
184 for a small fraction of the overall power consumption in a processor
185 pipeline. Parabix amortizes the fetch and data access overheads over
186 multiple data parallel operations. Although Parabix requires
187 slightly more power (per instruction), the processing time of Parabix
188 is significantly lower resulting in an overall improvement in energy.
189
190\begin{figure*}[!htbp]
191\begin{center}
192\subfigure[Performance (CPU Cycles per kB)]{
193\includegraphics[width=0.45\textwidth]{plots/corei3_TOT.pdf}
194\label{corei3_TOT}
195}
196\subfigure[Energy Consumption ($\mu$J per kB)]{
197\includegraphics[width=0.45\textwidth]{plots/corei3_energy.pdf}
198\label{corei3_energy}
199}
200\caption{Performance and Energy profile of Parabix on Core i3} 
201\end{center}
202\end{figure*}
203
204
Note: See TracBrowser for help on using the repository browser.