source: docs/HPCA2012/final_ieee/05-corei3.tex @ 1775

Last change on this file since 1775 was 1775, checked in by cameron, 8 years ago

Minor fixes; figure placement

File size: 8.6 KB
Line 
1\section{Efficiency of the Parabix-XML Parser}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behavior}
11
12
13
14Table \ref{cache_misses} shows cache misses per kilobyte of input
15data. Analytically, the cache misses for the Expat and Xerces parsers
16represent a 0.5 cycle per XML byte cost.\footnote{The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
174, 11, and 36 cycles respectively.} 
18
19
20
21\begin{table}[!htbp]
22\begin{center}
23\begin{tabular}{|c|c|c|c|}
24\hline
25        & Parabix       & Expat         & Xerces  \\ \hline
26L1      & 4.1           & 31.7          & 104.2   \\ \hline
27L2      & 0.1           & 12.0          & 1.7     \\ \hline
28L3      & 0.03          & 3.9           & 0.3     \\ \hline
29\end{tabular}
30\end{center}
31\caption{Cache Misses per kB of input data} 
32\label{cache_misses}
33\end{table}
34
35
36
37This overhead has little impact on the overall performance of these parsers
38as they experience additional overheads related to branch mispredictions.
39Wne compared with Xerces and Expat, the data organization of Parabix-XML
40significantly reduces the overall cache miss rate; specifically, there
41were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to
42the next best parser tested. The improved cache utilization helps keep
43the SIMD units busy by minimizing memory-related stalls and lowers the
44overall energy consumption by reducing the need to access the higher
45levels of the cache hierarchy.  Using microbenchmarks, we estimated
46that the L1, L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ,
47and $\sim$40nJ respectively. On average, with a 1GB XML file, Expat
48and Xerces would consume over 0.6J and 0.9J respectively due to cache
49misses alone.
50%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
51%would consume 0.9J on cache misses alone.
52
53
54
55\subsection{Branch Mispredictions}
56\label{section:XML-branches}
57% Branch mispredictions are known
58% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
59% \cite{CameronHerdyLin2008}.
60
61The performance of traditional parsers is limited by their branch
62behavior.  Xerces experiences up to 13 branches per input XML
63character on the high markup files whereas Expat experiences up to 8.
64In Parabix-XML, the use of SIMD operations eliminates a significant proportion
65of the overall branches.  Most conditional branches can be replaced
66with bitwise operations, which can process up to 128 characters worth
67of branches with one operation or with a series of logical predicate
68operations, which are cheaper to compute since they require only SIMD
69operations.
70
71
72The high branch misprediction rate of conventional parsers is a
73significant overhead, with the cost of a single branch mispredic-
74tion on the order of 10s of CPU cycles spent to restart the processor
75pipeline. Parabix-XML is nearly branch free and exhibits minimal
76dependence on the source markup density. Specifically, our study
77demonstrates that Parabix experiences between 19.5 and
7830.7 branch mispredictions per kB of XML data. Conversely,
79the cost of branch mispredictions for the Expat parser
80can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
81--- which exceeds the average latency of a byte processed by
82Parabix-XML.
83
84The branch misprediction rate of traditional XML parsers is difficult to reduce due to
85a number of factors: (1) the variable length nature of
86the syntactic elements contained within XML documents; (2) input data
87dependent characteristic, and (3) the extensive set of syntax
88constraints imposed by the XML specifications.
89
90
91
92\begin{figure}[!h]
93\begin{center}
94{
95%\subfigure[Branch Instructions / kB]{
96%\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
97%\label{corei3_BR}
98%}
99%\hfill
100%\subfigure[Branch Misses / kB]{
101\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
102}
103\end{center}
104\caption{Branch Mispredictions on the \CITHREE{} per kB input}
105\label{corei3_BM}
106
107%}
108\end{figure}
109
110\subsection{SIMD Instructions vs. Total Instructions}
111
112In Parabix-XML, the ratio of retired SIMD instructions to total
113instructions provides insight into the relative degree to which
114Parabix-XML achieves parallelism over the byte-at-a-time approach.
115Using the Intel Pin tool, we gathered the dynamic instruction mix for
116each XML workload and classified the instructions as either SIMD
117or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
118percentage of SIMD instructions in the Parabix-XML parser.
119The ratio of executed SIMD instructions over total instructions indicates
120the amount of available parallelism.
121The resulting instruction mix consists of 60\% to 80\% SIMD
122instructions. The markup density of the files influence the number of
123scalar instructions needed to handle the tag processing which affects
124the overall parallelism that can be extracted by Parabix.  We find
125that degradation rate is low and thus the performance
126penalty incurred by increasing the markup density is minimal.
127%Expat and Xerce do not use any SIMD instructions and were not
128%included in this portion of the study.
129
130% Parabix gains its performance by using parallel bit streams, which
131% are mostly generated and calculated by SIMD instructions.  We use Intel
132% pin, a dynamic binary instrumentation tool, to gather instruction
133% mix.  Then we adds up all the vector instructions that have been
134% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
135% show the percentage of SIMD instructions of Parabix1 and Parabix
136% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
137% 18\% to 40\% of the executed instructions consists of SIMD
138% instructions.  By using bistream addition for parallel scanning,
139% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
140% decrease as the markup density increase for both Parabix1 and
141% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
142% performance degradation caused by increasing markup density is
143% smaller.
144
145\begin{table}[htbp]
146\begin{center}
147{
148\begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
149\hline
150File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline   
151SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline   
152Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%
153 \\ \hline
154\end{tabular}
155}
156\end{center}
157\caption{SIMD Instruction Percentage} 
158\label{corei3_INS_p2} 
159\end{table}
160
161
162
163\subsection{Performance and Energy Characteristics}
164
165\begin{figure*}[!htbp]
166\begin{center}
167\subfigure[Performance (CPU Cycles per kB)]{
168\includegraphics[width=0.45\textwidth]{plots/corei3_TOT.pdf}
169\label{corei3_TOT}
170}
171\subfigure[Energy Consumption ($\mu$J per kB)]{
172\includegraphics[width=0.45\textwidth]{plots/corei3_energy.pdf}
173\label{corei3_energy}
174}
175\caption{Performance and Energy profile of Parabix on Core i3} 
176\end{center}
177\end{figure*}
178
179Figure \ref{corei3_TOT} shows overall parser performance in
180terms of CPU cycles per kB. Parabix-XML  is 2.5
181to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
182on data-oriented input.  Traditional parsers can be dramatically
183slowed by dense markup but Parabix-XML is relatively unaffected.
184Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 prior to
185processing; this requires several cycles per byte. However,
186transcoding using parallel bit streams is significantly faster and
187requires less than a single cycle per byte.
188
189
190 The energy trends shown in Figure \ref{corei3_energy} reveal an
191 interesting result. Parabix consumes substantially less energy than
192 the other parsers. Parabix consumes 50 to 75 nJ per byte while Expat
193 and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte
194 respectively. Parabix-XML experiences minimal increase in power consumption
195 ($\sim5\%$) as compared to the conventional parsers. While the SIMD
196 functional units are significantly wider than the scalar
197 counterparts, register width and functional unit power account only
198 for a small fraction of the overall power consumption in a pipeline
199 processor. Parabix amortizes the fetch and data access overheads over
200 multiple data parallel operations. Although Parabix requires
201 slightly more power (per instruction), the processing time of Parabix
202 is significantly lower resulting in an overall improvement in energy.
203
204
205
Note: See TracBrowser for help on using the repository browser.