source: docs/HPCA2012/final_ieee/05-corei3.tex @ 3889

Last change on this file since 3889 was 1783, checked in by ashriram, 7 years ago

Final pass

File size: 8.7 KB
Line 
1\section{Efficiency of Parabix-XML}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix XML parser against the software XML parsers,
5Xerces and Expat, each applied to the problem
6of XML well-formedness checking. For our baseline evaluation, we compare all the XML
7parsers on the \CITHREE{}.
8
9
10%some of the numbers are roughly calculated, needs to be recalculated for final version
11\subsection{Cache behavior}
12
13
14
15Table \ref{cache_misses} shows cache misses per kilobyte of input
16data. Analytically, the cache misses for the Expat and Xerces parsers
17represent a 0.5 cycle per XML byte cost.\footnote{The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
184, 11, and 36 cycles respectively.} 
19
20
21
22\begin{table}[!htbp]
23\begin{center}
24\begin{tabular}{|c|c|c|c|}
25\hline
26        & Parabix       & Expat         & Xerces  \\ \hline
27L1      & 4.1           & 31.7          & 104.2   \\ \hline
28L2      & 0.1           & 12.0          & 1.7     \\ \hline
29L3      & 0.03          & 3.9           & 0.3     \\ \hline
30\end{tabular}
31\end{center}
32\caption{Cache Misses per kB of input data} 
33\label{cache_misses}
34\end{table}
35
36
37
38This overhead has little impact on the overall performance of these parsers
39as they experience additional overheads related to branch mispredictions.
40When compared with Xerces and Expat, the data organization of Parabix-XML
41significantly reduces the overall cache miss rate; specifically, there
42were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to
43the next best parser tested. The improved cache utilization helps keep
44the SIMD units busy by minimizing memory-related stalls and lowers the
45overall energy consumption by reducing the need to access the higher
46levels of the cache hierarchy.  Using microbenchmarks, we estimated
47that the L1, L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ,
48and $\sim$40nJ respectively. On average, with a 1GB XML file, Expat
49and Xerces would consume over 0.6J and 0.9J respectively due to cache
50misses alone.
51%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
52%would consume 0.9J on cache misses alone.
53
54
55
56\subsection{Branch Mispredictions}
57\label{section:XML-branches}
58% Branch mispredictions are known
59% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
60% \cite{CameronHerdyLin2008}.
61
62The performance of traditional parsers is limited by their branch
63behavior.  Xerces experiences up to 13 branches per input XML
64character on the high markup files whereas Expat experiences up to 8.
65In Parabix-XML, the use of SIMD operations eliminates a significant proportion
66of the overall branches.  Most conditional branches can be replaced
67with bitwise operations, which can process up to 128 characters worth
68of branches with one operation or with a series of logical predicate
69operations, which are cheaper to compute since they require only SIMD
70operations.
71
72
73The high branch misprediction rate of conventional parsers is a
74significant overhead, with the cost of a single branch misprediction
75on the order of 10s of CPU cycles spent to restart the processor
76pipeline. Parabix-XML is nearly branch free and exhibits minimal
77dependence on the source markup density. Specifically, our study
78demonstrates that Parabix experiences between 19.5 and
7930.7 branch mispredictions per kB of XML data. Conversely,
80the cost of branch mispredictions for the Expat parser
81can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
82--- which exceeds the average latency of a byte processed by
83Parabix-XML.
84
85The branch misprediction rate of traditional XML parsers is difficult to reduce due to
86a number of factors: (1) the variable length nature of
87the syntactic elements contained within XML documents; (2) input data
88dependent characteristic, and (3) the extensive set of syntax
89constraints imposed by the XML specifications.
90
91
92
93\begin{figure}[!h]
94\begin{center}
95{
96%\subfigure[Branch Instructions / kB]{
97%\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
98%\label{corei3_BR}
99%}
100%\hfill
101%\subfigure[Branch Misses / kB]{
102\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
103}
104\end{center}
105\caption{Branch Mispredictions on the \CITHREE{} per kB input}
106\label{corei3_BM}
107
108%}
109\end{figure}
110
111\subsection{SIMD Instructions vs. Total Instructions}
112
113In Parabix-XML, the ratio of retired SIMD instructions to total
114instructions provides insight into the relative degree to which
115Parabix-XML achieves parallelism over the byte-at-a-time approach.
116Using the Intel Pin tool, we gathered the dynamic instruction mix for
117each XML workload and classified the instructions as either SIMD
118or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
119percentage of SIMD instructions in the Parabix-XML parser.
120The ratio of executed SIMD instructions over total instructions indicates
121the amount of available parallelism.
122The resulting instruction mix consists of 60\% to 80\% SIMD
123instructions. The markup density of the files influence the number of
124scalar instructions needed to handle the tag processing which affects
125the overall parallelism that can be extracted by Parabix.  We find
126that degradation rate is low and thus the performance
127penalty incurred by increasing the markup density is minimal.
128%Expat and Xerce do not use any SIMD instructions and were not
129%included in this portion of the study.
130
131% Parabix gains its performance by using parallel bit streams, which
132% are mostly generated and calculated by SIMD instructions.  We use Intel
133% pin, a dynamic binary instrumentation tool, to gather instruction
134% mix.  Then we adds up all the vector instructions that have been
135% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
136% show the percentage of SIMD instructions of Parabix1 and Parabix
137% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
138% 18\% to 40\% of the executed instructions consists of SIMD
139% instructions.  By using bistream addition for parallel scanning,
140% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
141% decrease as the markup density increase for both Parabix1 and
142% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
143% performance degradation caused by increasing markup density is
144% smaller.
145
146\begin{table}[htbp]
147\begin{center}
148{
149\begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
150\hline
151File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline   
152SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline   
153Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%
154 \\ \hline
155\end{tabular}
156}
157\end{center}
158\caption{SIMD Instruction Percentage} 
159\label{corei3_INS_p2} 
160\end{table}
161
162
163
164\subsection{Performance and Energy Characteristics}
165
166\begin{figure*}[!htbp]
167\begin{center}
168\subfigure[Performance (CPU Cycles per kB)]{
169\includegraphics[width=0.45\textwidth]{plots/corei3_TOT.pdf}
170\label{corei3_TOT}
171}
172\subfigure[Energy Consumption ($\mu$J per kB)]{
173\includegraphics[width=0.45\textwidth]{plots/corei3_energy.pdf}
174\label{corei3_energy}
175}
176\caption{Performance and Energy profile of Parabix on Core i3} 
177\end{center}
178\end{figure*}
179
180Figure \ref{corei3_TOT} shows overall parser performance in
181terms of CPU cycles per kB. Parabix-XML  is 2.5
182to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
183on data-oriented input.  Traditional parsers can be dramatically
184slowed by dense markup but Parabix-XML is relatively unaffected.
185Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 prior to
186processing; this requires several cycles per byte. However,
187transcoding using parallel bit streams is significantly faster and
188requires less than a single cycle per byte.
189
190
191Parabix consumes substantially less energy (see
192Figure \ref{corei3_energy} ) than
193the other parsers. Parabix consumes 50 to 75 nJ per byte while Expat
194and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte
195respectively. Parabix-XML experiences minimal increase in power
196consumption ($\sim5\%$) as compared to the conventional parsers. While
197the SIMD functional units are significantly wider than the scalar
198counterparts, register width and functional unit power account only
199for a small fraction of the overall power consumption in a pipeline
200processor. Parabix amortizes the fetch and data access overheads over
201multiple data parallel operations. Although Parabix requires slightly
202more power (per instruction), the processing time of Parabix is
203significantly lower resulting in an overall improvement in energy.
204The Parabix parser makes efficient use of the processor pipeline which
205minimizes overall energy wastage.
206
207
208
209
Note: See TracBrowser for help on using the repository browser.