source: docs/HPCA2012/05-corei3.tex @ 1415

Last change on this file since 1415 was 1415, checked in by ashriram, 8 years ago

buf fixes up to 7

File size: 8.6 KB
Line 
1\section{Efficiency of the Parabix-XML Parser}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behavior}
11The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
124, 11, and 36 cycles respectively. The L1 (32KB) and L2 cache (256KB)
13are private per core; L3 (4MB) is shared by all the cores.
14Figure \ref{cache_misses} shows the cache misses per kilobyte
15of input data. Analytically, the cache misses for the Expat and Xerces
16parsers represent a 0.5 cycle per XML byte processed. This overhead
17does not necessarily reflect in the overall performance of these
18parsers as they experience other overheads related to branch mispredictions.
19Compared to Xerces and Expat, the data organization of Parabix-XML significantly
20reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$ 
21fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache
22utilization helps keep the SIMD units busy by minimizing memory-related stalls
23and lowers the overall energy consumption
24by reducing the need to access the higher levels of the cache hierarchy.
25Using microbenchmarks, we estimated that the L1,
26L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ
27respectively. On average, with a 1GB XML file, Expat and Xerces would consume over
280.6J and 0.9J respectively due to cache misses alone.
29%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
30%would consume 0.9J on cache misses alone.
31
32
33\begin{figure}[!h]
34\subfigure[L1 Misses]{
35\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
36\label{corei3_L1DM}
37}
38\subfigure[L2 Misses]{
39\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
40\label{corei3_L2DM}
41}
42\subfigure[L3 Misses]{
43\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
44\label{corei3_L3DM}
45}
46\caption{Cache Misses per kB of input data.}
47\label{cache_misses}
48\end{figure}
49
50\subsection{Branch Mispredictions}
51\label{section:XML-branches}
52It is hard to handle branch mispredictions in traditional
53XML parsers due to: (1) the variable length nature of the syntactic
54elements contained within XML documents; (2) a data dependent
55characteristic, and (3) the extensive set of syntax constraints
56imposed by the XML 1.0/1.1 specifications.
57% Branch mispredictions are known
58% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
59% \cite{CameronHerdyLin2008}.
60As shown in Figure \ref{corei3_BR},
61Xerces averages up to 13 branches per XML byte processed on high density
62markup. On modern commodity processors the cost of a single branch
63misprediction is incur over 10s of CPU cycles to restart the processor
64pipeline.
65The high miss prediction rate in conventional parsers is a significant overhead.
66In Parabix-XML, the use of SIMD operations eliminates many branches.
67Most conditional branches can be replaced with
68bitwise operations, which can process up to 128 characters worth of
69branches with one operation
70or with a series of logical predicate operations, which are cheaper
71to compute since they require only SIMD operations.
72
73As shown in Figure \ref{corei3_BR},
74Parabix-XML is nearly branch free and exhibits minimal dependence on the
75source markup density. Specifically, it experiences between 19.5 and
7630.7 branch mispredictions per kB of XML data. Conversely, the cost of
77branch mispredictions for the Expat parser can be over 7 cycles per
78XML byte (see Figure \ref{corei3_BM}) --- which exceeds
79the average latency of a byte processed by Parabix-XML.
80
81
82
83
84\begin{figure}
85\subfigure[Branch Instructions / kB]{
86\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
87\label{corei3_BR}
88}
89\hfill
90\subfigure[Branch Misses / kB]{
91\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
92\label{corei3_BM}
93}
94\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
95\end{figure}
96
97\subsection{SIMD Instructions vs. Total Instructions}
98
99In Parabix-XML, the ratio of retired SIMD instructions to total
100instructions provides insight into the relative degree to which
101Parabix-XML achieves parallelism over the byte-at-a-time approach.
102Using the Intel Pin tool, we gathered the dynamic instruction mix for
103each XML workload and classified the instructions as either SIMD
104or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
105percentage of SIMD instructions in the Parabix-XML parser.
106The ratio of executed SIMD instructions over total instructions indicates
107the amount of available parallelism.
108The resulting instruction mix consists of 60\% to 80\% SIMD
109instructions. The markup density of the files influence the number of
110scalar instructions needed to handle the tag processing which affects
111the overall parallelism that can be extracted by Parabix.  We find
112that degradation rate is low and thus the performance
113penalty incurred by increasing the markup density is minimal.
114%Expat and Xerce do not use any SIMD instructions and were not
115%included in this portion of the study.
116
117% Parabix gains its performance by using parallel bit streams, which
118% are mostly generated and calculated by SIMD instructions.  We use Intel
119% pin, a dynamic binary instrumentation tool, to gather instruction
120% mix.  Then we adds up all the vector instructions that have been
121% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
122% show the percentage of SIMD instructions of Parabix1 and Parabix
123% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
124% 18\% to 40\% of the executed instructions consists of SIMD
125% instructions.  By using bistream addition for parallel scanning,
126% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
127% decrease as the markup density increase for both Parabix1 and
128% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
129% performance degradation caused by increasing markup density is
130% smaller.
131
132\subsection{CPU Cycles}
133
134Figure \ref{corei3_TOT} shows overall parser performance in
135terms of CPU cycles per kB. Parabix-XML  is 2.5
136to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
137on data-oriented input.  Traditional parsers can be dramatically
138slowed by dense markup but Parabix-XML is relatively unaffected.
139Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
140processing it; this requires several cycles per byte. However,
141transcoding using parallel bit streams is significantly faster and
142requires less than a single cycle per byte.
143
144\begin{figure}[htbp]
145\begin{minipage}{0.5\linewidth}
146\centering
147\includegraphics[width=\textwidth]{plots/corei3_INS_p2.pdf}
148\caption{SIMD Instruction Percentage}
149\label{corei3_INS_p2}
150\end{minipage}%
151\hfill
152\begin{minipage}{0.5\linewidth}
153\centering
154\includegraphics[width=\textwidth]{plots/corei3_TOT.pdf}
155\caption{Performance (CPU Cycles per kB)}
156\label{corei3_TOT}
157\end{minipage}
158\end{figure}
159
160
161
162\subsection{Power and Energy}
163In this section, we study the power and energy consumption of Parabix-XML
164in comparison with Expat and Xerces on \CITHREE{}.
165Figure \ref{corei3_power} shows the
166average power consumed by each parser. Parabix-XML, dominated by SIMD
167instructions, uses $\sim5\%$ additional power. While the
168SIMD functional units are significantly wider than the scalar
169counterparts, register width and functional unit power account only
170for a small fraction of the overall power consumption in a processor
171pipeline. More importantly by using data parallel operations Parabix
172amortizes the fetch and data access overheads. This results in minimal
173power increase compared to the conventional parsers.  Perhaps the
174energy trends shown in Figure \ref{corei3_energy} reveal an
175interesting trend. Parabix consumes substantially less energy than the
176other parsers. Parabix consumes 50 to 75 nJ per byte while Expat and
177Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
178Although Parabix requires slightly more power (per instruction), the
179processing time of Parabix is significantly lower.
180
181
182
183
184
185
186\begin{figure}
187\subfigure[Avg. Power (Watts)]{
188\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
189\label{corei3_power}
190}
191\hfill
192\subfigure[Energy Consumption ($\mu$J per kB)]{
193\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
194\label{corei3_energy}
195}
196\caption{Power profile of Parabix on \CITHREE{}}
197\end{figure}
198
199
Note: See TracBrowser for help on using the repository browser.