source: docs/HPCA2012/05-corei3.tex @ 1644

Last change on this file since 1644 was 1644, checked in by ksherdy, 7 years ago

Minor edit.

File size: 8.7 KB
Line 
1\section{Efficiency of the Parabix-XML Parser}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behavior}
11The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
124, 11, and 36 cycles respectively. The L1 (32KB) and L2 cache (256KB)
13are private per core; L3 (4MB) is shared by all the cores.
14Figure \ref{cache_misses} shows the cache misses per kilobyte
15of input data. Analytically, the cache misses for the Expat and Xerces
16parsers represent a 0.5 cycle per XML byte cost. This overhead
17does not necessarily reflect in the overall performance of these
18parsers as they experience other overheads related to branch mispredictions.
19Compared to Xerces and Expat, the data organization of Parabix-XML significantly
20reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$ 
21fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache
22utilization helps keep the SIMD units busy by minimizing memory-related stalls
23and lowers the overall energy consumption
24by reducing the need to access the higher levels of the cache hierarchy.
25Using microbenchmarks, we estimated that the L1,
26L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ
27respectively. On average, with a 1GB XML file, Expat and Xerces would consume over
280.6J and 0.9J respectively due to cache misses alone.
29%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
30%would consume 0.9J on cache misses alone.
31
32
33\begin{figure}[!h]
34\subfigure[L1 Misses]{
35\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
36\label{corei3_L1DM}
37}
38\subfigure[L2 Misses]{
39\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
40\label{corei3_L2DM}
41}
42\subfigure[L3 Misses]{
43\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
44\label{corei3_L3DM}
45}
46\caption{Cache Misses per kB of input data.}
47\label{cache_misses}
48\end{figure}
49
50\subsection{Branch Mispredictions}
51\label{section:XML-branches}
52In general, performance is limited by branch mispredictions.
53Unfortunately, it is difficult to reduce the branch misprediction rate of
54traditional XML parsers due to:
55(1) the variable length nature of the syntactic elements contained within XML documents;
56(2) a data dependent characteristic, and
57(3) the extensive set of syntax constraints imposed by the XML 1.0/1.1 specifications.
58% Branch mispredictions are known
59% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
60% \cite{CameronHerdyLin2008}.
61As shown in Figure \ref{corei3_BR},
62Xerces averages up to 13 branches per XML byte processed on high density
63markup. On modern commodity processors the cost of a single branch
64misprediction is incur over 10s of CPU cycles to restart the processor
65pipeline.
66The high miss prediction rate in conventional parsers is a significant overhead.
67In Parabix-XML, the use of SIMD operations eliminates many branches.
68Most conditional branches can be replaced with
69bitwise operations, which can process up to 128 characters worth of
70branches with one operation
71or with a series of logical predicate operations, which are cheaper
72to compute since they require only SIMD operations.
73
74As shown in Figure \ref{corei3_BR},
75Parabix-XML is nearly branch free and exhibits minimal dependence on the
76source markup density. Specifically, it experiences between 19.5 and
7730.7 branch mispredictions per kB of XML data. Conversely, the cost of
78branch mispredictions for the Expat parser can be over 7 cycles per
79XML byte (see Figure \ref{corei3_BM}) --- which exceeds
80the average latency of a byte processed by Parabix-XML.
81
82
83
84
85\begin{figure}
86\subfigure[Branch Instructions / kB]{
87\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
88\label{corei3_BR}
89}
90\hfill
91\subfigure[Branch Misses / kB]{
92\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
93\label{corei3_BM}
94}
95\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
96\end{figure}
97
98\subsection{SIMD Instructions vs. Total Instructions}
99
100In Parabix-XML, the ratio of retired SIMD instructions to total
101instructions provides insight into the relative degree to which
102Parabix-XML achieves parallelism over the byte-at-a-time approach.
103Using the Intel Pin tool, we gathered the dynamic instruction mix for
104each XML workload and classified the instructions as either SIMD
105or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
106percentage of SIMD instructions in the Parabix-XML parser.
107The ratio of executed SIMD instructions over total instructions indicates
108the amount of available parallelism.
109The resulting instruction mix consists of 60\% to 80\% SIMD
110instructions. The markup density of the files influence the number of
111scalar instructions needed to handle the tag processing which affects
112the overall parallelism that can be extracted by Parabix.  We find
113that degradation rate is low and thus the performance
114penalty incurred by increasing the markup density is minimal.
115%Expat and Xerce do not use any SIMD instructions and were not
116%included in this portion of the study.
117
118% Parabix gains its performance by using parallel bit streams, which
119% are mostly generated and calculated by SIMD instructions.  We use Intel
120% pin, a dynamic binary instrumentation tool, to gather instruction
121% mix.  Then we adds up all the vector instructions that have been
122% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
123% show the percentage of SIMD instructions of Parabix1 and Parabix
124% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
125% 18\% to 40\% of the executed instructions consists of SIMD
126% instructions.  By using bistream addition for parallel scanning,
127% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
128% decrease as the markup density increase for both Parabix1 and
129% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
130% performance degradation caused by increasing markup density is
131% smaller.
132
133\subsection{CPU Cycles}
134
135Figure \ref{corei3_TOT} shows overall parser performance in
136terms of CPU cycles per kB. Parabix-XML  is 2.5
137to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
138on data-oriented input.  Traditional parsers can be dramatically
139slowed by dense markup but Parabix-XML is relatively unaffected.
140Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
141processing it; this requires several cycles per byte. However,
142transcoding using parallel bit streams is significantly faster and
143requires less than a single cycle per byte.
144
145\begin{figure}[htbp]
146\begin{minipage}{0.5\linewidth}
147\centering
148\includegraphics[width=\textwidth]{plots/corei3_INS_p2.pdf}
149\caption{SIMD Instruction Percentage}
150\label{corei3_INS_p2}
151\end{minipage}%
152\hfill
153\begin{minipage}{0.5\linewidth}
154\centering
155\includegraphics[width=\textwidth]{plots/corei3_TOT.pdf}
156\caption{Performance (CPU Cycles per kB)}
157\label{corei3_TOT}
158\end{minipage}
159\end{figure}
160
161
162
163\subsection{Power and Energy}
164In this section, we study the power and energy consumption of Parabix-XML
165in comparison with Expat and Xerces on \CITHREE{}.
166Figure \ref{corei3_power} shows the
167average power consumed by each parser. Parabix-XML, dominated by SIMD
168instructions, uses $\sim5\%$ additional power. While the
169SIMD functional units are significantly wider than the scalar
170counterparts, register width and functional unit power account only
171for a small fraction of the overall power consumption in a processor
172pipeline. More importantly by using data parallel operations Parabix
173amortizes the fetch and data access overheads. This results in minimal
174power increase compared to the conventional parsers.  Perhaps the
175energy trends shown in Figure \ref{corei3_energy} reveal an
176interesting trend. Parabix consumes substantially less energy than the
177other parsers. Parabix consumes 50 to 75 nJ per byte while Expat and
178Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
179Although Parabix requires slightly more power (per instruction), the
180processing time of Parabix is significantly lower.
181
182
183
184
185
186
187\begin{figure}
188\subfigure[Avg. Power (Watts)]{
189\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
190\label{corei3_power}
191}
192\hfill
193\subfigure[Energy Consumption ($\mu$J per kB)]{
194\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
195\label{corei3_energy}
196}
197\caption{Power profile of Parabix on \CITHREE{}}
198\end{figure}
199
200
Note: See TracBrowser for help on using the repository browser.