% Source: docs/HPCA2012/05-corei3.tex@1397
% Last changed in revision 1390 by lindanl: fix spelling mistakes
\section{Efficiency of Parabix}
\label{section:baseline}
In this section we analyze the energy and performance characteristics
of the Parabix-based XML parser against the software XML parsers
Xerces and Expat. For our baseline evaluation, we compare all the XML
parsers on a fixed platform, the \CITHREE{}.

%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache behavior}
The approximate miss penalties on \CITHREE\ for the L1, L2, and L3 caches are
4, 11, and 36 cycles respectively. The L1 (32KB) and L2 (256KB) caches
are private per core, while the 4MB L3 is shared by all the
cores. Figure~\ref{cache_misses} shows the cache misses per kilobyte
of input data. Analytically, the cache misses for the Expat and Xerces
parsers cost about 0.5 cycles per XML byte processed. This overhead
is not necessarily reflected in the overall performance of these
parsers, as they experience other overheads related to branch
mispredictions. Parabix's data reorganization significantly improves
the overall cache miss rate. We experience 7$\times$ fewer misses than
Expat and 25$\times$ fewer misses than Xerces at the L1 level, and 104$\times$ fewer misses than
Expat and 15$\times$ fewer misses than Xerces at the L2 level. The improved cache
utilization keeps the SIMD units busy and prevents memory-related
stalls. Note that cache misses also increase application energy
consumption, because accessing higher levels of the cache hierarchy
requires more energy. We estimated with microbenchmarks that L1,
L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ
respectively. For a 1GB XML file, Expat and Xerces would consume over
0.6J and 0.9J respectively due to cache misses alone.
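
As a rough sketch of how such per-byte figures relate to the miss counts
(the symbols $m_i$, $p_i$, and $e_i$ are notation introduced here for
illustration, not measured quantities), let $m_i$ be the number of misses
per kB at cache level $i$, $p_i$ the miss penalty in cycles, and $e_i$
the energy per miss. Under the simplifying assumptions that penalties
simply add up, are not overlapped with other work, and that 1kB is 1024
bytes, the cache-related cost per input byte is approximately
\[
  C_{\mathrm{cache}} \approx \frac{1}{1024}\sum_{i=1}^{3} m_i\,p_i
  \quad\mbox{cycles},
  \qquad
  E_{\mathrm{cache}} \approx \frac{1}{1024}\sum_{i=1}^{3} m_i\,e_i
  \quad\mbox{nJ},
\]
with $(p_1,p_2,p_3) = (4,11,36)$ cycles and
$(e_1,e_2,e_3) = (8.3,19,40)$\,nJ. Read in the other direction, and
taking 1GB as roughly $10^9$ bytes, the totals quoted above correspond to
\[
  \frac{0.6\,\mathrm{J}}{10^{9}\,\mathrm{bytes}} = 0.6\,\mathrm{nJ\ per\ byte}
  \quad\mbox{(Expat)},
  \qquad
  \frac{0.9\,\mathrm{J}}{10^{9}\,\mathrm{bytes}} = 0.9\,\mathrm{nJ\ per\ byte}
  \quad\mbox{(Xerces)},
\]
as lower bounds on the energy spent on cache misses alone.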
%With a 1GB input file, Expat would consume more than 0.6J and Xerces
%would consume 0.9J on cache misses alone.

\begin{figure}
\subfigure[L1 Misses]{
\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
\label{corei3_L1DM}
}
\subfigure[L2 Misses]{
\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
\label{corei3_L2DM}
}
\subfigure[L3 Misses]{
\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
\label{corei3_L3DM}
}
\caption{Cache Misses per kB of input data.}
\label{cache_misses}
\end{figure}

\subsection{Branch Mispredictions}
\label{section:XML-branches}
In general, reducing the branch misprediction rate is difficult in
text-based XML parsing applications. This is due to (1) the
variable-length nature of the syntactic elements contained within XML
documents, (2) the data-dependent nature of parsing decisions, and (3) the extensive
set of syntax constraints imposed by XML. The performance of traditional
byte-at-a-time XML parsers is limited by the number of
branch mispredictions. As shown in Figure~\ref{corei3_BR}, Xerces
averages up to 13 branches per XML byte processed on high-density
markup. On modern commodity processors, a single branch
misprediction can cost tens of CPU cycles to restart the processor
pipeline. The high misprediction rate in conventional parsers adds
significant overhead. In Parabix, the transformation to SIMD operations
eliminates many branches. Further optimizations take advantage of
Parabix's data organization and replace conditional branches with {\em
  bit scan} operations that can process up to 128 characters' worth of
branches with one operation. In many cases, we also replace
branches with logical predicate operations. Our predicates are cheaper
to compute since they involve only bit-parallel SIMD operations.
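
To illustrate the bit scan idea on a small scale, consider a
hypothetical 16-position marker stream $m$ in which bit $i$ is set when
position $i$ of the input holds a character of interest. The value below
is made up for illustration, and we assume that bit 0 is the rightmost
(least significant) bit; the actual bit ordering used by Parabix may
differ.
\[
  m = 0000\,0100\,0010\,0000_2 .
\]
A single bit scan returns the index of the lowest set bit,
\[
  \mathrm{scan}(m) = \min\{\, i : m_i = 1 \,\} = 5 ,
\]
whereas a byte-at-a-time loop would evaluate one conditional branch for
each of the positions 0 through 5. Clearing that bit with
\[
  m \mathbin{\&} (m - 1) = 0000\,0100\,0000\,0000_2
\]
and scanning again yields the next position of interest, 10, again
without any data-dependent branch per character.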

As shown in Figure~\ref{corei3_BR},
Parabix processing is almost branch free. Parabix exhibits minimal
dependence on source XML markup density; it experiences between 19.5 and
30.7 branch mispredictions per thousand XML bytes. The cost of
branch mispredictions for the Expat parser can be over 7 cycles per
XML byte (see Figure~\ref{corei3_BM}); this cost alone is higher
than the average latency of a byte processed by Parabix.
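
A back-of-the-envelope estimate puts the Parabix misprediction rate in
perspective. If $b$ denotes the number of mispredictions per thousand
bytes and $p$ the misprediction penalty in cycles (both symbols are
introduced here for illustration), the misprediction cost per byte is
\[
  C_{\mathrm{br}} = \frac{b\,p}{1000} .
\]
Taking an illustrative penalty of $p = 15$ cycles, which is an assumption
on our part since the penalty is only characterized above as tens of
cycles, the measured range $b = 19.5$ to $30.7$ gives
\[
  C_{\mathrm{br}} \approx 0.29 \mbox{ to } 0.46 \mbox{ cycles per byte},
\]
well below the cost of over 7 cycles per byte observed for Expat.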

\begin{figure}
\subfigure[Branch Instructions]{
\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
\label{corei3_BR}
}
\hfill
\subfigure[Branch Misses]{
\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
\label{corei3_BM}
}
\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
\end{figure}

\subsection{SIMD Instructions vs. Total Instructions}

In Parabix, bit streams are both computed and
predominantly operated upon using the SIMD instructions of commodity
processors. The ratio of retired SIMD instructions to total
instructions provides insight into the relative degree to which
Parabix achieves parallelism over the byte-at-a-time approach.

Using the Intel Pin tool, we gather the dynamic instruction mix for
each XML workload and classify instructions as either vector (SIMD)
or non-vector instructions. Figure~\ref{corei3_INS_p2} shows the
percentage of SIMD instructions for the Parabix XML parser. This
ratio indicates the amount of
parallel processing we were able to extract.
%(Expat and Xerces do not use any SIMD instructions)
The Parabix instruction mix is made up of 60\% to 80\% SIMD
instructions. The markup density of a file influences the number of
scalar instructions needed to handle tag processing, which in turn affects
the overall parallelism that Parabix can extract. We find
that the degradation rate is low and thus the performance
penalty incurred by increasing markup density is minimal.
%Expat and Xerces do not use any SIMD instructions and were not
%included in this portion of the study.

% Parabix gains its performance by using parallel bitstreams, which
% are mostly generated and calculated by SIMD instructions.  We use Intel
% pin, a dynamic binary instrumentation tool, to gather instruction
% mix.  Then we adds up all the vector instructions that have been
% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
% show the percentage of SIMD instructions of Parabix1 and Parabix
% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
% 18\% to 40\% of the executed instructions consists of SIMD
% instructions.  By using bistream addition for parallel scanning,
% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
% decrease as the markup density increase for both Parabix1 and
% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
% performance degradation caused by increasing markup density is
% smaller.

\subsection{CPU Cycles}

Figure~\ref{corei3_TOT} shows overall parser performance evaluated in
terms of CPU cycles per kilobyte. The Parabix parser is 2.5$\times$
to 4$\times$ faster on document-oriented input and 4.5$\times$ to 7$\times$ faster
on data-oriented input. Traditional parsers can be dramatically
slowed by dense markup, while Parabix is affected much less. The
results presented are not entirely fair to the Xerces parser since it
first transcodes input from UTF-8 to UTF-16 before processing. In
Xerces, this transcoding requires several cycles per byte. However,
transcoding using parallel bit streams is significantly faster and
requires less than a single cycle per byte.

\begin{figure}[htbp]
\begin{minipage}{0.5\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/corei3_INS_p2.pdf}
\caption{SIMD Instruction Percentage}
\label{corei3_INS_p2}
\end{minipage}%
\hfill
\begin{minipage}{0.5\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/corei3_TOT.pdf}
\caption{Performance (CPU Cycles per kB)}
\label{corei3_TOT}
\end{minipage}
\end{figure}

\subsection{Power and Energy}
In this subsection, we study the power and energy consumption of Parabix
in comparison with Expat and Xerces on \CITHREE{}. The average power
of \CITHREE\ is about 21 watts. Figure~\ref{corei3_power} shows the
average power consumed by each parser. Parabix, which is dominated by SIMD
instructions, uses approximately 5\% additional power. Although the
SIMD functional units are significantly wider than their scalar
counterparts, register width and functional unit power account for only
a small fraction of the overall power consumption in a processor
pipeline. More importantly, by using data-parallel operations, Parabix
amortizes the fetch and data access overheads. This results in a minimal
power increase compared to the conventional parsers. The
energy results shown in Figure~\ref{corei3_energy} reveal a more
interesting trend. Parabix consumes substantially less energy than the
other parsers. Parabix consumes 50nJ to 75nJ per byte, while Expat and
Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
Although Parabix requires slightly more power (per instruction), its
processing time is significantly lower, so its total energy consumption is lower.
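
The interplay between power and processing time can be sketched with the
simple model $E = P \cdot t$, a simplification that treats average power
as constant over a run. If Parabix draws about $1.05\times$ the power of
a conventional parser and completes the same input $s$ times faster
(the symbol $s$ is introduced here for illustration), then
\[
  \frac{E_{\mathrm{Parabix}}}{E_{\mathrm{conv}}}
  \approx \frac{1.05\,P \cdot t/s}{P \cdot t}
  = \frac{1.05}{s} .
\]
For the speedups of 2.5$\times$ to 7$\times$ reported in the CPU cycles
comparison, this rough model gives energy ratios of about 0.42 down to
0.15. It is only an illustrative estimate; the measured values are those
shown in Figure~\ref{corei3_energy}.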

\begin{figure}
\subfigure[Avg. Power (Watts)]{
\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
\label{corei3_power}
}
\hfill
\subfigure[Energy Consumption ($\mu$J per kB)]{
\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
\label{corei3_energy}
}
\caption{Power profile of Parabix on \CITHREE{}}
\end{figure}
