source: docs/HPCA2012/05-corei3.tex @ 1366

Last change on this file since 1366 was 1366, checked in by ashriram, 8 years ago

Finished Section 5

File size: 8.8 KB
Line 
1\section{Efficiency of the Parabix}
2\label{section:baseline}
3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
6parsers on a fixed platform, the \CITHREE{}.
7
8
9%some of the numbers are roughly calculated, needs to be recalculated for final version
10\subsection{Cache behaviour}
11The approximate miss penalty in \CITHREE\ for L1, L2 and L3 caches is
124, 11, and 36 cycles respectively.  The L1 (32KB) and L2 cache (256KB)
13are private per core, while the 4MB L3 is shared by all the
14cores. Figure \ref{cache_misses} shows the cache misses per kilobyte
15of input data. Analytically, the cache misses for the Expat and Xerces
16parsers represent a 0.5 cycle per XML byte processed. This overhead
17does not necessarily reflect in the overall performance of these
18parsers as they experience other overheads related to branch
19mispredictions. Parabix's data reorganization significantly improves
20the overall cache miss rate. We experience ?$\times$ less misses at
21the L1 and ?$\times$ less misses at the L2 level. The improved cache
22utilization keeps the SIMD units busy and prevent memory related
23stalls. Note that cache misses also cause increased application energy
24consumption due to increased energy required to access higher levels
25in the cache hierarchy. We estimated with microbenchmarks that the L1,
26L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ
27respectively. For a 1GB XML file Expat and Xerces would consume over
280.6J and 0.9J respectively due to cache misses alone.
29%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
30%would consume 0.9J on cache misses alone.
31
32
33\begin{figure}
34\subfigure[L1 Misses]{
35\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
36\label{corei3_L1DM}
37}
38\subfigure[L2 Misses]{
39\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
40\label{corei3_L2DM}
41}
42\subfigure[L3 Misses]{
43\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
44\label{corei3_L3DM}
45}
46\caption{Cache Misses per kB of input data.}
47\label{cache_misses}
48\end{figure}
49
50\subsection{Branch Mispredictions}
51In general, reducing the branch misprediction rate is difficult in
52text-based XML parsing applications. This is due to (1) variable
53length nature of the syntactic elements contained within XML
54documents, (2) a data dependent characteristic, and (3) the extensive
55set of syntax constraints imposed by the XML. Traditional
56byte-at-a-time XML parser's performance is limited by the number of
57branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces
58averages up to 13 branches per XML byte processed on high density
59markup. On modern commodity processors the cost of a single branch
60misprediction is incur over 10s of CPU cycles to restart the processor
61pipeline. The high miss prediction rate in conventional parsers add
62significant overhead. In Parabix the transformation to SIMD operation
63eliminates many branches. Further optimizations take advantage of
64Parabix's data organization and replace condition branches with {\em
65  bit scan} operations that can process up to 64 characters worth of
66branches with one operation. In many cases, we also replace the
67branches with logical predicate operations. Our predicate are cheaper
68to compute since they involve only bit parallel SIMD operations.
69
70 As shown in Figure \ref{corei3_BR},
71Parabix processing is almost branch free. Parabix exhibits minimal
72dependence on source XML markup density; it experiences a constant
73number of branch mispredictions irrespective of the input. The cost of
74branch mispredictions for the Expat parser can be over 7 cycles per
75XML byte (see Figure \ref{corei3_BM}) ---this cost alone is higher
76than the average latency of a byte processed by Parabix.
77
78
79
80
81\begin{figure}
82\subfigure[Branch Instructions]{
83\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
84\label{corei3_BR}
85}
86\hfill
87\subfigure[Branch Misses]{
88\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
89\label{corei3_BM}
90}
91\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
92\end{figure}
93
94\subsection{SIMD Instructions vs. Total Instructions}
95
96In Parabix, bit streams are both computed and
97predominately operated upon using the SIMD instructions of commodity
98processors.  The ratio of retired SIMD instructions to total
99instructions provides insight into the relative degree to which
100Parabix achieves parallelism over the byte-at-a-time approach.
101
102
103Using the Intel Pin tool, we gather the dynamic instruction mix for
104each XML workload, and classify instructions as either vector (SIMD)
105or non-vector instructions.  Figures~\ref{corei3_INS_p2} show the
106percentage of SIMD instructions for Parabix. The ratio of executed
107SIMD instructions over total instructions indicates the amount of
108parallel processing we were able to extract.
109%(Expat and Xerce do not use any SIMD instructions)
110The Parabix instruction mix is made up of 60\% to 80\% SIMD
111instructions.  The markup density of the files influence the number of
112scalar instructions needed to handle the tag processing which affects
113the overall parallelism that can be extracted by Parabix.  We find
114that degradation rate is low and thus the performance
115penalty incurred by increasing the markup density is minimal.
116%Expat and Xerce do not use any SIMD instructions and were not
117%included in this portion of the study.
118
119% Parabix gains its performance by using parallel bitstreams, which
120% are mostly generated and calculated by SIMD instructions.  We use Intel
121% pin, a dynamic binary instrumentation tool, to gather instruction
122% mix.  Then we adds up all the vector instructions that have been
123% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
124% show the percentage of SIMD instructions of Parabix1 and Parabix
125% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
126% 18\% to 40\% of the executed instructions consists of SIMD
127% instructions.  By using bistream addition for parallel scanning,
128% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
129% decrease as the markup density increase for both Parabix1 and
130% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
131% performance degradation caused by increasing markup density is
132% smaller.
133
134\subsection{CPU Cycles}
135
136Figure \ref{corei3_TOT} shows overall parser performance evaluated in
137terms of CPU cycles per kilobyte.  Parabix is 2.5$\times$
138to 4$\times$ faster on document-oriented input and 4.5 to 7 times faster
139on data-oriented input.  Traditional parsers can be dramatically
140slowed by dense markup, while Parabix2 is generally unaffected.  The
141results presented are not entirely fair to the Xerces parser since it
142first transcodes input from UTF-8 to UTF-16 before processing. In
143Xerces, this transcoding requires several cycles per byte.  However,
144transcoding using parallel bit streams is significantly faster and
145requires less than a single cycle per byte.
146
147
148\begin{figure}
149\subfigure[Performance : \# Cycles/kb]{
150\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
151\label{corei3_TOT}
152}
153\hfill
154\subfigure[SIMD Instruction Breakdown. Y Axis :  \% SIMD Instruction/kb]{
155\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
156\label{corei3_INS_p2}
157}
158\end{figure}
159
160
161\subsection{Power and Energy}
162In this section, we study the power and energy consumption of Parabix in
163comparison with Expat and Xerces on \CITHREE{}. The average power of
164\CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the
165average power consumed by each parser.  Parabix, dominated by SIMD
166instructions which uses approximately 5\% additional power. While the
167SIMD functional units are significantly wider than the scalar
168counterparts; register width and functional unit power account only
169for a small fraction of the overall power consumption in a processor
170pipeline. More importantly by using data parallel operations Parabix
171amortizes the fetch and data access overheads. This results in minimal
172power increase compared to the conventional parsers. 
173Perhaps the energy trends shown in Figure
174\ref{corei3_energy} reveal an interesting trend. Parabix consumes
175substantially less energy than the other parsers. Parabix consumes 50
176to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
177140nJ to 370nJ per byte respectively.  Although Parabix
178requires slightly more power (per instruction), the processing time of
179Parabix is significantly lower.
180
181
182
183
184
185
186\begin{figure}
187\subfigure[Avg. Power (Watts)]{
188\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
189\label{corei3_power}
190}
191\hfill
192\subfigure[Energy Consumption ($\mu$J per kB)]{
193\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
194\label{corei3_energy}
195}
196\end{figure}
197
198
Note: See TracBrowser for help on using the repository browser.