source: docs/HPCA2012/05-corei3.tex @ 1365

Last change on this file since 1365 was 1365, checked in by ashriram, 8 years ago

Fixed methodology

File size: 9.3 KB
Line 
1\section{Efficiency of the Parabix}
2\label{section:baseline}
3
4%[AS] Need to say what we are doing here.
5
6brief, for each of the four XML parsers under study we propose to measure
7and evaluate the energy consumption required to carry out XML
8well-formedness checking, under a variety of workloads, and as
9executed on three different Intel processors.
10
11
12
13
14%some of the numbers are roughly calculated, needs to be recalculated for final version
15\subsection{Cache behavior}
16\CITHREE\  cache hierarchy.  The approximate miss
17penalty for each cache level is 4, 11, and 36 cycles respectively.
18Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
19\ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of
20the parsers.  Although XML parsing is non memory intensive
21application, cache misses for the Expat and Xerces parsers represent a
220.5 cycle per XML byte cost whereas the performance of the Parabix
23parsers remains essentially unaffected by data cache misses.  Cache
24misses not only consume additional CPU cycles but increase application
25energy consumption.  L1, L2, and L3 cache misses consume approximately
268.3nJ, 19nJ, and 40nJ respectively. As such, given a 1GB XML file as
27input, Expat and Xerces would consume over 0.6J and 0.9J respectively
28due to cache misses alone.
29%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
30%would consume 0.9J on cache misses alone.
31
32
33\begin{figure}
34\subfigure[L1 Misses]{
35\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
36\label{corei3_L1DM}
37}
38\subfigure[L2 Misses]{
39\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
40\label{corei3_L2DM}
41}
42\subfigure[L3 Misses]{
43\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
44\label{corei3_L3DM}
45}
46\caption{Cache Misses per kB of input data.}
47\end{figure}
48
49\subsection{Branch Mispredictions}
50Despite improvements in branch prediction, branch misprediction
51penalties contribute significantly to XML parsing performance. On
52modern commodity processors the cost of a single branch misprediction
53is commonly cited as over 10 CPU cycles.  As shown in Figure
54\ref{corei3_BM}, the cost of branch mispredictions for the Expat
55parser can be over 7 cycles per XML byte---this cost alone is equal to
56the average total cost for Parabix2 to process each byte of XML.
57
58In general, reducing the branch misprediction rate is difficult in
59text-based XML parsing applications. This is due in part to the
60variable length nature of the syntactic elements contained within XML
61documents, a data dependent characterstic, as well as the extensive
62set of syntax constraints imposed by the XML 1.0 specification. As
63such, traditional byte-at-a-time XML parsers generate a performance
64limiting number of branch mispredictions.  As shown in Figure
65\ref{corei3_BR}, Xerces averages up to 13 branches per XML byte
66processed on high density markup.
67
68The performance improvement of Parabix1 in terms of branch
69mispredictions results from the veritable elimination of conditional
70branch instructions in scanning. Leveraging the processor built-in
71{\em bit scan} operation together with parallel bit stream technology
72Parabix1 can scan up to 64 bytes of source XML with a single {\em bit
73  scan} instruction. In comparison, a byte-at-a-time parser must
74process a conditional branch instruction per XML byte scanned.
75
76As shown in Figure \ref{corei3_BR}, Parabix2 processing is almost
77branch free. Utilizing a new parallel scanning technique based on bit
78stream addition, Parabix2 exhibits minimal dependence on source XML
79markup density. Figure \ref{corei3_BR} displays this lack of data
80dependence via the constant number of branch mispredictions shown for
81each of the source XML files.
82% Parabix1 minimize the branches by using parallel bit
83% streams.  Parabix1 still have a few branches for each block of 128
84% bytes (SSE) due to the sequential scanning.  But with the new parallel
85% scanning technique, Parabix2 is essentially branch-free as shown in
86% the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
87% dependency on the markup density of the workloads.
88
89
90\begin{figure}
91\subfigure[Branch Instructions]{
92\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
93\label{corei3_BR}
94}
95\hfill
96\subfigure[Branch Misses]{
97\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
98\label{corei3_BM}
99}
100\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
101\end{figure}
102
103\subsection{SIMD Instructions vs. Total Instructions}
104
105Parabix achieves performance via parallel bit stream technology. In
106Parabix XML processing, parallel bit streams are both computed and
107predominately operated upon using the SIMD instructions of commodity
108processors.  The ratio of retired SIMD instructions to total
109instructions provides insight into\ the relative degree to which
110Parabix achieves parallelism over the byte-at-a-time approach.
111
112Using the Intel Pin tool, we gather the dynamic instruction mix for
113each XML workload, and classify instructions as either vector (SIMD)
114or non-vector instructions.  Figures \ref{corei3_INS_p1} and
115\ref{corei3_INS_p2} show the percentage of SIMD instructions for
116Parabix1 and Parabix2 respectively.
117%(Expat and Xerce do not use any SIMD instructions)
118For Parabix1, 18\% to 40\% of the executed instructions are SIMD instructions.  Using
119bit stream addition to scan XML characters in parallel, the Parabix2 instruction mix is made up of 60\% to 80\%
120SIMD instructions.  Although the resulting ratios are (negatively) proportional to the markup density
121for both Parabix1 and Parabix2, the degradation rate of
122Parabix2 is much lower and thus the performance penalty incurred by
123increasing the markup density is reduced.
124%Expat and Xerce do not use any SIMD instructions and were not
125%included in this portion of the study.
126
127% Parabix gains its performance by using parallel bitstreams, which
128% are mostly generated and calculated by SIMD instructions.  The ratio
129% of executed SIMD instructions over total instructions indicates the
130% amount of parallel processing we were able to achieve.  We use Intel
131% pin, a dynamic binary instrumentation tool, to gather instruction
132% mix.  Then we adds up all the vector instructions that have been
133% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
134% show the percentage of SIMD instructions of Parabix1 and Parabix2
135% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
136% 18\% to 40\% of the executed instructions consists of SIMD
137% instructions.  By using bistream addition for parallel scanning,
138% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
139% decrease as the markup density increase for both Parabix1 and
140% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
141% performance degradation caused by increasing markup density is
142% smaller.
143
144\subsection{CPU Cycles}
145
146Figure \ref{corei3_TOT} shows overall parser performance evaluated in
147terms of CPU cycles per kilobyte.  Parabix1 is 1.5 to 2.5 times faster
148on document-oriented input and 2 to 3 times faster on data-oriented
149input than the Expat and Xerces parsers respectively.  Parabix2 is 2.5
150to 4 times faster on document-oriented input and 4.5 to 7 times faster
151on data-oriented input.  Traditional parsers can be dramatically
152slowed by dense markup, while Parabix2 is generally unaffected.  The
153results presented are not entirely fair to the Xerces parser since it
154first transcodes input from UTF-8 to UTF-16 before processing. In
155Xerces, this transcoding requires several cycles per byte.  However,
156transcoding using parallel bit streams is significantly faster and
157requires less than a single cycle per byte.  \cite{Cameron2008}.
158
159
160\begin{figure}
161\subfigure[Performance : \# Cycles/kb]{
162\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
163\label{corei3_TOT}
164}
165\hfill
166\subfigure[SIMD Instruction Breakdown. Y Axis :  \% SIMD Instruction/kb]{
167\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
168\label{corei3_INS_p2}
169}
170\end{figure}
171
172
173\subsection{Power and Energy}
174In response to the growing industry concerns on power consumption and
175energy efficiency, chip producers work hard to not only improve
176performance but also achieve high energy efficiency in processors
177design. We study the power and energy consumption of Parabix in
178comparison with Expat and Xerces on \CITHREE{}. The average power of
179\CITHREE\ 530 is about 21 watts.  This Intel model has a good
180reputation for power efficiency. Figure \ref{corei3_power} shows the
181average power consumed by each parser.  Parabix2, dominated by SIMD
182instructions, uses approximately 5\% additional power.
183
184
185
186
187\begin{figure}
188\subfigure[Avg. Power (Watts)]{
189\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
190\label{corei3_power}
191}
192\hfill
193\subfigure[Energy Consumption ($\mu$J per kB)]{
194\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
195\label{corei3_energy}
196}
197\end{figure}
198
199As shown in Figure \ref{corei3_energy}, a comparison of energy
200efficiency demonstrates a more interesting result. Although Parabix2
201requires slightly more power (per instruction), the processing time of
202Parabix2 is significantly lower, and therefore Parabix2 consumes
203substantially less energy than the other parsers. Parabix2 consumes 50
204to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
205140nJ to 370nJ per byte respectively.
206
Note: See TracBrowser for help on using the repository browser.