% Source: docs/HPCA2012/05-corei3.tex, revision 1335 (checked in by ashriram).
% Revision note: Working on evaluation. Fixed Figure sizes
\section{Baseline Evaluation on \CITHREE{}}

%some of the numbers are roughly calculated, needs to be recalculated for final version
\subsection{Cache behavior}
\CITHREE\ has a three-level cache hierarchy.  The approximate miss
penalty for each cache level is 4, 11, and 36 cycles respectively.
Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
\ref{corei3_L3DM} show the L1, L2 and L3 data cache misses for each of
the parsers.  Although XML parsing is not a memory-intensive
application, cache misses for the Expat and Xerces parsers represent a
cost of 0.5 cycles per XML byte, whereas the performance of the Parabix
parsers remains essentially unaffected by data cache misses.  Cache
misses not only consume additional CPU cycles but also increase
application energy consumption.  L1, L2, and L3 cache misses consume
approximately 8.3\,nJ, 19\,nJ, and 40\,nJ respectively. As such, given a
1\,GB XML file as input, Expat and Xerces would consume over 0.6\,J and
0.9\,J respectively due to cache misses alone.
%With a 1GB input file, Expat would consume more than 0.6J and Xerces
%would consume 0.9J on cache misses alone.
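
As a rough cross-check of these figures, the energy attributable to cache
misses can be modelled as a weighted sum of miss counts. The symbols $N_i$
(number of misses at cache level $i$) and $e_i$ (energy per miss at level
$i$) are introduced here for illustration only, and the model assumes that
the quoted energies are per-miss costs:
\[
E_{\mathrm{miss}} \approx \sum_{i=1}^{3} N_i \, e_i ,
\qquad
e_1 \approx 8.3\,\mathrm{nJ}, \quad
e_2 \approx 19\,\mathrm{nJ}, \quad
e_3 \approx 40\,\mathrm{nJ} .
\]
Under this model, if the 0.6\,J quoted for Expat were hypothetically due to
L3 misses alone, it would correspond to
$0.6\,\mathrm{J} / 40\,\mathrm{nJ} = 1.5 \times 10^{7}$ misses. Since L1 and
L2 misses also contribute, the actual L3 miss count would be lower.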

\begin{figure}
\subfigure[L1 Misses]{
\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
\label{corei3_L1DM}
}
\subfigure[L2 Misses]{
\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
\label{corei3_L2DM}
}
\subfigure[L3 Misses]{
\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
\label{corei3_L3DM}
}
\caption{Cache Misses per kB of input data.}
\end{figure}

\subsection{Branch Mispredictions}
Despite improvements in branch prediction, branch misprediction
penalties contribute significantly to the cost of XML parsing. On
modern commodity processors the cost of a single branch misprediction
is commonly cited as over 10 CPU cycles.  As shown in Figure
\ref{corei3_BM}, the cost of branch mispredictions for the Expat
parser can be over 7 cycles per XML byte; this cost alone is equal to
the average total cost for Parabix2 to process each byte of XML.
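
The per-byte figure can be related to the misprediction rate with a simple
first-order model. The symbols $m$ (mispredictions per XML byte), $p$
(penalty in cycles per misprediction) and $c_{\mathrm{bm}}$ (misprediction
cost in cycles per byte) are illustrative names, not quantities defined
elsewhere in this paper:
\[
c_{\mathrm{bm}} \approx m \cdot p .
\]
Taking the two quoted figures at face value, $c_{\mathrm{bm}} = 7$ cycles
per byte and $p = 10$ cycles, gives $m \approx 0.7$ mispredictions per byte.
Since both figures are quoted as lower bounds, this is only a rough
indication of scale.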

In general, reducing the branch misprediction rate is difficult in
text-based XML parsing applications. This is due in part to the
variable-length nature of the syntactic elements contained within XML
documents, a data-dependent characteristic, as well as the extensive
set of syntax constraints imposed by the XML 1.0 specification. As
such, traditional byte-at-a-time XML parsers generate a
performance-limiting number of branch mispredictions.  As shown in Figure
\ref{corei3_BR}, Xerces averages up to 13 branches per XML byte
processed on high-density markup.

The performance improvement of Parabix1 in terms of branch
mispredictions results from the virtual elimination of conditional
branch instructions in scanning. Leveraging the processor's built-in
{\em bit scan} operation together with parallel bit stream technology,
Parabix1 can scan up to 64 bytes of source XML with a single {\em bit
  scan} instruction. In comparison, a byte-at-a-time parser must
process a conditional branch instruction per XML byte scanned.
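
To illustrate the difference in scale, consider scanning $n$ bytes. This is
an idealized count; the symbols $B_{\mathrm{byte}}$ (branches executed by a
byte-at-a-time scanner) and $B_{\mathrm{scan}}$ (bit scan operations) are
introduced here for illustration only:
\[
B_{\mathrm{byte}}(n) = n ,
\qquad
B_{\mathrm{scan}}(n) \geq \left\lceil \frac{n}{64} \right\rceil .
\]
For $n = 1024$ bytes, a byte-at-a-time scanner executes 1024 conditional
branches, while a scanner based on bit scan needs at least $1024/64 = 16$
scan operations. The lower bound is reached only when every scan covers a
full 64-byte span; a scan that finds a marked position earlier covers fewer
bytes, so the count presumably rises with markup density.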

As shown in Figure \ref{corei3_BR}, Parabix2 processing is almost
branch free. Utilizing a new parallel scanning technique based on bit
stream addition, Parabix2 exhibits minimal dependence on source XML
markup density. Figure \ref{corei3_BM} displays this lack of data
dependence via the constant number of branch mispredictions shown for
each of the source XML files.
% Parabix1 minimize the branches by using parallel bit
% streams.  Parabix1 still have a few branches for each block of 128
% bytes (SSE) due to the sequential scanning.  But with the new parallel
% scanning technique, Parabix2 is essentially branch-free as shown in
% the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
% dependency on the markup density of the workloads.


\begin{figure}
\subfigure[Branch Instructions]{
\includegraphics[width=0.45\textwidth]{plots/corei3_BR.pdf}
\label{corei3_BR}
}
\hfill
\subfigure[Branch Misses]{
\includegraphics[width=0.42\textwidth]{plots/corei3_BM.pdf}
\label{corei3_BM}
}
\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
\end{figure}

\subsection{SIMD Instructions vs. Total Instructions}

Parabix achieves performance via parallel bit stream technology. In
Parabix XML processing, parallel bit streams are both computed and
predominantly operated upon using the SIMD instructions of commodity
processors.  The ratio of retired SIMD instructions to total
instructions provides insight into the relative degree to which
Parabix achieves parallelism over the byte-at-a-time approach.

Using the Intel Pin tool, we gather the dynamic instruction mix for
each XML workload and classify instructions as either vector (SIMD)
or non-vector instructions.  Figures \ref{corei3_INS_p1} and
\ref{corei3_INS_p2} show the percentage of SIMD instructions for
Parabix1 and Parabix2 respectively.
%(Expat and Xerce do not use any SIMD instructions)
For Parabix1, 18\% to 40\% of the executed instructions are SIMD
instructions.  Using bit stream addition to scan XML characters in
parallel, the Parabix2 instruction mix is made up of 60\% to 80\%
SIMD instructions.  Although the SIMD ratio decreases as markup density
increases for both Parabix1 and Parabix2, the rate of decrease for
Parabix2 is much lower and thus the performance penalty incurred by
increasing the markup density is reduced.
%Expat and Xerce do not use any SIMD instructions and were not
%included in this portion of the study.

% Parabix gains its performance by using parallel bitstreams, which
% are mostly generated and calculated by SIMD instructions.  The ratio
% of executed SIMD instructions over total instructions indicates the
% amount of parallel processing we were able to achieve.  We use Intel
% pin, a dynamic binary instrumentation tool, to gather instruction
% mix.  Then we adds up all the vector instructions that have been
% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
% show the percentage of SIMD instructions of Parabix1 and Parabix2
% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
% 18\% to 40\% of the executed instructions consists of SIMD
% instructions.  By using bistream addition for parallel scanning,
% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
% decrease as the markup density increase for both Parabix1 and
% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
% performance degradation caused by increasing markup density is
% smaller.

\subsection{CPU Cycles}

Figure \ref{corei3_TOT} shows overall parser performance evaluated in
terms of CPU cycles per kilobyte.  Parabix1 is 1.5 to 2.5 times faster
on document-oriented input and 2 to 3 times faster on data-oriented
input than the Expat and Xerces parsers.  Parabix2 is 2.5
to 4 times faster on document-oriented input and 4.5 to 7 times faster
on data-oriented input.  Traditional parsers can be dramatically
slowed by dense markup, while Parabix2 is generally unaffected.  The
results presented are not entirely fair to the Xerces parser since it
first transcodes input from UTF-8 to UTF-16 before processing. In
Xerces, this transcoding requires several cycles per byte.  However,
transcoding using parallel bit streams is significantly faster and
requires less than a single cycle per byte~\cite{Cameron2008}.


\begin{figure}
\subfigure[Performance: Cycles per kB]{
\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
\label{corei3_TOT}
}
\hfill
\subfigure[SIMD Instruction Breakdown. Y Axis: \% SIMD Instructions per kB]{
\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
\label{corei3_INS_p2}
}
\end{figure}


\subsection{Power and Energy}
In response to growing industry concerns about power consumption and
energy efficiency, chip producers work hard not only to improve
performance but also to achieve high energy efficiency in processor
design. We study the power and energy consumption of Parabix in
comparison with Expat and Xerces on \CITHREE{}. The average power of
\CITHREE\ 530 is about 21 watts.  This Intel model has a good
reputation for power efficiency. Figure \ref{corei3_power} shows the
average power consumed by each parser.  Parabix2, dominated by SIMD
instructions, uses approximately 5\% additional power.

\begin{figure}
\subfigure[Avg. Power (Watts)]{
\includegraphics[width=0.4\textwidth]{plots/corei3_power.pdf}
\label{corei3_power}
}
\hfill
\subfigure[Energy Consumption ($\mu$J per kB)]{
\includegraphics[width=0.4\textwidth]{plots/corei3_energy.pdf}
\label{corei3_energy}
}
\end{figure}

As shown in Figure \ref{corei3_energy}, a comparison of energy
efficiency demonstrates a more interesting result. Although Parabix2
requires slightly more power (per instruction), the processing time of
Parabix2 is significantly lower, and therefore Parabix2 consumes
substantially less energy than the other parsers. Parabix2 consumes 50
to 75\,nJ per byte, while Expat and Xerces consume 80\,nJ to 320\,nJ and
140\,nJ to 370\,nJ per byte respectively.
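
The relationship between power, processing time and energy used in this
argument can be written as a first-order model. The symbols $P$ (average
power), $C$ (cycles consumed) and $f$ (clock frequency) are illustrative,
and the model assumes that all parsers run at the same fixed clock
frequency, which is not stated in this section:
\[
E = P \cdot t = P \cdot \frac{C}{f} ,
\qquad
\frac{E_{\mathrm{Parabix2}}}{E_{\mathrm{other}}}
  = \frac{P_{\mathrm{Parabix2}}}{P_{\mathrm{other}}}
    \cdot \frac{C_{\mathrm{Parabix2}}}{C_{\mathrm{other}}} .
\]
As a rough check, a 5\% power increase combined with a speedup between 2.5
and 7 gives an estimated energy ratio between $1.05/7 = 0.15$ and
$1.05/2.5 = 0.42$, so the reduction in cycles, not the small increase in
power, dominates the energy comparison.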