source: docs/HPCA2012/05-corei3.tex @ 1738

Last change on this file since 1738 was 1692, checked in by lindanl, 8 years ago

Some figure adjustment to the new template

File size: 8.7 KB
[1400]1\section{Efficiency of the Parabix-XML Parser}
[1366]3In this section we analyze the energy and performance characteristics
4of the Parabix-based XML parser against the software XML parsers,
5Xerces and Expat. For our baseline evaluation, we compare all the XML
[1400]6parsers on the \CITHREE{}.
[1302]9%some of the numbers are roughly calculated, needs to be recalculated for final version
[1390]10\subsection{Cache behavior}
[1400]11The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
124, 11, and 36 cycles respectively. The L1 (32KB) and L2 cache (256KB)
13are private per core; L3 (4MB) is shared by all the cores.
14Figure \ref{cache_misses} shows the cache misses per kilobyte
[1366]15of input data. Analytically, the cache misses for the Expat and Xerces
[1644]16parsers represent a 0.5 cycle per XML byte cost. This overhead
[1645]17does not necessarily impact the overall performance of these
18parsers as they experience additional overheads related to branch mispredictions.
[1400]19Compared to Xerces and Expat, the data organization of Parabix-XML significantly
20reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$ 
21fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache
22utilization helps keep the SIMD units busy by minimizing memory-related stalls
23and lowers the overall energy consumption
24by reducing the need to access the higher levels of the cache hierarchy.
25Using microbenchmarks, we estimated that the L1,
26L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ
27respectively. On average, with a 1GB XML file, Expat and Xerces would consume over
[1366]280.6J and 0.9J respectively due to cache misses alone.
[1302]29%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
30%would consume 0.9J on cache misses alone.
[1335]34\subfigure[L1 Misses]{
38\subfigure[L2 Misses]{
42\subfigure[L3 Misses]{
46\caption{Cache Misses per kB of input data.}
50\subsection{Branch Mispredictions}
[1421]52In general, performance is limited by branch mispredictions.
53Unfortunately, it is difficult to reduce the branch misprediction rate of
54traditional XML parsers due to:
55(1) the variable length nature of the syntactic elements contained within XML documents;
56(2) a data dependent characteristic, and
57(3) the extensive set of syntax constraints imposed by the XML 1.0/1.1 specifications.
[1400]58% Branch mispredictions are known
59% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
60% \cite{CameronHerdyLin2008}.
61As shown in Figure \ref{corei3_BR},
62Xerces averages up to 13 branches per XML byte processed on high density
[1366]63markup. On modern commodity processors the cost of a single branch
[1646]64misprediction is on the order of 10s of CPU cycles spent to restart the processor
[1647]67The high miss prediction rate in conventional parsers is a significant overhead.
[1400]68In Parabix-XML, the use of SIMD operations eliminates many branches.
69Most conditional branches can be replaced with
70bitwise operations, which can process up to 128 characters worth of
71branches with one operation
72or with a series of logical predicate operations, which are cheaper
73to compute since they require only SIMD operations.
[1400]75As shown in Figure \ref{corei3_BR},
76Parabix-XML is nearly branch free and exhibits minimal dependence on the
77source markup density. Specifically, it experiences between 19.5 and
7830.7 branch mispredictions per kB of XML data. Conversely, the cost of
[1366]79branch mispredictions for the Expat parser can be over 7 cycles per
[1400]80XML byte (see Figure \ref{corei3_BM}) --- which exceeds
81the average latency of a byte processed by Parabix-XML.
[1400]87\subfigure[Branch Instructions / kB]{
[1400]92\subfigure[Branch Misses / kB]{
96\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
99\subsection{SIMD Instructions vs. Total Instructions}
[1400]101In Parabix-XML, the ratio of retired SIMD instructions to total
[1366]102instructions provides insight into the relative degree to which
[1400]103Parabix-XML achieves parallelism over the byte-at-a-time approach.
104Using the Intel Pin tool, we gathered the dynamic instruction mix for
105each XML workload and classified the instructions as either SIMD
106or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
107percentage of SIMD instructions in the Parabix-XML parser.
108The ratio of executed SIMD instructions over total instructions indicates
109the amount of available parallelism.
110The resulting instruction mix consists of 60\% to 80\% SIMD
111instructions. The markup density of the files influence the number of
[1366]112scalar instructions needed to handle the tag processing which affects
113the overall parallelism that can be extracted by Parabix.  We find
114that degradation rate is low and thus the performance
115penalty incurred by increasing the markup density is minimal.
[1335]116%Expat and Xerce do not use any SIMD instructions and were not
117%included in this portion of the study.
[1400]119% Parabix gains its performance by using parallel bit streams, which
[1366]120% are mostly generated and calculated by SIMD instructions.  We use Intel
[1335]121% pin, a dynamic binary instrumentation tool, to gather instruction
122% mix.  Then we adds up all the vector instructions that have been
123% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
[1366]124% show the percentage of SIMD instructions of Parabix1 and Parabix
[1335]125% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
126% 18\% to 40\% of the executed instructions consists of SIMD
127% instructions.  By using bistream addition for parallel scanning,
128% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
129% decrease as the markup density increase for both Parabix1 and
130% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
131% performance degradation caused by increasing markup density is
132% smaller.
134\subsection{CPU Cycles}
[1400]136Figure \ref{corei3_TOT} shows overall parser performance in
137terms of CPU cycles per kB. Parabix-XML  is 2.5
138to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster
[1302]139on data-oriented input.  Traditional parsers can be dramatically
[1400]140slowed by dense markup but Parabix-XML is relatively unaffected.
141Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
142processing it; this requires several cycles per byte. However,
[1335]143transcoding using parallel bit streams is significantly faster and
[1366]144requires less than a single cycle per byte.
[1381]149\caption{SIMD Instruction Percentage}
[1380]155\caption{Performance (CPU Cycles per kB)}
[1302]161\subsection{Power and Energy}
[1400]162In this section, we study the power and energy consumption of Parabix-XML
163in comparison with Expat and Xerces on \CITHREE{}.
164Figure \ref{corei3_power} shows the
165average power consumed by each parser. Parabix-XML, dominated by SIMD
166instructions, uses $\sim5\%$ additional power. While the
[1366]167SIMD functional units are significantly wider than the scalar
[1400]168counterparts, register width and functional unit power account only
[1366]169for a small fraction of the overall power consumption in a processor
170pipeline. More importantly by using data parallel operations Parabix
171amortizes the fetch and data access overheads. This results in minimal
[1380]172power increase compared to the conventional parsers.  Perhaps the
173energy trends shown in Figure \ref{corei3_energy} reveal an
174interesting trend. Parabix consumes substantially less energy than the
175other parsers. Parabix consumes 50 to 75 nJ per byte while Expat and
176Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
177Although Parabix requires slightly more power (per instruction), the
178processing time of Parabix is significantly lower.
[1335]186\subfigure[Avg. Power (Watts)]{
191\subfigure[Energy Consumption ($\mu$J per kB)]{
[1380]195\caption{Power profile of Parabix on \CITHREE{}}
Note: See TracBrowser for help on using the repository browser.