# Changeset 1400

Timestamp:
Aug 30, 2011, 6:54:45 PM
Message:

edits

File:
1 edited

\section{Efficiency of the Parabix-XML Parser}
\label{section:baseline}

In this section we analyze the energy and performance characteristics of the Parabix-XML parser against the software XML parsers Xerces and Expat. For our baseline evaluation, we compare all of the XML parsers on a fixed platform, the \CITHREE{}.

%some of the numbers are roughly calculated, needs to be recalculated for final version

\subsection{Cache behavior}

The approximate miss penalty on the \CITHREE\ for the L1, L2, and L3 caches is 4, 11, and 36 cycles respectively. The L1 (32KB) and L2 (256KB) caches are private per core; the L3 (4MB) is shared by all the cores. Figure \ref{cache_misses} shows the cache misses per kilobyte of input data. Analytically, the cache misses of the Expat and Xerces parsers represent an overhead of about 0.5 cycles per XML byte processed. This overhead is not necessarily reflected in the overall performance of these parsers, as they experience other overheads related to branch mispredictions.
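As an illustrative sketch (the notation $m_{L1}$, $m_{L2}$, $m_{L3}$ for the number of misses per XML byte at each cache level is introduced here and is not a measured quantity), the per-byte cycle and energy overheads combine the measured per-miss penalties (4, 11, and 36 cycles) and per-miss energies ($\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ) as
\[
C_{\mathrm{miss}} \approx 4\,m_{L1} + 11\,m_{L2} + 36\,m_{L3}\ \text{cycles/byte},
\qquad
E_{\mathrm{miss}} \approx \left(8.3\,m_{L1} + 19\,m_{L2} + 40\,m_{L3}\right)\,\text{nJ/byte}.
\]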
Compared to Xerces and Expat, the data organization of Parabix-XML significantly reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache utilization helps keep the SIMD units busy by minimizing memory-related stalls, and lowers the overall energy consumption by reducing the need to access the higher levels of the cache hierarchy. Using microbenchmarks, we estimated that L1, L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ respectively. On average, with a 1GB XML file, Expat and Xerces would consume over 0.6J and 0.9J respectively due to cache misses alone.

\subsection{Branch Mispredictions}
\label{section:XML-branches}

In general, parser performance is limited by branch mispredictions. Unfortunately, it is difficult to reduce the branch misprediction rate of traditional XML parsers due to: (1) the variable-length nature of the syntactic elements contained within XML documents; (2) the data-dependent character of the processing; and (3) the extensive set of syntax constraints imposed by the XML 1.0/1.1 specifications.
% Branch mispredictions are known
% to significantly degrade XML parsing performance in proportion to the
% markup density of the source document \cite{CameronHerdyLin2008}.
As shown in Figure \ref{corei3_BR}, Xerces averages up to 13 branches per XML byte processed on high-density markup. On modern commodity processors, a single branch misprediction incurs a penalty of tens of CPU cycles to restart the processor pipeline, so the high misprediction rate of conventional parsers is a significant overhead. In Parabix-XML, the use of SIMD operations eliminates many branches. Most conditional branches can be replaced either with bitwise operations, which can process up to 128 characters' worth of branches with one operation, or with a series of logical predicate operations, which are cheap to compute since they require only SIMD operations. As shown in Figure \ref{corei3_BR}, Parabix-XML is nearly branch-free and exhibits minimal dependence on the source markup density. Specifically, it experiences between 19.5 and 30.7 branch mispredictions per kB of XML data.
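The replacement of per-character conditional branches with bit scans can be sketched in C (a hypothetical illustration rather than the Parabix source; `next_marker` is a name introduced here, and `__builtin_ctzll` assumes a GCC/Clang toolchain):

```c
#include <stdint.h>

/* Sketch: each set bit in a 64-bit marker stream flags a syntactically
   significant byte (e.g. '<') within a 64-byte block of input.  One bit
   scan per marker replaces up to 64 per-byte conditional branches. */
static int next_marker(uint64_t *stream)
{
    if (*stream == 0)
        return -1;                       /* no markers left in this block */
    int pos = __builtin_ctzll(*stream);  /* bit scan: index of lowest set bit */
    *stream &= *stream - 1;              /* clear that bit for the next call */
    return pos;
}
```

A parser loop built this way jumps directly from one marker position to the next instead of testing every input byte.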
Conversely, the cost of branch mispredictions for the Expat parser can be over 7 cycles per XML byte (see Figure \ref{corei3_BM}) --- which exceeds the average latency of a byte processed by Parabix-XML.

\begin{figure}
\subfigure[Branch Instructions / kB]{
\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
\label{corei3_BR}
}
\hfill
\subfigure[Branch Misses / kB]{
\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
\label{corei3_BM}
}
\end{figure}

\subsection{SIMD Instructions vs. Total Instructions}

In Parabix-XML, bit streams are both computed and predominantly operated upon using the SIMD instructions of commodity processors. The ratio of retired SIMD instructions to total instructions provides insight into the relative degree to which Parabix-XML achieves parallelism over the byte-at-a-time approach. Using the Intel Pin tool, we gathered the dynamic instruction mix for each XML workload and classified the instructions as either SIMD or non-SIMD. Figure~\ref{corei3_INS_p2} shows the percentage of SIMD instructions in the Parabix-XML parser.
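How a character-class bit stream is obtained with SIMD instructions can be sketched with SSE2 intrinsics (an illustrative fragment assuming an x86 target; `class_bits` is a name introduced here, not a Parabix function):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sketch: build a 16-bit character-class bit stream for one 16-byte block.
   Bit i of the result is set iff p[i] == ch; a single SIMD compare plus
   movemask replaces sixteen byte-at-a-time comparisons and branches. */
static uint16_t class_bits(const char *p, char ch)
{
    __m128i block = _mm_loadu_si128((const __m128i *)p);
    __m128i hits  = _mm_cmpeq_epi8(block, _mm_set1_epi8(ch));
    return (uint16_t)_mm_movemask_epi8(hits);
}
```

The resulting bit streams are then combined with bitwise logic to form the marker streams that drive the parser.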
The ratio of executed SIMD instructions to total instructions indicates the amount of available parallelism. The resulting instruction mix consists of 60\% to 80\% SIMD instructions. The markup density of the files influences the number of scalar instructions needed to handle tag processing, which in turn affects the overall parallelism that Parabix-XML can extract.
% Parabix gains its performance by using parallel bit streams, which
% are mostly generated and calculated by SIMD instructions.  We use Intel
% Pin, a dynamic binary instrumentation tool, to gather the instruction mix.

\subsection{CPU Cycles}

Figure \ref{corei3_TOT} shows overall parser performance in terms of CPU cycles per kB. Parabix-XML is 2.5 to 4$\times$ faster on document-oriented input and 4.5 to 7$\times$ faster on data-oriented input. Traditional parsers can be dramatically slowed by dense markup, but Parabix-XML is relatively unaffected. Unlike Parabix-XML and Expat, Xerces transcodes its input from UTF-8 to UTF-16 before processing; this transcoding requires several cycles per byte. However, transcoding using parallel bit streams is significantly faster and requires less than a single cycle per byte.

\subsection{Power and Energy}

In this section, we study the power and energy consumption of Parabix-XML in comparison with Expat and Xerces on \CITHREE{}. The average power of \CITHREE\ is about 21 watts.
Figure \ref{corei3_power} shows the average power consumed by each parser. Parabix-XML, dominated by SIMD instructions, uses $\sim5\%$ additional power. While the SIMD functional units are significantly wider than their scalar counterparts, register width and functional unit power account for only a small fraction of the overall power consumption in a processor pipeline. More importantly, by using data-parallel operations Parabix