# Changeset 1366 for docs/HPCA2012

Timestamp:
Aug 24, 2011, 3:28:57 PM
Message:

Finished Section 5

File:
1 edited

\section{Efficiency of Parabix} \label{section:baseline}

In this section we analyze the energy and performance characteristics of the Parabix-based XML parser against the software XML parsers Expat and Xerces. For our baseline evaluation, we compare all of the XML parsers on a fixed platform, the \CITHREE{}.

\subsection{Cache Behavior}

The approximate miss penalty in the \CITHREE\ for the L1, L2, and L3 caches is 4, 11, and 36 cycles respectively. The L1 (32KB) and L2 (256KB) caches are private per core, while the 4MB L3 cache is shared by all cores. Figure \ref{cache_misses} shows the cache misses per kilobyte of input data. Although XML parsing is not a memory-intensive application, the cache misses for the Expat and Xerces parsers analytically represent a cost of 0.5 cycles per XML byte processed.
This overhead is not necessarily reflected in the overall performance of these parsers, as they experience other overheads related to branch mispredictions. Parabix's data reorganization significantly improves the overall cache miss rate: we experience ?$\times$ fewer misses at the L1 level and ?$\times$ fewer misses at the L2 level. The improved cache utilization keeps the SIMD units busy and prevents memory-related stalls. Note that cache misses also increase application energy consumption, owing to the additional energy required to access higher levels of the cache hierarchy. Using microbenchmarks, we estimated that L1, L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ respectively. For a 1GB XML file, Expat and Xerces would consume over 0.6J and 0.9J respectively due to cache misses alone.
} \caption{Cache Misses per kB of input data.} \label{cache_misses} \end{figure}

\subsection{Branch Mispredictions}

Despite continual improvements in branch prediction, branch misprediction penalties contribute significantly to XML parsing performance. In general, reducing the branch misprediction rate is difficult in text-based XML parsing applications. This is due in part to (1) the variable-length nature of the syntactic elements contained within XML documents, (2) the data-dependent characteristics of the input, and (3) the extensive set of syntax constraints imposed by the XML 1.0 specification. As a result, the performance of traditional byte-at-a-time XML parsers is limited by the number of branch mispredictions they generate.
As shown in Figure \ref{corei3_BR}, Xerces averages up to 13 branches per XML byte processed on high-density markup. On modern commodity processors, a single branch misprediction commonly incurs over ten CPU cycles to restart the processor pipeline, so the high misprediction rate of conventional parsers adds significant overhead. In Parabix, the transformation to SIMD operations eliminates many branches. Further optimizations take advantage of Parabix's data organization and replace conditional branches with {\em bit scan} operations that can process up to 64 characters' worth of branches in a single operation. In many cases, we also replace branches with logical predicate operations; these predicates are cheaper to compute since they involve only bit-parallel SIMD operations. As shown in Figure \ref{corei3_BR}, Parabix processing is almost branch-free; it exhibits minimal dependence on source XML markup density and experiences a near-constant number of branch mispredictions irrespective of the input. By contrast, the cost of branch mispredictions for the Expat parser can be over 7 cycles per XML byte (see Figure \ref{corei3_BM}); this cost alone is higher than the average latency of a byte processed by Parabix.
Leveraging the processor's built-in {\em bit scan} instruction together with parallel bit stream technology, Parabix can scan up to 64 bytes of source XML with a single {\em bit scan} instruction; in comparison, a byte-at-a-time parser must execute a conditional branch instruction for every XML byte scanned.

\subsection{SIMD Instructions vs. Total Instructions}

Parabix achieves its performance via parallel bit stream technology. In Parabix, bit streams are both computed and predominantly operated upon using the SIMD instructions of commodity processors. The ratio of retired SIMD instructions to total instructions provides insight into the relative degree to which Parabix achieves parallelism over the byte-at-a-time approach. Using the Intel Pin tool, we gather the dynamic instruction mix for each XML workload and classify instructions as either vector (SIMD) or non-vector instructions.
Figure~\ref{corei3_INS_p2} shows the percentage of SIMD instructions for Parabix; the ratio of executed SIMD instructions to total instructions indicates the amount of parallel processing we were able to extract. (Expat and Xerces do not use any SIMD instructions and were not included in this portion of the study.) The Parabix instruction mix is made up of 60\% to 80\% SIMD instructions. The markup density of a file influences the number of scalar instructions needed for tag processing, which in turn affects the overall parallelism that Parabix can extract. We find that the degradation rate is low, and thus the performance penalty incurred by increasing markup density is minimal.
Figure \ref{corei3_TOT} shows overall parser performance evaluated in terms of CPU cycles per kilobyte. Parabix is 2.5$\times$ to 4$\times$ faster on document-oriented input and 4.5$\times$ to 7$\times$ faster on data-oriented input than the Expat and Xerces parsers. Traditional parsers can be dramatically slowed by dense markup, while Parabix is generally unaffected. In Xerces, this transcoding requires several cycles per byte; transcoding using parallel bit streams is significantly faster and requires less than a single cycle per byte~\cite{Cameron2008}.

\subsection{Power and Energy}

In response to growing industry concern over power consumption and energy efficiency, chip producers work to improve not only performance but also the energy efficiency of their processor designs. In this section, we study the power and energy consumption of Parabix in comparison with Expat and Xerces on the \CITHREE{}. The average power of the \CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the average power consumed by each parser. Parabix, dominated by SIMD instructions, uses approximately 5\% additional power.
While the SIMD functional units are significantly wider than their scalar counterparts, register width and functional unit power account for only a small fraction of the overall power consumption of a processor pipeline. More importantly, by using data-parallel operations Parabix amortizes instruction fetch and data access overheads, which results in a minimal power increase relative to the conventional parsers. The energy trends shown in Figure \ref{corei3_energy} reveal a more interesting result: although Parabix requires slightly more power (per instruction), its processing time is significantly lower, and it therefore consumes substantially less energy than the other parsers. Parabix consumes 50 to 75 nJ per byte, while Expat and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively. \end{figure}