# Changeset 1335 for docs

Timestamp:
Aug 21, 2011, 4:20:30 PM
Message:

Working on evaluation. Fixed Figure sizes

Location:
docs/HPCA2012
Files:
10 edited

• ## docs/HPCA2012/04-methodology.tex

 r1302 \begin{table*} \begin{center} { \footnotesize \begin{tabular}{|l||l|l|l|l|l|} \hline Markup Density          & 0.07                  & 0.13                  & 0.57          & 0.76          & 0.87  \\ \hline \end{tabular} } \end{center} \caption{XML Document Characteristics} \label{XMLDocChars} \end{table*} \subsection{Workloads}\label{workloads} Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.  This metric has substantial influence on the performance of traditional recursive descent XML parser implementations.  We use a mixture of document-oriented and data-oriented XML files in our study to provide workloads with a full spectrum of markup densities. Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for this performance study.  The jawiki.xml and dewiki.xml files represent document-oriented XML inputs and contain the three-byte and four-byte UTF-8 sequences required for the UTF-8 encoding of Japanese and German characters respectively.  The remaining files are data-oriented XML documents and consist entirely of single-byte $7$-bit encoded ASCII characters. Table \ref{core2info} gives the hardware description of the Intel \CO{} machine. 
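As a minimal sketch of the metric just defined (assuming a simplified definition in which every byte from `<` through the closing `>` counts as markup, ignoring CDATA sections and comments that a full parser would handle), markup density can be computed as:

```python
def markup_density(xml_bytes: bytes) -> float:
    # Count every byte from '<' through '>' inclusive as markup.
    # Simplification: CDATA sections, comments and processing
    # instructions are not treated specially.
    in_tag = False
    markup = 0
    for b in xml_bytes:
        if b == ord('<'):
            in_tag = True
        if in_tag:
            markup += 1
        if b == ord('>'):
            in_tag = False
    return markup / len(xml_bytes)

# A short text run keeps the ratio high; document-oriented files with
# long character data runs score much lower.
print(markup_density(b'<a>xyz</a>'))  # 7 markup bytes of 10 -> 0.7
```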
\begin{table*}[h] \begin{center} \footnotesize \begin{tabular}{|l||l|l|l|} \hline Processor & Core2 Duo (2.13GHz) & i3-530 (2.93GHz) & Sandybridge (2.80GHz) \\ \hline L1 D Cache & 32KB & 32KB & 32KB \\ \hline L2 Cache & Shared 2MB & 256KB/core & 256KB/core \\ \hline L3 Cache & --- & 4MB  & 6MB \\ \hline Bus or QPI &  1066MHz Bus & 1333MHz QPI & 1333MHz QPI \\ \hline Memory  & 2GB & 4GB & 6GB\\ \hline Max TDP & 65W & 73W &  95W \\ \hline \end{tabular} \end{center} \caption{Platform Hardware Specs} \label{core2info} \end{table*} \paragraph{Intel \CITHREE{}} The Intel \CITHREE\ processor, code name Nehalem, is produced by Intel. This processor was selected to serve as an example of a low-end server processor. Table \ref{core2info} gives the hardware description of the Intel \CITHREE\ machine. \paragraph{Intel \CIFIVE{}} The Intel \CIFIVE\ processor, code name \SB{}, is produced by Intel. Table \ref{core2info} gives the hardware description of the Intel \CIFIVE\ machine. \subsection{PMC Hardware Events}\label{events} Each of the hardware events selected relates to performance and energy features associated with one or more hardware units.  For example, total branch mispredictions relate to the branch predictor and branch target buffer capacity. The set of PMC events used in this study is as follows: \begin{itemize} \item Processor Cycles \item Branch Instructions \item Branch Mispredictions \item Integer Instructions \item SIMD Instructions \item Cache Misses \end{itemize} \subsection{Energy Measurement}
• ## docs/HPCA2012/05-corei3.tex

 r1302 %some of the numbers are roughly calculated, needs to be recalculated for final version \subsection{Cache behavior} \CITHREE\ has a three-level cache hierarchy.  The approximate miss penalty for each cache level is 4, 11, and 36 cycles respectively.  Figures \ref{corei3_L1DM}, \ref{corei3_L2DM} and \ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of the parsers.  Although XML parsing is not a memory-intensive application, cache misses for the Expat and Xerces parsers represent a 0.5 cycle per XML byte cost, whereas the performance of the Parabix parsers remains essentially unaffected by data cache misses.  Cache misses not only consume additional CPU cycles but also increase application energy consumption.  L1, L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ respectively. As such, given a 1GB XML file as input, Expat and Xerces would consume over 0.6J and 0.9J respectively due to cache misses alone. 
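The energy arithmetic above can be checked with a short sketch. The per-miss energies come from the text; the miss rates passed in are hypothetical placeholders standing in for values read off the cache-miss figures:

```python
# Approximate energy per cache miss on the Core i3, from the text (nanojoules).
MISS_ENERGY_NJ = {"L1": 8.3, "L2": 19.0, "L3": 40.0}

def cache_miss_energy_joules(misses_per_kb: dict, input_bytes: int) -> float:
    # Total energy = sum over levels of (energy/miss * misses/kB * kB of input).
    kbs = input_bytes / 1024
    total_nj = sum(MISS_ENERGY_NJ[level] * rate * kbs
                   for level, rate in misses_per_kb.items())
    return total_nj * 1e-9  # nJ -> J

# Hypothetical Expat-like miss profile over a 1GB input; the estimate lands
# in the same range as the ~0.6J figure quoted in the text.
energy = cache_miss_energy_joules({"L1": 40, "L2": 10, "L3": 5}, 1 << 30)
```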
\begin{figure} \begin{center} \subfigure[L1 Misses]{ \includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf} \label{corei3_L1DM} } \subfigure[L2 Misses]{ \includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf} \label{corei3_L2DM} } \subfigure[L3 Misses]{ \includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf} \label{corei3_L3TM} } \end{center} \caption{\CITHREE\ --- Cache Misses per kB of input data.} \end{figure} \subsection{Branch Mispredictions} Despite improvements in branch prediction, branch misprediction penalties contribute significantly to XML parsing performance. On modern commodity processors the cost of a single branch misprediction is commonly cited as over 10 CPU cycles.  As shown in Figure \ref{corei3_BM}, the cost of branch mispredictions for the Expat parser can be over 7 cycles per XML byte---this cost alone is equal to the average total cost for Parabix2 to process each byte of XML. 
In general, reducing the branch misprediction rate is difficult in text-based XML parsing applications. This is due in part to the variable-length nature of the syntactic elements contained within XML documents, a data-dependent characteristic, as well as the extensive set of syntax constraints imposed by the XML 1.0 specification. As such, traditional byte-at-a-time XML parsers generate a performance-limiting number of branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces averages up to 13 branches per XML byte processed on high-density markup. The performance improvement of Parabix1 in terms of branch mispredictions results from the near elimination of conditional branch instructions in scanning. Leveraging the processor's built-in {\em bit scan} operation together with parallel bit stream technology, Parabix1 can scan up to 64 bytes of source XML with a single {\em bit scan} instruction. In comparison, a byte-at-a-time parser must process a conditional branch instruction per XML byte scanned. 
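The bit-scan idea can be illustrated with a small sketch, using a Python integer to stand in for a 64-bit block register and `<` as an illustrative marker class:

```python
def scan_positions(stream: int) -> list:
    # 'stream' has bit i set where byte i of a block is a marker byte.
    # Each iteration consumes one marker via the software analogue of a
    # hardware bit-scan (isolate lowest set bit, take its index), rather
    # than testing every byte with a conditional branch.
    positions = []
    while stream:
        positions.append((stream & -stream).bit_length() - 1)
        stream &= stream - 1  # clear the lowest set bit
    return positions

# Build a marker bit stream for '<' over a tiny source block.
src = b'a<b>cd<e>'
stream = 0
for i, byte in enumerate(src):
    if byte == ord('<'):
        stream |= 1 << i
print(scan_positions(stream))  # positions of '<' -> [1, 6]
```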
As shown in Figure \ref{corei3_BR}, Parabix2 processing is almost branch free. Utilizing a new parallel scanning technique based on bit stream addition, Parabix2 exhibits minimal dependence on source XML markup density. Figure \ref{corei3_BR} displays this lack of data dependence via the constant number of branch mispredictions shown for each of the source XML files. \begin{figure} \begin{center} \subfigure[Branch Instructions]{ \includegraphics[width=0.45\textwidth]{plots/corei3_BR.pdf} \label{corei3_BR} } \hfill \subfigure[Branch Misses]{ \includegraphics[width=0.42\textwidth]{plots/corei3_BM.pdf} \label{corei3_BM} } \end{center} \caption{Branch characteristics on the \CITHREE\ per kB of input data.} \end{figure} \subsection{SIMD Instructions vs. Total Instructions} Parabix achieves its performance via parallel bit stream technology. In Parabix XML processing, parallel bit streams are both computed and predominantly operated upon using the SIMD instructions of commodity processors.  
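Parabix2's parallel scanning by bit stream addition can be sketched as follows; this is a Python model, and the `scan_thru` name and the tiny example streams are illustrative rather than taken from the Parabix2 source:

```python
MASK64 = (1 << 64) - 1  # model a single 64-bit block register

def scan_thru(markers: int, run_class: int) -> int:
    # Adding a marker bit into a contiguous run of set 'run_class' bits
    # carries it through the run; masking the run off leaves each marker
    # on the first position past its run.  All markers advance in the
    # same addition -- no per-byte conditional branches.
    return ((markers + run_class) & ~run_class) & MASK64

# Two runs of "name" bytes at positions 0-1 and 4-5, with a marker at the
# start of each run; both markers land just past their runs (bits 2 and 6).
result = scan_thru(0b010001, 0b110011)
```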
The ratio of retired SIMD instructions to total instructions provides insight into the relative degree to which Parabix achieves parallelism over the byte-at-a-time approach. Using the Intel Pin tool, we gather the dynamic instruction mix for each XML workload, and classify instructions as either vector (SIMD) or non-vector instructions.  Figures \ref{corei3_INS_p1} and \ref{corei3_INS_p2} show the percentage of SIMD instructions for Parabix1 and Parabix2 respectively. For Parabix1, 18\% to 40\% of the executed instructions are SIMD instructions.  By using bit stream addition for parallel scanning, Parabix2 uses 60\% to 80\% SIMD instructions. Although the ratio decreases as the markup density increases for both parsers, the rate of decrease for Parabix2 is much lower, and thus the performance penalty incurred by increasing the markup density is reduced. %Expat and Xerces do not use any SIMD instructions and were not included in this portion of the study.
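The Pin-based measurement amounts to classifying each retired instruction in the dynamic trace; a toy version of that classification (the mnemonic set is a small illustrative subset of the SSE opcodes a real tool would recognize from full opcode tables):

```python
# A few real SSE mnemonics; a production classifier would consult
# complete opcode tables rather than a hand-picked set.
SIMD_OPS = {"pand", "por", "pxor", "paddb", "pcmpeqb", "movdqa", "pmovmskb"}

def simd_ratio(trace: list) -> float:
    # Fraction of a dynamic instruction trace that is vector (SIMD) work.
    simd = sum(1 for mnemonic in trace if mnemonic in SIMD_OPS)
    return simd / len(trace)

trace = ["movdqa", "pand", "pcmpeqb", "pmovmskb", "add", "jnz"]
ratio = simd_ratio(trace)  # 4 of the 6 instructions are SIMD
```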
\subsection{CPU Cycles} Figure \ref{corei3_TOT} shows overall parser performance evaluated in terms of CPU cycles per kilobyte.  Parabix1 is 1.5 to 2.5 times faster on document-oriented input and 2 to 3 times faster on data-oriented input than the Expat and Xerces parsers respectively.  
Parabix2 is 2.5 to 4 times faster on document-oriented input and 4.5 to 7 times faster on data-oriented input.  Traditional parsers can be dramatically slowed by dense markup, while Parabix2 is generally unaffected.  The results presented are not entirely fair to the Xerces parser since it first transcodes input from UTF-8 to UTF-16 before processing. In Xerces, this transcoding requires several cycles per byte.  However, transcoding using parallel bit streams is significantly faster and requires less than a single cycle per byte \cite{Cameron2008}. \begin{figure} \begin{center} \includegraphics[width=0.5\textwidth]{plots/corei3_INS_p1.pdf} \end{center} \caption{Parabix1 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)} \label{corei3_INS_p1} \end{figure} \begin{figure} \begin{center} \subfigure[Performance (\# Cycles per kB)]{ \includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf} \label{corei3_TOT} } \hfill \subfigure[SIMD Instruction Breakdown (\% SIMD Instructions per kB)]{ \includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf} \label{corei3_INS_p2} } \end{center} \caption{\CITHREE\ --- Performance and SIMD instruction breakdown.} \end{figure} \subsection{Power and Energy} In response to growing industry concern over power consumption and energy efficiency, chip producers strive not only to improve performance but also to achieve high energy efficiency in processor design. We study the power and energy consumption of Parabix in comparison with Expat and Xerces on the \CITHREE{}. The average power of the \CITHREE\ 530 is about 21 watts; this Intel model has a good reputation for power efficiency. Figure \ref{corei3_power} shows the average power consumed by each parser.  Parabix2, dominated by SIMD instructions, uses approximately 5\% additional power. \begin{figure} \begin{center} \subfigure[Avg. Power (Watts)]{ \includegraphics[width=0.4\textwidth]{plots/corei3_power.pdf} \label{corei3_power} } \hfill \subfigure[Energy Consumption ($\mu$J per kB)]{ \includegraphics[width=0.4\textwidth]{plots/corei3_energy.pdf} \label{corei3_energy} } \end{center} \caption{\CITHREE\ --- Power and energy consumption.} \end{figure} As shown in Figure \ref{corei3_energy}, a comparison of energy efficiency demonstrates a more interesting result. Although Parabix2 requires slightly more power (per instruction), its processing time is significantly lower, and therefore Parabix2 consumes substantially less energy than the other parsers. Parabix2 consumes 50 to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
• ## docs/HPCA2012/06-scalability.tex

 r1302 \section{Scalability} \subsection{Performance} Figure \ref{Scalability} (a) demonstrates the average XML well-formedness checking performance of Parabix2 for each of the workloads as executed on each of the processor cores --- \CO{}, \CITHREE\ and \SB{}.  Processing time is shown in terms of bit stream based operations executed in ``bit-space'' and postprocessing operations executed in ``byte-space''.  In the Parabix2 parser, bit-space parallel bit stream parser operations consist primarily of SIMD instructions; byte-space operations consist of byte comparisons across arrays of values. 
Executing Parabix2 on \CITHREE{} rather than \CO\ results in an average performance improvement of 17\% in bit stream processing, whereas migrating Parabix2 from \CITHREE{} to \SB{} results in a 22\% average performance gain. Bit-space measurements are stable and consistent across each of the source inputs and cores. Postprocessing operations demonstrate data-dependent variance. Performance gains of 18\% to 31\% are observed in migrating Parabix2 from \CO\ to \CITHREE{}, and of 0\% to 17\% from \CITHREE\ to \SB{}. For the purpose of comparison, Figure \ref{Scalability} (b) shows the performance of the Expat parser on each of the processor cores.  A performance improvement of less than 5\% is observed when executing Expat on \CITHREE\ over \CO{}, and less than 10\% on \SB\ over \CITHREE{}. Overall, Parabix2 scales better than Expat. Simply executing identical Parabix2 object code on \SB\ results in an overall performance improvement of up to 26\%. Additional performance aspects of Parabix2 on \SB\ with AVX instructions are discussed in the following sections. \subsection{Power and Energy} Figure \ref{power_Parabix2} shows the average power consumption of Parabix2 over each workload as executed on each of the processor cores --- \CO{}, \CITHREE\ and \SB{}.  Average power consumption on \CO{} is 32 watts. 
Execution on \CITHREE\ results in a 30\% power saving over \CO{}.  \SB\ saves a further 25\% of the power compared with \CITHREE\ and consumes only 15 watts. In XML parsing we observe that energy consumption is dependent on processing time; that is, a reduction in processing time results in a directly proportional reduction in energy consumption. With newer processor cores come improvements in application performance. As a result, Parabix2 executed on \SB\ consumes 72\% to 75\% less energy than Parabix2 on \CO{}. \begin{figure} \begin{center} \subfigure[Avg. Power of Parabix2 on various hardware (Watts)]{ \includegraphics[width=85mm]{plots/power_Parabix2.pdf} \label{power_Parabix2} } \hfill \subfigure[Avg. Energy Consumption on various hardware (nJ per kB)]{ \includegraphics[width=85mm]{plots/energy_Parabix2.pdf} \label{energy_Parabix2} } \end{center} \caption{Power and energy consumption of Parabix2.} \end{figure}
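The proportionality between processing time and energy follows directly from $E = P \cdot t$; a short sketch (the power, cycle count and clock rate below are illustrative values, not measurements from the study):

```python
def energy_nj_per_kb(avg_power_watts: float, cycles_per_kb: float,
                     freq_hz: float) -> float:
    # E = P * t, with processing time t = cycles / clock frequency.
    seconds_per_kb = cycles_per_kb / freq_hz
    return avg_power_watts * seconds_per_kb * 1e9  # J -> nJ

# Illustrative: a 15W core at 2.8GHz spending 6000 cycles per kB of input.
e = energy_nj_per_kb(15.0, 6000.0, 2.8e9)
```

Halving the cycle count at fixed power halves the energy, which is why the performance gains on newer cores translate directly into the 72% to 75% energy reduction reported above.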
• ## docs/HPCA2012/07-avx.tex

 r1302 \subsection{256-bit AVX Operations} With the introduction of 256-bit SIMD registers, and under ideal conditions, one would anticipate a corresponding 50\% reduction in the SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX implementation, Intel has focused primarily on floating point operations as opposed to integer operations.  256-bit SIMD is available for loads, stores, bitwise logic and floating point operations, whereas SIMD integer operations and shifts are only available in 128-bit form.  Nevertheless, with loads, stores and bitwise logic comprising a major portion of the Parabix2 SIMD instruction mix, a substantial reduction in instruction count and a consequent performance improvement were anticipated but not achieved. \subsection{Performance Results} Note that the number of non-SIMD instructions remains relatively constant across workloads.  As may be expected, however, the number of ``bitwise SIMD'' operations remains the same for both SSE and 128-bit AVX while dropping dramatically when operating 256 bits at a time.   Ideally one may expect up to a 50\% reduction in these instructions versus the 128-bit AVX version.  
The actual reduction measured was 32\%--39\% depending on workload.   Because some bitwise logic is needed in the implementation of simulated 256-bit operations, the full 50\% reduction in bitwise logic was not achieved. The other ``SIMD'' class shows a substantial 30\%--35\% reduction as well. While the successive reductions in SIMD instruction counts are quite dramatic with the two AVX implementations of Parabix2, the performance benefits are another story.   As shown in Figure \ref{avx}, the benefits of the reduced SIMD instruction count are achieved only in the AVX 128-bit version.  In this case, the benefits of the 3-operand form seem to translate fully into performance benefits. Based on the reduction in overall bitwise-SIMD instructions we expected an 11\% improvement in performance. Instead, perhaps bizarrely, the performance of Parabix2 in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density (by 10\% on average). Dewiki.xml, on which bitwise-SIMD instructions were reduced by 39\%, saw a performance improvement of only 8\%. We believe that this is primarily due to the intricacies of the first-generation AVX implementation in \SB{}, with significant latency in many of the 256-bit instructions in comparison to their 128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput.  If these latency issues can be addressed in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix2.
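The simulated 256-bit operations mentioned above illustrate why some bitwise logic cannot be eliminated: an integer operation that crosses the 128-bit lane boundary needs explicit carry handling. A sketch in Python (the two-lane split mirrors the AVX situation; the function name is ours):

```python
M128 = (1 << 128) - 1  # mask for one 128-bit lane

def shift256_left1(lo: int, hi: int) -> tuple:
    # A 256-bit left shift built from 128-bit pieces: the top bit of the
    # low lane must be carried into the high lane with extra bitwise
    # logic, which is why a simulated 256-bit integer operation costs
    # more than a single native 256-bit instruction would.
    carry = lo >> 127
    return (lo << 1) & M128, ((hi << 1) & M128) | carry

# The high bit of the low lane crosses into the high lane.
lo, hi = shift256_left1(1 << 127, 0)
```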
• ## docs/HPCA2012/08-arm.tex

 r1302 \def\CORTEXA8{Cortex-A8} \section{Parabix on Mobile Platforms} The Samsung Galaxy Tab GT-P1000M device houses a Samsung S5PC110 ARM \CORTEXA8{} 1GHz single-core, dual-issue, superscalar microprocessor.  It includes a 32kB L1 data cache and a 512kB shared L2 cache.  In addition to the standard feature set found in such low-power 32-bit microprocessors, the S5PC110 includes the ARM NEON general-purpose SIMD engine.  ARM NEON makes available a 128-bit SIMD instruction set similar in functionality to the Intel SSE3 instruction set.  In this section, we present our performance comparison of a NEON-based port of Parabix2 versus the Expat parser, executed on the Samsung Galaxy Tab GT-P1000M hardware.  Xerces is excluded from this portion of our study due to the complexity of the cross-platform build process in porting native C/C++ applications to the Android platform. \subsection{Platform Hardware} %\paragraph{GT-P1000M} The Samsung Galaxy Tab GT-P1000M incorporates the ARM \CORTEXA8{} microprocessor.  Table \ref{arminfo} describes the Samsung Galaxy Tab GT-P1000M hardware. 
\begin{table}[h] \begin{center} \begin{tabular}{|l||l|} \hline Processor & ARM \CORTEXA8{} (1GHz) \\ \hline L1 Cache & 32kB I-Cache, 32kB D-Cache \\ \hline L2 Cache & 512kB \\ \hline Flash & 16GB \\ \hline \end{tabular} \end{center} \caption{GT-P1000M} \label{arminfo} \end{table} \subsection{Performance Results} \begin{figure} \begin{center} \subfigure[ARM Neon Performance]{ \includegraphics[width=0.5\textwidth]{plots/arm_TOT.pdf} \label{arm_processing_time} } \hfill \subfigure[Performance ARM Neon vs. Core i3 SSE]{ \includegraphics[width=0.5\textwidth]{plots/RelativePerformanceARMvsCoreI3.pdf} \label{relative_performance_arm_vs_i3} } \end{center} \caption{Parabix2 Performance on GT-P1000M (y-axis: CPU cycles per kB)} \end{figure} Migration of Parabix2 to the Android platform began with the retargeting of a subset of the Parabix2 IDISA SIMD library for ARM NEON.  This library code was cross-compiled for Android using the Android NDK, a companion tool to the Android SDK that allows developers to build performance-critical portions of applications in native code.  The majority of the Parabix2 SIMD functionality ported directly.  However, for a small subset of the SIMD functions of Parabix2, NEON equivalents did not exist.  In such cases we simply simulated logical equivalences using the available instruction set. A comparison of Figure \ref{arm_processing_time} and Figure \ref{corei3_TOT} demonstrates that the performance of both Parabix2 and Expat degrades substantially on the \CORTEXA8{}.  This result was expected given the comparably performance-limited \CORTEXA8{} hardware architecture.  Surprisingly, on the \CORTEXA8{} Expat outperforms Parabix2 on each of the lower markup density workloads, dew.xml and jaw.xml.  On the remaining higher-density workloads, Parabix2 performs only moderately better than Expat.  The higher latency of the NEON instructions on the \CORTEXA8{} is the likely contributor to this loss in performance. 
A more interesting aspect of this result is demonstrated in Figure \ref{relative_performance_arm_vs_i3}, which shows that the relative performance of each parser degrades in a relatively constant manner.  That is, compared to the \CITHREE{}, on the GT-P1000M Parabix2 and Expat operate at approximately 17.2\% and 55.7\% efficiency respectively.  Figure \ref{relative_performance_arm_vs_i3} shows that the baseline cost of Parabix2 operations implemented using the NEON instruction set---and thereby the baseline cost of Parabix2---is substantially higher on the \CORTEXA8{} processor.  Given that Parabix2 was not designed with the limitations of the \CORTEXA8{} in mind, a careful analysis of the cost of each instruction provided in the ARMv7 ISA may allow us to better utilize the hardware resources in the future.  In particular, future performance enhancements to ARM NEON could result in substantial overall improvement in Parabix2 execution time. 
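The logical-equivalence simulation mentioned above can be sketched as follows (a hedged illustration of the general technique, not Parabix code; `bitwise_select` is our hypothetical example of an operation a target ISA may lack): a bitwise select is composed from the universally available AND, OR, and NOT operations.

```c
#include <stdint.h>

/* Illustrative scalar model of one SIMD lane.  Where a target ISA
   lacks a native bitwise-select instruction, the same result can be
   composed from logic every target provides: for each bit position,
   take the bit of a where mask is 1, and the bit of b where mask is 0. */
static uint64_t bitwise_select(uint64_t mask, uint64_t a, uint64_t b) {
    return (a & mask) | (b & ~mask);
}
```

Such compositions cost several instructions where a native equivalent would cost one, which is one source of the higher baseline cost of simulated operations.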
• ## docs/HPCA2012/09-pipeline.tex

 r1331 \begin{table*}[t] { \centering \footnotesize \begin{tabular}{|c|@{~}c@{~}|c|@{~}c@{~}|c@{~}|@{~}c@{~}|c|@{~}c@{~}|c|@{~}c@{~}|c|@{~}c@{~}|} \hline & & \multicolumn{10}{|c|}{Data Structures}\\ \hline \caption{Relationship between Each Pass and Data Structures} \label{pass_structure} } \end{table*} The multi-threaded Parabix is more than two times faster and runs at 2.7 cycles per input byte on the \SB{} machine. Figure \ref{power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version. \begin{figure} \begin{center} \subfigure[Performance (Cycles / Byte)]{ \includegraphics[width=0.32\textwidth]{plots/performance.pdf} \label{performance} } \subfigure[Avg. Power Consumption]{ \includegraphics[width=0.32\textwidth]{plots/power.pdf} \label{power} } \subfigure[Avg. Energy Consumption (nJ / Byte)]{ \includegraphics[width=0.32\textwidth]{plots/energy.pdf} \label{energy} } \end{center} \caption{Multithreaded Parabix} \label{multithread_perf} \end{figure}
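The thread-pipelined organization described here can be sketched in simplified form (our own model, not the Parabix implementation; `stage1`, `stage2`, and the single-slot mailbox are illustrative stand-ins): one thread runs an early group of passes over each input chunk and hands its result through a synchronized slot to a thread running a later group, so the two groups of passes overlap in time.

```c
#include <pthread.h>

/* Simplified two-stage pipeline: stage1 and stage2 stand in for
   groups of parser passes; the single-slot mailbox stands in for
   the shared data structures handed between thread groups. */
#define CHUNKS 8

static int slot;                 /* mailbox holding one stage-1 result */
static int slot_full = 0;
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static int stage1(int chunk) { return chunk * 2; }  /* early passes */
static int stage2(int v)     { return v + 1; }      /* later passes */

static void *producer(void *arg) {
    (void)arg;
    for (int c = 0; c < CHUNKS; c++) {
        int v = stage1(c);
        pthread_mutex_lock(&m);
        while (slot_full) pthread_cond_wait(&cv, &m);
        slot = v;
        slot_full = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

/* Runs stage 2 on every chunk as it arrives; returns the sum of results. */
static int run_pipeline(void) {
    pthread_t t;
    int sum = 0;
    pthread_create(&t, NULL, producer, NULL);
    for (int c = 0; c < CHUNKS; c++) {
        pthread_mutex_lock(&m);
        while (!slot_full) pthread_cond_wait(&cv, &m);
        int v = slot;
        slot_full = 0;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        sum += stage2(v);
    }
    pthread_join(t, NULL);
    return sum;
}
```

With balanced stages, the slower stage bounds throughput, which is consistent with the observed speedup of roughly two for a two-thread split.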
• ## docs/HPCA2012/latex/iccv.sty

 r1327 \newpage \null \vskip .375in %  \vskip .375in \begin{center} {\Large \bf \@title \par}
• ## docs/HPCA2012/main.tex

 r1331 \input{10-conclusions.tex} % tighten spacing: \let\oldthebibliography\thebibliography \def\thebibliography#1{\oldthebibliography{#1}\parsep-5pt\itemsep0pt} % \vspace{-\baselineskip} { \setstretch{1} \footnotesize % \scriptsize \bibliographystyle{abbrv} \bibliography{reference}
• ## docs/HPCA2012/preamble-submit.tex

 r1326 % \iccvfinalcopy % *** Uncomment this line for the final submission \marginparsep 0in \marginparwidth 0in \topmargin -0.4in \topmargin -0.2in %\headheight 0in %\headsep 0in %\footskip 0.3in \textheight 9.5in \textheight 9.2in %\textfloatsep 0.1in %\floatsep 0.1in