# Changeset 1365

Timestamp:
Aug 24, 2011, 1:55:27 PM (8 years ago)
Message:

Fixed methodology

Location:
docs/HPCA2012
Files:
4 edited

 r1362 \section{Methodology} \section{Evaluation Framework} \label{section:methodology} In this section we describe our methodology for the measurements and investigation of XML parser energy consumption and performance.  In brief, for each of the four XML parsers under study we propose to measure and evaluate the energy consumption required to carry out XML well-formedness checking, under a variety of workloads, and as executed on three different Intel processors. \paragraph{XML Parsers}\label{parsers} To begin our study we propose to first investigate each of the XML parsers in terms of the Performance Monitoring Counter (PMC) hardware events listed in the PMC Hardware Events subsection. Based on the findings of previous work \cite{bellosa2001, bertran2010, bircher2007} we have chosen several key hardware performance events for which the authors indicate a strong correlation with overall performance and energy consumption of the application. In addition, we measure the runtime counts of SIMD instructions and bitwise operations using the Intel Pin binary instrumentation framework. Based on these data we gain further insight into XML parser execution characteristics and compare and contrast each of the Parabix parser versions against the performance of standard industry parsers. The foundational work by Bellosa in \cite{bellosa2001} as well as more recent work in \cite{bircher2007, bertran2010} demonstrate that hardware-usage patterns have a significant impact on the energy consumption characteristics of an application \cite{bellosa2001, bircher2007, bertran2010}. Further, the authors demonstrate a strong correlation between specific PMC events and energy usage. However, each author differs slightly in their opinion of the exact set of PMCs to use. The following subsections describe the XML parsers under study, XML workloads, the hardware architectures, PMC hardware events selected for measurement, and the energy measurement instrumentation set up. 
We analyze the performance of each of the XML parsers under study based on PMC hardware event counts and contrast their energy consumption measurements based on direct measurements. In our evaluation we compare Parabix against two widely available software parsers: the Xerces-C++ and Expat XML parsers. Parabix is our open-sourced XML parser that leverages Parallel Bit Stream technology and the SIMD capabilities of modern commodity processors.  Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a validating open source XML parser written in C++ available as part of the Apache project. Expat version 2.0.1 \cite{expat} is a non-validating XML parser library written in C. \subsection{Parsers}\label{parsers} \paragraph{XML Workloads}\label{workloads} XML is used for a variety of purposes ranging from databases to config files in mobile phones. A key feature of these XML files that affects the overall parsing performance is the \textit{Markup density}. \textit{Markup density} is defined as the ratio of the total markup contained within an XML file to the total XML document size.  This metric has substantial influence on the performance of traditional recursive descent XML parser implementations.  We use a mixture of document-oriented and data-oriented XML files in our study to provide workloads with a full spectrum of markup densities. The XML parsing technologies selected for this study are the Parabix1, Parabix2, Xerces-C++, and Expat XML parsers. Parabix1 (Parallel Bit Streams for XML) is our first generation SIMD and Parallel Bit Stream technology based XML parser \cite{Parabix1}.  Parabix1 leverages the processor built-in {\em bitscan} operation for high-performance XML character scanning as well as the SIMD capabilities of modern commodity processors to achieve high performance.  Parabix2 \cite{parabix2} represents the second generation of the Parabix1 parser. 
Parabix2 is an open-source XML parser that also leverages Parallel Bit Stream technology and the SIMD capabilities of modern commodity processors. However, Parabix2 differs from Parabix1 in that it employs new parallelization techniques, such as a multiple cursor approach to parallel parsing together with bit stream addition techniques to advance multiple cursors independently and in parallel. Parabix2 delivers dramatic performance improvements over traditional byte-at-a-time parsing technology.  Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a validating open source XML parser written in C++ by the Apache project.  Expat version 2.0.1 \cite{expat} is a non-validating XML parser library written in C. Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for this performance study.  The jawiki.xml and dewiki.xml XML files represent document-oriented XML inputs and contain the three-byte and four-byte UTF-8 sequences required for the UTF-8 encoding of Japanese and German characters respectively.  The remaining data files are data-oriented XML documents and consist entirely of single byte $7$-bit encoded ASCII characters. \begin{table*} \end{table*} \subsection{Workloads}\label{workloads} Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.  This metric has substantial influence on the performance of traditional recursive descent XML parser implementations.  We use a mixture of document-oriented and data-oriented XML files in our study to provide workloads with a full spectrum of markup densities. \paragraph{Platform Hardware} SSE extensions have been available on commodity Intel processors for over a decade since the Pentium III. They have steadily evolved with improvements in instruction latency, cache interface, and register resources, and the addition of domain-specific instructions. 
Here we investigate SIMD extensions across three different generations of Intel processors. Table \ref{hwinfo} describes the Intel multicores we investigate. We compare the energy and performance profile of Parabix across these platforms.  We also analyze the implementation specifics of SIMD extensions under the various microarchitectures. We evaluate both the legacy SSE and newer AVX extensions supported by Sandybridge. Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for this performance study.  The jawiki.xml and dewiki.xml XML files represent document-oriented XML inputs and contain the three-byte and four-byte UTF-8 sequences required for the UTF-8 encoding of Japanese and German characters respectively.  The remaining data files are data-oriented XML documents and consist entirely of single byte $7$-bit encoded ASCII characters. We propose to investigate the execution profile of each of the XML parsers using the Performance Monitoring Counter (PMC) hardware events found in the processor. We have chosen several key hardware performance events which provide insight into the profile of our application and indicate if the processor is doing useful work~\cite{bellosa2001, bertran2010}.  The set of performance counters included in our study are Branch instructions, Branch mispredictions, Integer instructions, SIMD instructions, and Cache misses. In addition, we characterize the SIMD operations and study the type and class of SIMD operations using the Intel Pin binary instrumentation framework. \subsection{Platform Hardware} \paragraph{Intel \CO{}} Intel \CO{} processor, code name Conroe, produced by Intel. Table \ref{core2info} gives the hardware description of the Intel \CO{} machine. \begin{table*}[h] \end{tabular} \caption{Platform Hardware Specs} \label{hwinfo} \end{table*} Intel \CITHREE\ processor, code name Nehalem, produced by Intel. 
The intent of the selection of this processor is to serve as an example of a low end server processor. Table \ref{i3info} gives the hardware description of the Intel \CITHREE\ machine. Intel \CIFIVE\  processor, code name \SB\, produced by Intel. Table \ref{sandybridgeinfo} gives the hardware description of the Intel \CIFIVE\ machine. Each of the hardware events selected relates to performance and energy features associated with one or more hardware units.  For example, total branch mispredictions relate to the branch predictor and branch target buffer capacity. The set of PMC events included in this study is as follows: Processor Cycles, Branch Instructions, Branch Mispredictions, Integer Instructions, SIMD Instructions and Cache Misses. \subsection{Energy Measurement} We measure energy consumption using the Fluke i410 current clamp applied on the 12V wires that supply power to the processor sockets. The clamp detects the magnetic field created by the flowing current and converts it into voltage levels (1mV per 1A current). The voltage levels are then monitored by an Agilent 34410a multimeter at the granularity of 100 samples per second. This measurement captures the power to the processor package, including cores, caches, Northbridge memory controller, and the quick-path interconnects \cite{clamp}. \paragraph{Energy Measurement} A key benefit of the Parabix parser is its more efficient use of the processor pipeline, which is reflected in the overall energy usage.  We measure the energy consumption of the processor directly using a current clamp. We apply the Fluke i410 current clamp \cite{clamp} to the 12V wires that supply power to the processor sockets. The clamp detects the magnetic field created by the flowing current and converts it into voltage levels (1mV per 1A current). The voltage levels are then monitored by an Agilent 34410a digital multimeter at the granularity of 100 samples per second. 
This measurement captures the instantaneous power to the processor package, including cores, caches, northbridge memory controller, and the quick-path interconnects. We obtain samples throughout the entire execution of the program and then calculate overall total energy as  $12\,\mathrm{V} \times \sum_{i=1}^{N_{samples}} Sample_i$.
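The energy calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' measurement script: the function name, parameter names, and the sample readings are hypothetical. It also assumes that each sample is multiplied by the sampling interval (10 ms at 100 samples per second) to obtain joules, a step the formula in the text leaves implicit.

```python
def total_energy_joules(samples_mv, supply_volts=12.0,
                        sample_rate_hz=100.0, mv_per_amp=1.0):
    """Estimate energy from current-clamp readings.

    samples_mv     -- multimeter readings in millivolts (hypothetical input)
    supply_volts   -- rail voltage of the clamped wires (12 V in the text)
    sample_rate_hz -- multimeter sampling rate (100 samples/s in the text)
    mv_per_amp     -- clamp sensitivity (1 mV per 1 A in the text)
    """
    dt = 1.0 / sample_rate_hz                      # assumed: seconds per sample
    currents = [mv / mv_per_amp for mv in samples_mv]  # convert mV to amperes
    # Energy = V * sum(I_i) * dt, i.e. a rectangle-rule integral of power.
    return supply_volts * sum(currents) * dt


if __name__ == "__main__":
    # Made-up readings for illustration only; not measured data.
    readings_mv = [5.0, 6.0, 7.0, 6.0]
    print(f"{total_energy_joules(readings_mv):.2f} J")
```

Keeping the sampling rate and clamp sensitivity as parameters makes the unit conversions explicit, so the same function applies if the multimeter is configured differently.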
 r1361 \section{Scaling Parabix2 for AVX} \section{Scaling Parabix for AVX} \label{section:avx} In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) Parabix2 port. Parabix2 originally targeted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors but has recently been ported to AVX. AVX technology is commercially available on the latest \SB\ microarchitecture Intel processors. In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) Parabix XML port.  The Parabix SIMD libraries originally targeted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors but have recently been ported to AVX. AVX technology is commercially available on the latest \SB\ microarchitecture Intel processors. While we had to port our runtime framework, the application did not need to be modified. \begin{figure*} \includegraphics[height=0.25\textheight]{plots/InsMix.pdf} \end{center} \caption{Parabix2 Instruction Counts (y-axis: Instructions per kB)} \caption{Parabix Instruction Counts (y-axis: Instructions per kB)} \label{insmix} \end{figure*} \includegraphics[width=0.5\textwidth]{plots/avx.pdf} \end{center} \caption{Parabix2 Performance (y-axis: ns per kB)} \caption{Parabix Performance (y-axis: ns per kB)} \label{avx} \end{figure} \subsection{Three Operand Form} \paragraph{3-Operand Form} In addition to the widening of 128-bit operations to 256-bit operations, AVX technology uses a nondestructive 3-operand instruction format. Previous SSE implementations used a destructive 2-operand instruction format. In the 2-operand format a single register is used as both a source and destination register. For example, $a = a~\texttt{[op]}~b$.  
As such, 2-operand instructions that require the value of both $a$ and $b$, must either copy an additional register value beforehand, or reconstitute or reload a register value afterwards to recover the value.  With the 3-operand format, output may now be directed to the third register independently of the source operands. For example, $c = a~\texttt{[op]}~b$.  By avoiding the copying or reconstituting of operand values, a considerable reduction in instructions required for unloading from and loading into registers.  AVX technology makes available the 3-operand form for both the new 256-bit operations as well as the base 128-bit SSE operations. In addition to the widening of 128-bit operations to 256-bit operations, AVX technology uses a nondestructive 3-operand instruction format. Previous SSE implementations used a destructive 2-operand instruction format. In the 2-operand format a single register is used as both a source and destination register. For example, $a = a~\texttt{[op]}~b$. As such, 2-operand instructions that require the value of both $a$ and $b$, must either copy an additional register value beforehand, or reconstitute or reload a register value afterwards to recover the value. With the 3-operand format, output may now be directed to the third register independently of the source operands. For example, $c = a~\texttt{[op]}~b$. By avoiding the copying or reconstituting of operand values, a considerable reduction in instruction count in the form of reduced load and store instructions is possible. AVX technology makes available the 3-operand form for both the new 256-bit operations as well as the base 128-bit SSE operations. \subsection{256-bit AVX Operations} \subsection{256-bit Operations} With the introduction of 256-bit SIMD registers, and under ideal conditions, one would anticipate a corresponding 50\% reduction in the SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX SIMD instruction count of Parabix on AVX.  
However, in the \SB\ AVX implementation, Intel has focused primarily on floating point operations as opposed to the integer based operations.  256-bit SIMD is available for loads, stores, bitwise logic and floating operations, whereas SIMD integer operations and shifts are only available in the 128-bit form.  Nevertheless, with loads, stores and bitwise logic comprising a major portion of the Parabix2 SIMD instruction mix, a substantial reduction in instruction count and consequent performance improvement was anticipated but not achieved. 128-bit form. \subsection{Performance Results} We implemented two versions of Parabix2 using AVX technology.  The first was simply the recompilation of the existing Parabix2 source code written to take advantage of the 3-operand form of AVX instructions while retaining a uniform 128-bit SIMD processing width.  The second involved rewriting the core library functions of Parabix2 to leverage the 256-bit AVX operations wherever possible and to simulate the remaining operations using pairs of 128-bit operations. Figure \ref{insmix} shows the reduction in instruction counts achieved in these two versions.   For each workload, the base instruction count of the Parabix2 binary compiled in SSE-only mode is shown with the caption sse,'' the version obtained by simple recompilation with AVX-mode enabled is labeled 128-bit avx,'' and the version reimplemented to use 256-bit operations wherever possible is labelled 256-bit avx.''    The instruction counts are divided into three classes.  The non-SIMD'' operations are the general purpose instructions that use neither SSE nor AVX technology.   The bitwise SIMD'' class comprises the bitwise logic operations, that are available in both 128-bit form and 256-bit form.  The other SIMD'' class comprises all other SIMD operations, primarily comprising the integer SIMD operations that are available only at 128-bit widths even with 256-bit AVX technology. 
We implemented two versions of Parabix using AVX technology.  The first was simply the recompilation of the existing Parabix source code written to take advantage of the 3-operand form of AVX instructions while retaining a uniform 128-bit SIMD processing width.  The second involved rewriting the internal library functions of Parabix to leverage the 256-bit AVX operations wherever possible and to simulate the remaining operations using pairs of 128-bit operations. Figure \ref{insmix} shows the reduction in instruction counts achieved in these two versions.  For each workload, the base instruction count of the Parabix binary compiled in SSE-only mode is indicated by ``sse,'' the version which only takes advantage of the AVX 3-operand mode is labeled ``128-bit avx,'' and the version reimplemented to use 256-bit operations wherever possible is labelled ``256-bit avx.''  The instruction counts are divided into three classes: ``non-SIMD'' operations are the general purpose instructions.  The ``bitwise SIMD'' class comprises the bitwise logic operations, that are available in both 128-bit form and 256-bit form.  The ``other SIMD'' class comprises all other SIMD operations, primarily comprising the integer SIMD operations that are available only at 128-bit widths even under AVX. Note that, in each workload, the number of non-SIMD instructions remains relatively constant with each workload.  As may be expected, however, the number of ``bitwise SIMD'' operations remains the same remains relatively constant with each workload.  As may be expected the number of \textit{bit-parallel SIMD} operations remains the same for both SSE and 128-bit while dropping dramatically when operating 256-bits at a time.  Ideally one may expect up to a 50\% reduction in these instructions versus the 128-bit AVX.  The actual reduction measured was 32\%--39\% depending on workload.  
Because some bitwise logic is needed in the implementation of simulated 256-bit operations, the full 50\% reduction in bitwise logic was not achieved. 256-bits at a time.  The reduction measured was 32\%--39\% depending on workload because some bitwise logic needed in the implementation is composed of 128-bit operations. This limits the performance gains achieved when using the AVX instructions.  The ``other SIMD'' class shows a substantial 30\%--35\% reduction with AVX 128-bit technology compared to SSE.  This reduction is due to elimination of register unloading and reloading when SIMD operations are compiled using 3-operand AVX form versus 2-operand SSE form.  A further 10\%--20\% reduction is observed with the Parabix version rewritten to use 256-bit operations. The ``other SIMD'' class shows a substantial 30\%--35\% reduction with AVX 128-bit technology compared to SSE.  This reduction is due to eliminated copies or reloads when SIMD operations are compiled using 3-operand AVX form versus 2-operand SSE form. A further 10\%--20\% reduction is observed with the Parabix2 version rewritten to use 256-bit operations. While the successive reductions in SIMD instruction counts are quite dramatic with the two AVX implementations of Parabix2, the performance benefits are another story.  As shown in Figure \ref{avx}, the benefits of the reduced SIMD instruction count are achieved only in the AVX 128-bit version.  In this case, the benefits of 3-operand form seem to fully translate to performance benefits.  Based on the reduction of overall Bitwise-SIMD instructions we expected an 11\% improvement in performance.  Instead, perhaps bizarrely, the performance of Parabix2 in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD instructions reduced by 39\%, saw a performance improvement of 8\%. 
We believe that this is primarily due to the intricacies of the first generation AVX implementation in \SB{}, with significant latency in many of the 256-bit instructions in comparison to their 128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput.  If these latency issues can be addressed in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix2. %[AS] Check numbers. The reductions in instruction counts are quite dramatic with the AVX extensions in Parabix, demonstrating the ability of our runtime framework to exploit the available hardware resources. As shown in Figure \ref{avx}, the benefits of the reduced SIMD instruction count are achieved only in the AVX 128-bit version.  In this case, the benefits of 3-operand form seem to fully translate to performance benefits.  Based on the reduction of overall Bitwise-SIMD instructions we expected an 11\% improvement in performance.  Instead, perhaps bizarrely, the performance of Parabix in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density (average 11\%). Dewiki.xml, on which bitwise-SIMD instructions reduced by 39\%, saw a performance improvement of 8\%.  We believe that this is primarily due to the intricacies of the first generation AVX implementation in \SB{}, with significant latency in many of the 256-bit instructions in comparison to their 128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall throughput.  If these latency issues can be addressed in future AVX implementations, further substantial performance and energy benefits could be realized in XML parsing with Parabix.